C-MELT: Contrastive Enhanced Masked Auto-Encoders for ECG-Language Pre-Training
Abstract
Reviews and Discussion
This work proposes a multimodal ECG learning framework with multiple alignment objectives and evaluates it across various downstream tasks.
Strengths
Many empirical studies encompassing a wide range of datasets and methods.
Weaknesses
- Lack of novelty: MEM, ETM, and MLM losses are very similar to the BLIP [1] work, which was proposed in 2022. This work directly reimplements the BLIP loss in the ECG domain, which is not novel enough.
- Ambiguous results: In Table 3, C-MELT claims that the zero-shot classification performance across all downstream datasets is 77.71, which is higher than MERL [2]. However, in the ablation results Table 7, the average zero-shot result is 72.5±9.1, which is not consistent with the author-reported performance. Additionally, in MERL [2]'s original paper, the reported average zero-shot performance is 75.24±1.7 in Tables 5-9, which is much higher than C-MELT's performance.
- The reproducibility concern is made worse by the fact the authors don't appear willing to share their code.
[1] Li, Junnan, et al. "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation." International conference on machine learning. PMLR, 2022. [2] Liu, Che, et al. "Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement." Forty-first International Conference on Machine Learning.
Questions
- Can the authors clarify the main differences between C-MELT and BLIP? Even though the data is different (ECG vs. Image), the framework and optimization objectives appear too similar.
- How do the authors explain the discrepancy between the reported results in Table 3 and Table 7?
Thank you for your review and your time. We believe there might be some misunderstanding or confusion, which we hope to clarify below:
[R4-C1]: Can the authors clarify the main differences between C-MELT and BLIP? Even though the data is different (ECG vs. Image), the framework and optimization objectives appear too similar.
While we agree certain points are shared between the two methods, we would like to highlight notable differences in our (C-MELT) hybrid self-supervised learning approach for ECG-Language representation learning:
- First, our ECG encoder leverages transformer layers specifically optimized for time-series signal processing, while the clinical text is processed using the state-of-the-art pre-trained Flan-T5 text encoder. Our ablation studies validated their effectiveness and also demonstrated that Flan-T5 surpasses BERT (BLIP's text encoder) in capturing clinical ECG text representations.
- Second, BLIP does not support masked Image/ECG modeling in a generative manner. In contrast, our method incorporates MEM, which has been proven effective in existing ECG SSL [2,3,4] for capturing nuanced and detailed representations of ECG signals.
- Third, our new components with the Siglep loss work effectively alongside the existing losses in a masked autoencoder-based model, specific to the ECG domain. We further address the inherent limitations of MIMIC-IV ECG regarding data sparsity (Lines 748-755 in our original submission) by introducing N3S in the Flan-T5 feature space using FAISS. BLIP does not consider this for its ITC loss.
- Finally, our work utilizes an off-the-shelf LLM (i.e., GPT-4o) to enrich clinical context from category names (e.g., ECG diagnoses), particularly boosting zero-shot performance in clinical evaluation. BLIP, in contrast, focuses on image-language tasks.
[2] Na, Yeongyeon, et al. "Guiding Masked Representation Learning to Capture Spatio-Temporal Relationship of Electrocardiogram." arXiv preprint arXiv:2402.09450 (2024).
[3] Hu, Rui, Jie Chen, and Li Zhou. "Spatiotemporal self-supervised representation learning from multi-lead ECG signals." Biomedical Signal Processing and Control 84 (2023): 104772.
[4] Zhang, Huaicheng, et al. "Maefe: Masked autoencoders family of electrocardiogram for self-supervised pretraining and transfer learning." IEEE Transactions on Instrumentation and Measurement 72 (2022): 1-15.
[R4-C2]: How do the authors explain the discrepancy between the reported results in Table 3 and Table 7? In Table 3, C-MELT claims that the zero-shot classification performance across all downstream datasets is 77.71, which is higher than MERL [2]. However, in the ablation results Table 7, the average zero-shot result is 72.5±9.1, which is not consistent with the author-reported performance. Additionally, in MERL [2]'s original paper, the reported average zero-shot performance is 75.24±1.7 in Tables 5-9, which is much higher than C-MELT's performance.
The results in Table 3 and Table 7 are derived from two different settings, chosen to maintain consistency with MERL for a fair comparison. Specifically:
- For Table 3, we incorporated GPT-4o to obtain richer clinical context from category names (Lines 370-372 in our original submission), which reflects the optimal performance C-MELT can achieve.
- For Table 7 (ablation study), we specifically chose to use raw category names without GPT support (Line 427 in our original submission) to evaluate the true robustness of the method itself. Using GPT could raise concerns about whether improvements are due to the testing components or GPT-4o. Therefore, this setting naturally leads to lower performance but ensures a fair evaluation in the context of our ablation study.
MERL also conducted experiments under the same two settings. With GPT-4 support, MERL achieved ~67 ("Zero-shot MERL (GPT4 Generated)" in Figure 1 of MERL), which is lower than C-MELT's ~77.1. In the raw category-name setting, MERL achieved ~62 (Figure 1), compared to 72.5 in our work.
Note that the Zero-shot MERL (CKEPE) result (~75.3 in Figure 1 and Tables 5-9) was achieved with an additional database, which requires searching for extra attributes and sub-types of each category name. We therefore did not directly compare with this setting, as further explained in our ablation study (Lines 809-816 in our original submission).
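For reference, the following is a minimal sketch of the zero-shot scoring step shared by both settings; the only difference is whether each class prompt is the raw category name (Table 7 setting) or a GPT-4o-enriched description prepared offline (Table 3 setting). The helper names `ecg_encoder`/`embed_texts` and the pooling details are illustrative assumptions, not our exact code.

```python
import torch.nn.functional as F

def zero_shot_scores(ecg_signals, class_prompts, ecg_encoder, embed_texts):
    # class_prompts: raw category names, or pre-generated GPT-4o-enriched descriptions
    # embed_texts: assumed helper mapping a list of strings to a (C, D) tensor
    ecg_emb = F.normalize(ecg_encoder(ecg_signals), dim=-1)    # (B, D) ECG embeddings
    txt_emb = F.normalize(embed_texts(class_prompts), dim=-1)  # (C, D) class-prompt embeddings
    return ecg_emb @ txt_emb.T                                 # (B, C) similarity scores, e.g. for AUC
```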
[R4-C3]: The reproducibility concern is made worse by the fact the authors don't appear willing to share their code.
We stated that our code and pre-trained models would be made public upon acceptance (Lines 489-490 in our original submission).
Lastly, we appreciate your paper reference and would like to add it to the related works section in our revision (Line 54). We kindly note that C-MELT is the first work to utilize a hybrid SSL technique for ECG-Language pretraining, with performance surpassing all existing works.
We kindly inquire if you have further questions or require clarification. In our previous response, we detailed key differences between BLIP and our work, clarified misinterpretations regarding evaluation and implementation, and revised the manuscript accordingly. We respectfully invite you to reconsider your assessment based on these updates. Thank you for your time.
Thank you for the detailed rebuttal.
However, my concerns remain for the following reasons:
- The framework is still quite similar to BLIP. Although the authors claim to use an ECG-specific transformer block, processing ECG data with transformers is not a novel idea and has been implemented in prior works several years ago.
- Flan-T5 is not the state-of-the-art (SOTA) text encoder, especially for medical text, as it is not pretrained on a medical corpus.
- Using GPT to enhance prompt quality cannot be attributed as a major contribution since it heavily relies on the capabilities of the large language model (LLM) rather than the proposed framework itself.
Hence, I will maintain my score.
Thank you for your response. We would like to address your points as follows:
The framework is still quite similar to BLIP. Although the authors claim to use an ECG-specific transformer block, processing ECG data with transformers is not a novel idea and has been implemented in prior works several years ago.
We respectfully disagree that our work is “quite similar” to BLIP. The loss functions, the way our method trains the proposed ECG-Language model, and the modalities involved are used in entirely different ways. Could you please point out further similarities between BLIP and our work?
Flan-T5 is not the state-of-the-art (SOTA) text encoder, especially for medical text, as it is not pretrained on a medical corpus.
As shown in Table 7, we conducted an extensive ablation study on different text encoders. This demonstrates that the recent advancements in Flan-T5 surpass both BERT (used in BLIP) and MedCPT [5] (which is pre-trained on biomedical data and used in MERL as a SOTA model). In addition, our framework remains flexible: the text encoder can easily be replaced to accommodate future developments and obtain even better performance, as illustrated below.
[5] Jin, Qiao, et al. "MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval." Bioinformatics 39.11 (2023): btad651.
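As an illustration of this flexibility, swapping the text encoder essentially amounts to loading a different pre-trained checkpoint, e.g. via the Hugging Face transformers library. This is only a sketch: the checkpoint size and the mean pooling used here are assumptions, not our exact configuration.

```python
from transformers import AutoTokenizer, T5EncoderModel

# Any encoder checkpoint can be substituted here (e.g., a future, stronger text model).
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-base")

tokens = tokenizer(["sinus rhythm with first degree av block"], return_tensors="pt", padding=True)
report_emb = text_encoder(**tokens).last_hidden_state.mean(dim=1)  # simple mean-pooled report embedding
```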
Using GPT to enhance prompt quality cannot be attributed as a major contribution since it heavily relies on the capabilities of the large language model (LLM) rather than the proposed framework itself.
First, we would like to emphasize again that using GPT is one design choice we adopted to enhance zero-shot capability and to compare fairly against MERL in the same setting (note that both with and without GPT, we surpass MERL). Second, for linear probing and full fine-tuning scenarios, GPT is not used at all, and our method shows strong performance over existing benchmarks.
Finally, we appreciate your review. Given the responses above and the demonstrated robustness and generalization of our method, which outperforms existing approaches across five datasets covering over 100 cardiac conditions in the ECG-healthcare domain, we kindly request a reevaluation of our work. Thank you again for your time, and we look forward to your reconsideration.
The present manuscript proposes multiple interesting ideas for enhancing ECG-language pre-training by combining contrastive learning and reconstruction. By doing so, considerable improvements over unimodal pretraining and a recent multimodal approach are shown on multiple downstream datasets.
Strengths
Strengths of the manuscript include the introduction of multiple novel ideas, in particular the combination of contrastive and generative learning objectives. Furthermore, strong results are obtained across datasets. The authors clearly explain the various parts of the approach.
Weaknesses
The manuscript could benefit from a clearer rationale behind the choice of methods, as theoretical justifications are somewhat limited. While several ideas are presented, a more thorough exploration of the reasoning behind each would strengthen the approach. Additionally, although some ablation analyses have been performed, it would be helpful to clarify which components (e.g., reconstruction vs. no reconstruction) play an essential role in the downstream tasks.
Questions
For the experiments:
- Whereas the proposition to combine reconstruction and contrastive learning is very worthwhile, to my understanding this matter is not directly addressed in the paper. While the Siglep loss is ablated, I believe in this case the model still is not trained solely via reconstruction, as ECG-Text Matching (ETM) is active. I believe the manuscript would be considerably improved if results could be shown under a setting of ‘reconstruction only’ and ‘without reconstruction’. Ideally, under the ‘reconstruction only’ setting, additional information is provided about the utility of text or ECG reconstruction. In the ‘without reconstruction’ setting, it would be interesting to ablate either the Siglep loss and/or ETM. The authors are free to provide alternative analyses than specifically those considered here, but again, gaining an understanding of the contribution of the individual parts of the approach would be important.
- Why did the authors choose not to compare against other methods for ECG-text pretraining such as the cited Lalam et al. 2023 or Liu et al. 2023 (https://arxiv.org/pdf/2309.07145)?
In the text, the following sections could improve with clearer justifications of the authors work:
- Lines 60-68: Here the authors propose to combine contrastive and generative approaches to leverage their complementary strengths, but it is unclear what these complementary strengths are expected to be. As no clear argumentation is provided why these approaches should be combined, there is no framework in place to later interpret the results. I would therefore recommend to provide a more precise description of the working hypothesis.
- Lines 240-243: “This can hinder the model’s capability to learn discriminative features (…)”. What is being referred to here? The fact that generative approaches perform reconstruction and contrastive learning distinguishes between data pairs is, to my understanding, precisely what causes them to learn discriminative features to the extent that they do. Next, it is suspected that ETM could serve as a contrastive loss but that it may be insufficient as it performs binary classification with fused features. As these characteristics of ETM were introduced with little justification, one is left wondering why the Siglep loss and ETM are both needed. Specifically, the hypothesis that alignment of both unfused and fused features is beneficial is currently not answered in the paper, as ablations do not seem to include ETM.
- Lines 274-275: I would like to ask why the authors claim nearest-neighbour negative sampling makes negative samples challenging? Does the approach not make negative samples easier by selecting less similar reports as negatives? (And as reports with high cosine distance are selected, is this not opposite to the principle of ‘nearest-neighbours’?)
Thank you for your thoughtful feedback and for highlighting where our manuscript can be strengthened! Please find our responses to all your points below:
[R3-C1]: Whereas the proposition to combine reconstruction and contrastive learning is very worthwhile, to my understanding this matter is not directly addressed in the paper. While the Siglep loss is ablated, I believe in this case the model still is not trained solely via reconstruction, as ECG-Text Matching (ETM) is active. I believe the manuscript would be considerably improved if results could be shown under a setting of ‘reconstruction only’ and ‘without reconstruction’. Ideally, under the ‘reconstruction only’ setting, additional information is provided about the utility of text or ECG reconstruction. In the ‘without reconstruction’ setting, it would be interesting to ablate either the Siglep loss and/or ETM. The authors are free to provide alternative analyses than specifically those considered here, but again, gaining an understanding of the contribution of the individual parts of the approach would be important.
We appreciate the time you took to provide these detailed comments. We agree that additional results under the "reconstruction only" and "without reconstruction" settings would further clarify the contribution of individual components. Please refer to the table below.
Regarding the "without reconstruction" suggestion, we believe this is aligned with [R2-C1] on the impact of the reconstruction aspect on overall performance. Specifically, in our supporting experiment (table below), incorporating MLM and MEM noticeably improves performance on most evaluated datasets. In particular, gains are observed in PTBXL-Super (+5.9%) and CODE-Test (+2.2%), demonstrating that the reconstruction tasks help the model learn stronger representations, in line with our motivation.
| Setting | PTBXL-Super | PTBXL-Form | CSN | CODE-Test |
|---|---|---|---|---|
| w/o MLM + MEM | 70.3 | 67.4 | 74.5 | 94.6 |
| w MLM + MEM | 76.2 | 66.1 | 76.3 | 96.8 |
Regarding the “reconstruction only” setting, we haven’t reported the case where we eliminated both ETM and Siglep modeling. However, when we did our ablation study (Table 6, Row 4 - Without N3S and Siglep), the overall performance already decreased noticeably. Note that zero-shot results are not reported when Siglep is not activated (as described in Lines 742-747 in our original submission).
We hope these results with and without masked modeling (i.e., MLM and MEM) address your question.
[R3-C2]: Why did the authors choose not to compare against other methods for ECG-text pretraining such as the cited Lalam et al. 2023 or Liu et al. 2023 (https://arxiv.org/pdf/2309.07145)?
While these works also provide solutions for ECG-text pretraining, there are valid reasons why we do not directly compare against their results. Specifically:
- Lalam et al. (2023) rely heavily on a large private dataset, which limits reproducibility and comparability with other methods. Additionally, the lack of benchmarking on diverse diagnoses, along with limited exploration of generalization and adaptability (e.g., a zero-shot setting), makes the method challenging to assess and compare in a broader context.
- Liu et al. (2023) utilized relatively small datasets for both pretraining and downstream evaluations, which limits insights into their generalizability. Additionally, their pretraining and reported evaluations rely on the same datasets (e.g., PTB-XL and CPSC2018), even in zero-shot tasks, which raises a validity concern about their reported performance. Moreover, when comparing results directly, C-MELT significantly outperforms their method, achieving 76.2 and 80.1 versus their 54.6 and 57.1 (zero-shot AUC on PTB-XL-Super and CPSC2018, respectively).
[R3-C3]: Lines 60-68: Here the authors propose to combine contrastive and generative approaches to leverage their complementary strengths, but it is unclear what these complementary strengths are expected to be. As no clear argumentation is provided why these approaches should be combined, there is no framework in place to later interpret the results. I would therefore recommend to provide a more precise description of the working hypothesis.
While we have mentioned the complementary strengths of contrastive and generative approaches in Lines 141-147 in our original submission, we agree that this may not have been sufficiently explicit. The contrastive approach enhances discriminative alignment between ECG and text, which improves cross-modal understanding. Meanwhile, the generative approach helps capture fine-grained features within each modality by reconstructing missing components, ensuring that the learned representations are robust and detailed. Together, these complementary methods allow our model to excel at both modality-specific and cross-modal tasks. We made slight changes in our revised introduction section for this clarification.
We also emphasize that our empirical results strongly support the effectiveness of combining these approaches, as demonstrated in both the main results and ablation studies. Thank you again for helping us improve the clarity of our manuscript.
[R3-C4]: Lines 240-243: “This can hinder the model’s capability to learn discriminative features (…)”. What is being referred to here? The fact that generative approaches perform reconstruction and contrastive learning distinguishes between data pairs is, to my understanding, precisely what causes them to learn discriminative features to the extent that they do. Next, it is suspected that ETM could serve as a contrastive loss but that it may be insufficient as it performs binary classification with fused features. As these characteristics of ETM were introduced with little justification, one is left wondering why the Siglep loss and ETM are both needed. Specifically, the hypothesis that alignment of both unfused and fused features is beneficial is currently not answered in the paper, as ablations do not seem to include ETM.
We agree there may have been confusion here, and we clarified this in our revised manuscript (Lines 238-239). Specifically, we intended to highlight that MAE-based models are often more biased toward generative self-supervised learning, which limits their capacity to perform contrastive tasks effectively, such as supporting zero-shot inference.
Regarding the ETM loss: it aligns ECG and text pairs at the fused feature level but does not directly enhance the discriminative power of the individual encoders, which is addressed by Siglep. However, additional results confirm ETM's complementary role in guiding fused feature space learning, supporting the generative components in our hybrid approach. We highlight it in context with [R1-C25], [R2-C3], and [R2-C4]:
| Setting | PTBXL-Super | PTBXL-Form | CSN | CODE-Test |
|---|---|---|---|---|
| w/o ETM | 73.2 | 65.8 | 76.6 | 96.2 |
| w ETM | 76.2 | 66.1 | 76.3 | 96.8 |
As can be seen, removing ETM slightly decreases performance across most datasets, particularly in PTBXL-Super (76.2 to 73.2).
[R3-C5]: Lines 274-275: I would like to ask why the authors claim nearest-neighbour negative sampling makes negative samples challenging? Does the approach not make negative samples easier by selecting less similar reports as negatives? (And as reports with high cosine distance are selected, is this not opposite to the principle of ‘nearest-neighbours’?)
Thank you for raising this point. We would like to clarify it as follows, together with [R1-C18] and [R1-C23] from Reviewer xh2S:
- First, for each ECG text report in half of a batch, we look for the top 64 text reports in the training set with the largest cosine distances (using features in the small Flan-T5 space). Note that these features are pre-computed and indexed with FAISS before training.
- Second, to keep the training process challenging as batches are dynamically updated, we randomly choose one of those 64 samples to replace the current text report (see the sketch below).
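For concreteness, below is a minimal sketch of this retrieval step with FAISS. Variable names and the feature file path are hypothetical; in our pipeline the Flan-T5 report features are pre-computed offline before training.

```python
import faiss
import numpy as np

# text_feats: (N, D) float32, L2-normalized Flan-T5 features of all training reports (pre-computed)
text_feats = np.load("report_feats.npy").astype("float32")  # hypothetical path
index = faiss.IndexFlatIP(text_feats.shape[1])              # inner product == cosine sim for unit vectors
index.add(text_feats)

def sample_distant_report(query_feat, k=64, rng=np.random.default_rng(0)):
    # Searching the nearest neighbours of the *negated* query returns the k reports
    # with the smallest cosine similarity, i.e. the largest cosine distance.
    _, ids = index.search(-query_feat[None, :].astype("float32"), k)
    return int(rng.choice(ids[0]))                           # pick one of the 64 most distant reports
```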
Thanks to the authors for answering my questions and giving more detail on the with and without masked modeling settings. I have already adapted my score!
This paper introduces C-MELT, a cross-modal pre-training framework for self-supervised learning of ECG signals and textual reports. C-MELT combines the strengths of generative and contrastive learning through masked language and ECG modeling to reconstruct data. It also leverages contrastive loss functions and a nearest-neighbor negative sampling strategy to improve cross-modal alignment. Extensive experiments across multiple public datasets demonstrate that C-MELT significantly outperforms existing methods, especially in downstream tasks like zero-shot learning and linear probing, showcasing strong generalization capabilities and diagnostic potential.
Strengths
- Overall, this paper makes clear contributions, especially the comparative interpretation with MERL, which further promotes the development of multimodal ECG research.
- N3S provides an effective strategy for selecting negative text reports in medical datasets.
Weaknesses
- The ablation experiments do not discuss the role of the different components of the reconstruction method. Contrastive learning is currently the main approach for multimodal alignment in ECG research. This paper proposes a hybrid method to enhance performance, but it does not ablate the added reconstruction components in more detail, which raises doubts about their effectiveness.
- The layout of Table 2 is somewhat inadequate, as it differs significantly from the original table in MERL, leading to some confusion when reading. Given that this table may serve as a benchmark in the future, it should be adjusted for consistency to facilitate easier comparison.
Questions
- Please provide more details on the performance improvement achieved by the reconstruction approach.
- In Section 3.1, positional encoding is added at the end. How is sequence information ensured during the calculation of multi-head attention according to the content of the article?
- If ECG-Text Matching (ETM) promotes the alignment of corresponding ECG-Text pairs, why does it not enhance the discriminative ability of the encoders?
- Conversely, if Siglep Loss can accomplish contrastive learning for multimodal alignment, what is the purpose of retaining ETM?
Details of Ethics Concerns
none
We appreciate your review and understand the concern raised regarding the effectiveness of the added reconstruction method. Regarding the minor point with the layout of Table 2, we adjusted it in our revised manuscript for easier comparison. Please find our answers to the points raised below:
[R2-C1]: Please provide more details on the performance improvement achieved by the reconstruction approach.
This is a great suggestion. We provide additional experiments with and without the reconstruction components (i.e., MLM and MEM), and hopefully this will be sufficient for your evaluation. Please find the additional results below:
| Setting | PTBXL-Super | PTBXL-Form | CSN | CODE-Test |
|---|---|---|---|---|
| w/o MLM + MEM | 70.3 | 67.4 | 74.5 | 94.6 |
| w MLM + MEM | 76.2 | 66.1 | 76.3 | 96.8 |
We can see that incorporating MLM and MEM noticeably improves performance on most evaluated datasets. In particular, gains are observed in PTBXL-Super (+5.9%) and CODE-Test (+2.2%), demonstrating that the reconstruction tasks help the model learn stronger representations, in line with our motivation.
[R2-C2]: In Section 3.1, positional encoding is added at the end. How is sequence information ensured during the calculation of multi-head attention according to the content of the article?
This is indeed an insightful observation! We agree that there was an oversight in our presentation of positional encoding. Our implementation in fact applies convolutional positional encoding before the Transformer encoder layers, ensuring that temporal information is preserved during attention calculations. We corrected this in our revised manuscript (Lines 169-170). Thank you for helping us clarify this detail.
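For illustration, a minimal sketch of convolutional positional encoding applied before the Transformer encoder layers is given below; the module sizes (embedding dimension, kernel size, number of layers) are assumptions rather than our exact configuration.

```python
import torch.nn as nn

class ConvPositionalEncoding(nn.Module):
    """Injects position information via a depth-wise 1D convolution along the time axis."""
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (batch, time, dim)
        pos = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return x + pos                         # positions are encoded before attention is computed

encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
ecg_encoder = nn.Sequential(ConvPositionalEncoding(256), nn.TransformerEncoder(encoder_layer, num_layers=8))
```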
[R2-C3]: If ECG-Text Matching (ETM) promotes the alignment of corresponding ECG-Text pairs, why does it not enhance the discriminative ability of the encoders?
Thank you for pointing this out. First, we would like to clarify that we have not explicitly stated that ETM does "not enhance the discriminative ability of the encoders." In our manuscript (Lines 227-228 in our original submission), we mentioned that ETM is designed to "promote alignment between ECG signals and their corresponding text reports." Additionally, in Appendix Lines 742-747, we discussed that ETM operates as a binary classification task at the fused feature level, rather than directly enhancing the discriminative power of individual modality encoders. ETM focuses on determining whether a specific ECG and text pair match, which is critical for cross-modal alignment. However, it does not directly improve the encoders' ability to distinguish between different ECGs or text reports independently. This limitation arises because ETM optimizes at the level of paired features, not at the modality-specific feature granularity needed for downstream tasks that require high intra-class discrimination.
[R2-C4]: Conversely, if Siglep Loss can accomplish contrastive learning for multimodal alignment, what is the purpose of retaining ETM?
We agree that the Siglep loss has a stronger and more direct impact on discriminative modality representation learning. However, ETM by nature still aligns ECG and text pairs; it primarily supports and guides learning in the fused feature space, together with the generative objectives (MLM and MEM). This aligns with the motivation of our hybrid approach from the outset.
We hope this message finds you well. We kindly ask whether any remaining concerns require further clarification or justification. If you find our work satisfactory given our previous responses, we would greatly appreciate your updated feedback on our submission. Thank you for your time and consideration.
Thank you for the authors' responses. I have read them carefully and have decided to maintain my original score.
The proposed approach aims to integrate generative pre-training with robust cross-modal representation learning. The paper extends the previous MERL[1] methodology developed for multimodal learning on ECG records and associated reports, providing zero-shot classification capabilities. The main contributions are the integration of predicting masked segments for both ECG signal and textual reports, specialized loss functions, and an improved negative sampling strategy for cross-modal alignment. The performance is evaluated for diverse datasets and demonstrates superior performance to existing SSL (Self-supervised learning) approaches.
Strengths
The strength of the model is the ability to provide improved performance compared to MERL[1], achieved through the specialized loss functions.
Weaknesses
The method may not be directly comparable to SSL methodologies as the pretraining is exposed to cardiac diagnostic texts enhancing the performance for related tasks. Performance for novel features, unrelated to the context of the reports (e.g. age and sex of the subject) may be impacted adversely.
Questions
Methodology:
Please clearly list the main contributions of the current work in the context of the previous literature, with an accurate formulation of the specific problem addressed.
Please explicitly state whether the primary contribution is zero-shot classification or learning generalized ECG features, and how your work advances the field.
The introduction mentions the scarcity of labeled data but integrating textual reports cannot be regarded as truly “unlabelled” since converting textual descriptions to labels is a trivial task, looking at the examples provided. Please clarify the definition of "unlabeled data" in this context and discuss how the approach differs from traditional supervised learning using text-derived labels.
Please categorize your work as self-supervised, semi-supervised, or supervised representation learning and explain why it fits that category, given the use of textual reports.
Does the implementation necessitate multi-modal data or can it also incorporate a mix of ECG-only and multi-modal data?
The methodology may also depend on the accuracy and the distribution of the textual reports. Are the reports in the pretraining dataset automatically generated or written by cardiologists?
The addition of “diagnostic data” introduces the possibility of learning data bias. What has been done to ensure generalization?
It would be a good test for generalization to predict concepts not addressed in the textual reports e.g. if age and sex are not mentioned in the reports then a supervised task can be performed for these tasks, as these labels are readily available in most datasets.
How does the proposed approach differ from other hybrid approaches combining generative and contrastive methodologies e.g. [2]?
Is the text encoder frozen or fine-tuned?
The method may not be directly comparable to SSL methodologies as the pretraining is exposed to cardiac diagnostic texts enhancing the performance for related tasks. Performance for novel features, unrelated to the context of the reports (e.g. age and sex of the subject) may be impacted adversely.
Please explain in further detail where the N3S is applied, as there are two cross-modal alignment losses. Is the cross-modal alignment performed with single or multiple negative samples? If N3S avoids similar negative reports, then the use of several positives and negatives in the batch could be incorporated in future work to avoid an expensive search for dissimilar negatives; moreover, looking explicitly for more distant reports is counter-intuitive to the goal of contrastive learning.
What is the difference between SigLEP and ETM losses? Don't they both serve the same purpose? Clearly explain whether the alignment is performed between projections of both modes, joint representation, or the inputs and how the positive and negative pairs are defined.
Does the application also provide the possibility of automatic ECG report generation?
The comparative approaches are not exposed to labels while textual reports may include the labels or similar concepts during pre-training.
Minor suggestions:
Page 1 line 23: Abstract: “achieving 15% and 2% increases.” Please mention the specific context of the percentages mentioned.
Page 2 lines (55-56): Introduction “While some recent efforts (Liu et al., 2024; Lalam et al., 2023; Li et al., 2024)”. Is the Li et al., 2024 repeated or have you missed the correct reference?
Page 2 lines (73-76): “Additionally, we introduce a nearest-neighbor negative sampling … contextually relevant and challenging.” How is the sampling “contextually relevant and challenging”?
Page 2 lines (77-85): No need to explain the test setup in the introduction. Please move to Section 4.1.
Page 3: Siglep is mentioned without explaining the acronym and the cited paper only describes “SigLIP”. Is it Siglep author’s implementation based on “SigLIP” or used in previous literature? If the latter, then please reference the correct source.
Page 4 line 165: Please explain the masking and reconstruction details e.g. is the masking only performed on particular leads similar to RLM in [3] and then the leads are reconstructed? Or is it also applied to segments of the same lead?
Page 4 line (206-215): Is the MEM loss computed in the feature space, not the signal space? If the former then how does the reconstruction loss take into account the quality of the generated signals? There may be data leakage from the features that are input to the decoder as the network may learn trivial reconstruction.
Page 6 line (273-274): “This makes the negative samples to be both challenging and distinct for effective contrastive learning”. The N3S technique for finding negative samples looks specifically for the most distinct reports. How does it make it more “challenging”? Also, the implementation is not very clear. Is the contrastive loss not considering the rest of the batch as negatives?
Page 6 line 313: “Our proposed model is developed based on the fairseq-signals framework in our work.” The meaning of the sentence is not clear and the reference for the fairseq-signals framework is not included. If this is the authors' prior work, then it might be a violation of the double-blind review process, as the fairseq-signals framework is implemented on GitHub.
Page 8 line 429: Have you tested removing the ETM loss?
Page 8 line (429-438): What is the supervised task?
Page 9 line (470): Have you considered using more than eight transformer layers? How does the model size vary for the different architectures?
Tables: The particular metrics used are not specified in the table captions throughout the paper. Please mention the labeled dataset, the training configuration (linear probe or fine-tune), and the score in all table captions.
Tables: Are the comparisons in Tables 1 to 4 based on the authors' implementation of other approaches? If so, were the training hyperparameters of the other works properly optimized for both pre-training and supervised training? If not, do the cited references include the particular evaluation?
Table 5: What is the metric being compared and why is the student seemingly better than a cardio resident? Please verify the scores.
Results: Some of the classes in the PTB-XL dataset have very few samples so for 1% and 10% random training samples there may not be sufficient positive samples in the train split. That can explain why SSL methodology loses performance significantly for labels other than superclasses.
Figures: Please improve the figure captions and describe in more detail what is being shown.
References:
[1] Che Liu, Zhongwei Wan, Cheng Ouyang, Anand Shah, Wenjia Bai, and Rossella Arcucci. Zero-shot ecg classification with multimodal learning and test-time clinical knowledge enhancement. arXiv preprint arXiv:2403.06659, 2024.
[2] Song, J., Jang, J. H., Lee, B. T., Hong, D., Kwon, J. M., & Jo, Y. Y. (2024). Foundation Models for Electrocardiograms. arXiv preprint arXiv:2407.07110.
[3] Jungwoo Oh, Hyunseung Chung, Joon-myoung Kwon, Dong-gyun Hong, and Edward Choi. Lead-agnostic self-supervised learning for local and global representations of electrocardiogram. In Conference on Health, Inference, and Learning, pp. 338–353. PMLR, 2022.
We thank the reviewer for their time and constructive feedback on our submission. We would like to address your points below:
[R1-C1]: Please clearly list the main contributions of the current work in the context of the previous literature, with an accurate formulation of the specific problem addressed.
[R1-C2]: Please explicitly state whether the primary contribution is zero-shot classification or learning generalized ECG features, and how your work advances the field.
[R1-C4]: Please categorize your work as self-supervised, semi-supervised, or supervised representation learning and explain why it fits that category, given the use of textual reports.
Firstly, we propose a hybrid self-supervised learning approach specifically designed for ECG-language pretraining. We would like to outline our main contributions as follows:
- We propose a transformer-based ECG encoder to process ECG signals and investigate the usage of pre-trained Flan-T5 as the text encoder to deal with clinical text reports.
- We propose contrastive components and integrate the Siglep loss into a masked autoencoder-based model, trained jointly with the MEM, MLM, and ETM losses, enabling the model to learn robust and effective modality representations.
- We propose a novel N3S technique to handle the inherent data sparsity in the MIMIC-IV ECG dataset, improving the quality of negative samples and overall model performance.
- We conduct extensive experiments including zero-shot, linear probing, and fully fine-tuned settings to showcase the robustness of our method, surpassing diverse benchmarks on over 100 cardiac conditions.
Finally, we highlight that our work learns generalized ECG features through a generative-contrastive SSL approach, which enables strong generalization across diverse tasks and datasets, benefiting both fine-tuned and zero-shot settings.
[R1-C5]: Does the implementation necessitate multi-modal data or can it also incorporate a mix of ECG-only and multi-modal data?
Our implementation is highly adaptable and can handle both multi-modal and ECG-only data. For example, the text encoder could theoretically be repurposed as an ECG encoder, so the framework is not strictly dependent on the availability of textual data and can generalize effectively to ECG-only cases.
[R1-C6]: The methodology may also depend on the accuracy and the distribution of the textual reports. Are the reports in the pre-training dataset automatically generated or written by cardiologists?
The textual reports in our pre-training dataset (MIMIC-IV-ECG) are automatically generated. It is important to note that this dataset is uniquely large and publicly available, making it highly suitable for robust self-supervised learning in ECG-language pre-training.
[R1-C7]: The addition of “diagnostic data” introduces the possibility of learning data bias. What has been done to ensure generalization?
We acknowledge this insightful comment regarding "diagnostic data". Regarding generalization, our empirical experiments demonstrate the robustness of our framework across diverse tasks on five datasets, including zero-shot and fine-tuning scenarios.
[R1-C9]: How does the proposed approach differ from other hybrid approaches combining generative and contrastive methodologies e.g. [2]?
We thank the reviewer for referring to this recently published paper. Their method is based on a single ViT-style architecture for both contrastive and generative SSL and handles ECG only, without incorporating text or other modalities. This limits its applicability to zero-shot settings, to clinical contexts where decisions often rely on both signal and textual interpretations (e.g., radiology reports, patient histories), and to advanced applications (e.g., retrieval, report generation).
Additionally, although their model was pre-trained on large combined datasets (e.g., MIMIC-IV, CODE15, UK Biobank, SAMI, IKEM, totaling ~1.3 million ECG signals), the downstream evaluations cover only a few diagnoses, namely MI, STTC, CD, and HYP. In comparison, our model (pre-trained on MIMIC-IV only) generalizes robustly across multiple datasets and over 100 cardiac conditions. We referred to this paper in our revised manuscript at Line 54.
[R1-C10]: Is the text encoder frozen or fine-tuned?
Our text encoder is fine-tuned during the pre-training step. We made changes in Line 179 to better highlight this.
[R1-C14]: Does the application also provide the possibility of automatic ECG report generation?
Our method can be extended to support text/report generation or ECG question answering when the model is fine-tuned on the given task.
[R1-C13]: What is the difference between SigLEP and ETM losses? Don't they both serve the same purpose? Clearly explain whether the alignment is performed between projections of both modes, joint representation, or the inputs and how the positive and negative pairs are defined.
[R1-C12]: Please explain in further detail where the N3S is applied as there are two cross-modal alignment losses. Is the cross-model alignment performed with single or multiple negative samples? If N3S avoids similar negative reports then the use of several positives and negatives in the batch can be incorporated in future work to avoid an expensive search for dissimilar negatives and looking explicitly for more distant reports is counter-intuitive to the goal of contrastive learning.
Siglep and ETM both support contrastive modeling, as mentioned in Lines 244-250 of our paper. We would like to highlight the key differences between them below:
- SigLEP Loss: Operates at the modality-specific feature level to perform contrastive alignment between ECG and text embeddings. Positive pairs are aligned ECG-text inputs, while negatives are mismatched pairs prepared through the N3S process. This ensures robust modality-specific alignment in a shared feature space, enhancing the discriminative power of the individual encoders.
- ETM Loss: Works at the joint fused feature level, performing binary classification to determine whether a given ECG-text pair matches. The pairing of positive and negative samples is the same as in SigLEP (see the sketch after this list).
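To make the distinction concrete, here is a minimal, illustrative sketch of the two objectives. Tensor shapes, the temperature/bias values, and the matching head are assumptions; the per-pair formulation reflects the N3S setup in which each row of the batch is either a matched pair or an N3S-replaced negative.

```python
import torch.nn.functional as F

def siglep_loss(ecg_emb, txt_emb, pair_labels, t=10.0, b=-10.0):
    # ecg_emb, txt_emb: (B, D) L2-normalized modality-specific embeddings
    # pair_labels: (B,) with +1 for matched pairs and -1 for N3S-replaced negatives
    logits = t * (ecg_emb * txt_emb).sum(dim=-1) + b     # sigmoid (SigLIP-style) pairwise logits
    return -F.logsigmoid(pair_labels * logits).mean()

def etm_loss(fused_feat, match_labels, etm_head):
    # fused_feat: (B, D) output of the cross-modal fusion module; etm_head: e.g. nn.Linear(D, 1)
    logits = etm_head(fused_feat).squeeze(-1)            # binary ECG-text matching score
    return F.binary_cross_entropy_with_logits(logits, match_labels.float())
```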
We would like to clarify N3S as follows: each batch initially contains only positive pairs. At the data-loading stage, half of the batch keeps its positive pairs, while in the other half the text reports are replaced at random by one of the top 64 farthest-distance samples (retrieved in the Flan-T5 feature space using FAISS).
[R1-C3]: The introduction mentions the scarcity of labeled data but integrating textual reports cannot be regarded as truly “unlabelled” since converting textual descriptions to labels is a trivial task, looking at the examples provided. Please clarify the definition of "unlabeled data" in this context and discuss how the approach differs from traditional supervised learning using text-derived labels.
We acknowledge that the integration of textual reports might imply a form of labeling. However, our approach treats these reports as rich, auxiliary information rather than explicit labels. This distinction is crucial:
In our context, "unlabeled data" refers to raw ECG signals without predefined categorical labels. The textual reports provide descriptive information but do not directly translate into the discrete labels typically used in supervised learning. Furthermore, explicit labeling requires clinicians' involvement, which is costly and time-consuming. We pre-trained the model using the MIMIC-IV-ECG dataset, which contains ECGs and machine-generated text reports.
Our method leverages self-supervised learning by using the semantic richness of textual data to enhance ECG representation without imposing predefined categories. This contrasts with prior supervised methods that require explicit labeling, often leading to the loss of nuanced information in the reports.
[R1-C8]: It would be a good test for generalization to predict concepts not addressed in the textual reports e.g. if age and sex are not mentioned in the reports then a supervised task can be performed for these tasks, as these labels are readily available in most datasets.
[R1-C11]: The method may not be directly comparable to SSL methodologies as the pretraining is exposed to cardiac diagnostic texts enhancing the performance for related tasks. Performance for novel features, unrelated to the context of the reports (e.g. age and sex of the subject) may be impacted adversely.
[R1-C15]: The comparative approaches are not exposed to labels while textual reports may include the labels or similar concepts during pre-training.
First, our pre-trained model is evaluated in zero-shot or fine-tuned experiments (where no text is needed, e.g., classification and identification) on unseen ECG recordings and classification tasks. The same protocol is used in the MERL work.
Second, we clarify that our approach is not explicitly exposed to labels during pretraining. While the textual reports (machine-generated, in MIMIC-IV) may inherently include various clinical concepts, this is a characteristic of clinical language and not a direct use of labeled data.
This concern is understandable, given that large models (e.g., LLaMA) have been pre-trained to handle zero-shot learning by leveraging extensive contextual knowledge, which can sometimes include label-adjacent concepts implicitly learned during pretraining.
[R1-C16]: Page 1 line 23: Abstract: “achieving 15% and 2% increases.” Please mention the specific context of the percentages mentioned.
Thank you for pointing this out. The 15% increase refers to the average performance improvement in the 1% linear probing experiments (compared to MERL), and the 2% increase to the main zero-shot learning evaluation (compared to MERL). We slightly revised the abstract to state these contexts explicitly (Line 24).
[R1-C17]: Page 2 lines (55-56): Introduction “While some recent efforts (Liu et al., 2024; Lalam et al., 2023; Li et al., 2024)”. Is the Li et al., 2024 repeated or have you missed the correct reference?
They are two different related works [4,5].
- [4] Liu, Che, et al. "Zero-shot ecg classification with multimodal learning and test-time clinical knowledge enhancement." arXiv preprint arXiv:2403.06659 (2024).
- [5] Li, Jun, et al. "Frozen language model helps ecg zero-shot learning." Medical Imaging with Deep Learning. PMLR, 2024.
[R1-C19]: Page 2 lines (77-85): No need to explain the test setup in the introduction. Please move to Section 4.1.
We adjusted the manuscript accordingly in Lines 74-77 to improve its structure and readability.
[R1-C20]: Page 3: Siglep is mentioned without explaining the acronym and the cited paper only describes “SigLIP”. Is it Siglep author’s implementation based on “SigLIP” or used in previous literature? If the latter, then please reference the correct source.
We implemented Siglep (Sigmoid Language ECG Pretraining) based on SigLIP. We clarified this in Lines 139 and 246.
[R1-C21]: Page 4 line 165: Please explain the masking and reconstruction details e.g. is the masking only performed on particular leads similar to RLM in [3] and then the leads are reconstructed? Or is it also applied to segments of the same lead?
We use RLM as an on-the-fly augmentation approach following [3], where masking is applied to entire leads rather than segments within a lead, to mimic the setting of using various lead combinations. Specifically, each lead is randomly masked with a probability of p=0.5 during pretraining. Here, we do not reconstruct the masked leads. Instead, we apply a dropout layer on the input with p=0.1 to enable masked modeling (see the sketch below). We added this context to our revised manuscript for clearer interpretation in Lines 164-166.
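A minimal sketch of this lead-level masking as an on-the-fly augmentation is shown below; the probabilities follow the description above, while the tensor layout is an assumption.

```python
import torch

def random_lead_masking(ecg, p_lead=0.5):
    # ecg: (batch, n_leads, time); each lead is zeroed out independently with probability p_lead
    keep = (torch.rand(ecg.shape[0], ecg.shape[1], 1, device=ecg.device) >= p_lead).float()
    return ecg * keep

# Masked modeling at the input is enabled via dropout rather than lead reconstruction.
input_dropout = torch.nn.Dropout(p=0.1)
```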
[R1-C22]: Page 4 line (206-215): Is the MEM loss computed in the feature space, not the signal space? If the former then how does the reconstruction loss take into account the quality of the generated signals? There may be data leakage from the features that are input to the decoder as the network may learn trivial reconstruction.
We compute the MEM loss in the signal space using a mean squared error loss. We clarified this explicitly in the revised manuscript (Lines 208-209) to avoid ambiguity. Thank you for helping us address this.
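For clarity, a minimal sketch of such a signal-space reconstruction loss is given below (the mask bookkeeping is an assumption; our implementation may differ in detail):

```python
def mem_signal_loss(reconstructed, original, mask):
    # reconstructed, original: (batch, n_leads, time) torch tensors
    # mask: 1.0 at masked signal positions, 0.0 elsewhere
    squared_error = (reconstructed - original) ** 2
    return (squared_error * mask).sum() / mask.sum().clamp(min=1)  # MSE restricted to masked positions
```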
[R1-C18]: Page 2 lines (73-76): “Additionally, we introduce a nearest-neighbor negative sampling … contextually relevant and challenging.” How is the sampling “contextually relevant and challenging”?
[R1-C23]: Page 6 line (273-274): “This makes the negative samples to be both challenging and distinct for effective contrastive learning”. The N3S technique for finding negative samples looks specifically for the most distinct reports. How does it make it more “challenging”? Also, the implementation is not very clear. Is the contrastive loss not considering the rest of the batch as negatives?
Thank you for raising these points. We would like to explain them clearly as follows:
- First, for each ECG text report in half of a batch, we look for the top 64 text reports in the training set with the largest cosine distances (using features in the small Flan-T5 space). Note that these features are pre-computed and indexed with FAISS before training.
- Second, to keep the training process challenging as batches are dynamically updated over forward steps, we randomly choose one of those 64 samples to replace the current text report.
We made changes in our revised manuscript regarding this context in Lines 269-271.
[R1-C24]: Page 6 line 313: “Our proposed model is developed based on the fairseq-signals framework in our work” The meaning of the sentence is not clear and the reference for the fairseq-signals framework is not included. If it means that is the author’s prior work then it might be a violation of the double-blind review process as the fairseq-signals framework is implemented on Git Hub.
We clarify that we are not the authors of the fairseq-signals framework. It is a widely used tool for ECG self-supervised learning. We added a proper reference to their implementation in the revised manuscript (End of Page 6).
[R1-C25]: Page 8 line 429: Have you tested removing the ETM loss?
We did not report the impact of the ETM loss in our original manuscript. Therefore, we provide additional results with and without ETM to illustrate its contribution to our pipeline, as shown below. Specifically, removing ETM slightly decreases performance across most datasets, particularly PTBXL-Super (76.2 to 73.2), highlighting its role in improving ECG-text alignment. However, the effect on CSN is minimal, suggesting dataset-specific sensitivity to ETM.
| Setting | PTBXL-Super | PTBXL-Form | CSN | CODE-Test |
|---|---|---|---|---|
| w/o ETM | 73.2 | 65.8 | 76.6 | 96.2 |
| w ETM | 76.2 | 66.1 | 76.3 | 96.8 |
[R1-C26]: Page 8 line (429-438): What is the supervised task?
The configuration for the supervised task has been detailed in Lines 423-428 (in our original submission) and Appendix Table 9. However, to enhance clarity, we have explicitly addressed this point in our revised manuscript on Line 436.
[R1-C27]: Page 9 line (470): Have you considered using more than eight transformer layers? How does the model size vary for the different architectures?
Thank you for pointing this out. We already presented our model's scaling ability in the ablation study (Table 8). However, we further extended the transformer to 12 layers and report results in the full zero-shot setting to compare directly with our proposed C-MELT configuration (#Layers=8) below:
| # Layers | PTBXL-Super | PTBXL-Form | CSN | CODE-Test |
|---|---|---|---|---|
| 8 | 76.2 | 66.1 | 76.3 | 96.8 |
| 12 | 76.4 | 69.5 | 77.9 | 97.5 |
As shown, increasing the number of layers consistently improves performance across all evaluated datasets, with notable gains in PTBXL-Form (+3.4%) and CSN (+1.6%). This further confirms the scaling strategy within our ECG encoder.
[R1-C28]: Tables: The particular metrics used are not specified in the table captions throughout the paper. Please mention the labeled dataset, the training configuration (linear probe or fine-tune), and the score in all table captions.
[R1-C32]: Figures: Please improve the figure captions and describe in more detail what is being shown.
We appreciate the suggestion. While the metrics, labeled datasets, and training configurations (e.g., linear probe or fine-tuning) are already described in the experimental configurations (Lines 363-372 in our original submission) and highlighted in Table 9 (Appendix), we agree that adding more details to the table captions would improve clarity and make it easier for readers to follow.
[R1-C29]: Tables: Are the comparisons in Tables 1 to 4 based on the author's implementation of other approaches? If so the performance for other works properly optimized for training hyperparameters for both the pre-training and supervised training? If not then do the cited references include the particular evaluation?
The comparisons in Tables 1 to 4 are not based on our implementation of other approaches. Instead, the results are derived from the respective papers and their reported benchmark comparisons. We updated this context in section 4.2 of our revised manuscript.
[R1-C30]: Table 5: What is the metric being compared and why is the student seemingly better than a cardio resident? Please verify the scores.
We have verified the AUC scores and confirmed their authenticity. The observation that medical students can be better than cardiology residents aligns with findings in prior studies (End of Page 6 in [6]). Specifically, medical students often perform better in specific evaluations due to their recent, focused training, whereas cardiology residents may not engage with detailed ECG interpretation as frequently in their daily practice.
[6] Ribeiro, Antônio H., et al. "Automatic diagnosis of the 12-lead ECG using a deep neural network." Nature communications 11.1 (2020): 1760.
We hope this message finds you well. We kindly ask whether any remaining concerns require further clarification or justification. If you find our work satisfactory given our previous responses, we would greatly appreciate your updated feedback on our submission. Thank you for your time and consideration.
"Furthermore, we introduce a contrastive objective based on the Siglep (Sigmoid language ECG pre-training) loss". It is still unclear whether "Siglep" has been used before since you say "based on SigLep". Or do you mean that we introduce "Siglep" based on Siglip?
"Furthermore, we introduce a contrastive objective based on the Siglep (Sigmoid language ECG pre-training) loss". It is still unclear whether "Siglep" has been used before since you say "based on SigLep". Or do you mean that we introduce "Siglep" based on Siglip?
In this context, we emphasized the use of an additional contrastive loss (Siglep), and we had mentioned that we adapted the SigLIP implementation for Siglep (Line 245).
Thank you for your review again and we look forward to your feedback.
SigLIP is introduced at Line 138, which I had pointed to, while this explanation only comes at Line 245. It should be clear upon introduction.
We appreciate this and have made a small adjustment at Line 138 to make it clearer upon introduction, as you suggested. Thank you for your response again.
By introducing the auxiliary information in the form of reports, the functionality is also limited to the context and the report's accuracy (automatic reports are usually inaccurate). It would have been good to test for labels outside the reports like the subject's age and sex. But you could at least mention that when comparing it with unsupervised approaches.
As noted in our previous responses in Part 2, these text reports help guide multimodal representation pre-training by naturally providing various clinical concepts. Also, the MIMIC-IV ECG dataset is uniquely large and publicly available, and recent works (e.g., MERL) have already verified its value for enhancing cardiovascular diagnosis.
Furthermore, we note again that our extensive downstream experiments evaluate the pre-trained model without any text or diagnostic usage. For example, Table 1 shows that our ECG encoder performs well on conventional classification and patient identification tasks.
Finally, it is indeed uncommon to ”test” age and sex in our clinical-focused scenarios, to the best of our knowledge.
We thank all the reviewers for their comments and useful feedback. We have just uploaded a revised version of the paper to address the comments, as also mentioned in our responses to each reviewer. For convenience, we report the main differences from the previous version here:
- Added experiments to demonstrate the effectiveness of the ETM loss and of MLM+MEM in the Appendix, Lines 828-852.
- Improved introduction with motivations from the existing literature in Lines 54-76.
- Explained modifications to ECG encoder design (e.g. positional encoding, masking ratio) in Lines 162-172.
- Improved explanation of our N3S technique in Lines 270-272.
- Added suggested references, figure captions, and language adjustments.
Dear Reviewers and Area Chairs,
We're entering the last three days of the discussion period. Four days ago, we addressed each reviewer’s comments and uploaded a revised version of our paper, which includes suggested adjustments and additional experiments. We are glad that Reviewer vCwP has updated their rating in light of our responses!
We would also greatly appreciate it if the other reviewers could have a look at our responses and consider updating their reviews if we have sufficiently addressed the remaining concerns. In particular, to Reviewer jws3, we humbly believe that our work deserves greater consideration than your current rating reflects. Should there still be unresolved questions, we would be happy to engage in further discussion.
Thank you for your time and consideration.
The paper introduces C-MELT, a framework that pre-trains ECG and text data using a contrastive masked auto-encoder architecture. C-MELT combines the strengths of generative and contrastive learning through masked language and ECG modeling to reconstruct data. It also leverages contrastive loss functions and a nearest-neighbor negative sampling strategy to improve cross-modal alignment. Extensive experiments on five downstream datasets show good performance of C-MELT. The paper is generally well-written and clearly structured, and the experimental results are comprehensive and convincing. The pre-train data is MIMIC-IV-ECG and the downstream datasets include PhysioNet 2021, PTB-XL, CSN, CPSC2018, and CODE-test - all are publicly available standard databases. The code and pre-trained model are not included during the peer review (while the authors mentioned they would release them upon acceptance).
Reviewers appreciate the comprehensive experiments and detailed discussions. However, some major concerns from different perspectives still remain. From an ML perspective, the contributions of the ECG-specific transformer, the Flan-T5 text encoder, and GPT-based prompt enhancement are incremental. From an experimental perspective, comparisons with MERL's best results should be provided. From a data perspective, the study has not generated new ECG data and/or new human annotations as resources for the field. From a clinical perspective, while C-MELT achieves higher performance, it does not provide additional functionalities to be implemented in clinical decision-making.
Additional Comments on Reviewer Discussion
There were detailed discussions between the reviewers and authors. The reviewers appreciate the comprehensive experiments and detailed discussions during the rebuttal phase. Since this paper received highly diverse review scores, the AC called for discussion among the reviewers and the AC. jws3 maintains their score with high confidence (5). vCwP maintains their score with relatively low confidence (3).
Reject