PaperHub
6.0/10 · Poster · 4 reviewers (ratings 5, 3, 4, 3; mean 4.0, min 3, max 5, std 0.8)
Confidence
Novelty: 2.8 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.5
NeurIPS 2025

DAAC: Discrepancy-Aware Adaptive Contrastive Learning for Medical Time Series

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-11-15
TL;DR

A contrastive learning framework integrating discrepancy estimation and adaptive attention for medical time-series diagnosis.

Abstract

Keywords
Medical Time Series · Contrastive Learning · Multi-View Representation · Discrepancy Estimation · Disease Diagnosis

Reviews and Discussion

Review (Rating: 5)

The paper introduces DAAC (Discrepancy-Aware Adaptive Contrastive Learning), a framework aimed at enhancing the generalization of medical time-series models, particularly when labeled data is scarce and collected from single centers. DAAC integrates two core modules: a Discrepancy Estimator, which learns the distribution of normal samples using a GAN-enhanced encoder-decoder and calculates reconstruction errors to represent abnormality, and an Adaptive Contrastive Learner, which leverages multi-head attention to contrast multiple data views (subject, trial, epoch, and temporal) without relying on handcrafted positive/negative sample pairs. This design mitigates overfitting, supports robust representation learning, and improves model performance in downstream diagnostic tasks. Extensive experiments across three medical datasets—Alzheimer’s, Parkinson’s, and myocardial infarction—demonstrate that DAAC significantly outperforms existing baselines, even with only 10% labeled data. The method excels in both partial and full fine-tuning scenarios, achieving state-of-the-art results across multiple evaluation metrics. Visualizations of multi-view embeddings confirm the effectiveness of the contrastive design in capturing subject- and view-specific information. While DAAC already shows strong results, the authors acknowledge that its discrepancy estimation could be further refined for more precise integration of external knowledge, suggesting room for future improvements in balancing auxiliary data with target domain characteristics.

Strengths and Weaknesses

Strengths: DAAC demonstrates notable strengths in addressing two major challenges in medical time-series analysis: limited labeled data and generalization beyond single-center datasets. By combining a GAN-based discrepancy estimator with a hierarchical contrastive learning framework, it effectively leverages abundant external normal samples without requiring handcrafted contrastive pairs. The model incorporates subject, trial, epoch, and temporal levels in its contrastive objective, resulting in robust, multiscale representations. Extensive experiments across diverse disease datasets show that DAAC consistently outperforms state-of-the-art baselines, even when using only 10% of labeled data. Its modular design and strong empirical results underscore its practicality for real-world clinical deployment in low-resource settings.

Weaknesses: Despite its strengths, DAAC has certain limitations. The use of a single reconstruction-based discrepancy signal from external normal data may provide only a coarse representation of abnormality, which could limit the precision of feature augmentation. While the model avoids manual pairing in contrastive learning, the multi-level and multi-view loss architecture introduces significant complexity in training, requiring careful tuning of multiple hyperparameters. Additionally, the potential for distributional mismatch between external normal data and target patient data may still affect generalization if not managed properly. Future work is needed to refine discrepancy modeling and explore more adaptive ways to balance external and target domain knowledge.

Questions

Q1) The proposed method was tested on only 2 medical modalities, EEG and ECG. What are the next target modalities in the medical domain? Q2) How about outside of the medical domain?

Limitations

The authors state that “We explicitly discuss limitations in Section 6”, but the main text ends with “5 Conclusion”. Where is Section 6?

Final Justification

After thoughtful rebuttal discussion, I keep my rating.

Formatting Issues

Nothing particular except missing Section 6.

Author Response

Dear Reviewer,

Thank you very much for your thoughtful review and constructive feedback on our submission. We truly value the time and effort you have dedicated to evaluating our work. Please find our point-by-point responses to your comments below.

Q1: Discrepancy signal granularity

A1: Thank you for raising this insightful question. We intentionally design the discrepancy feature as a sequence-level reconstruction error to achieve greater robustness and generalization, especially in the first phase of our three-stage training pipeline. This design aligns with our coarse-to-fine learning strategy, where early stages focus on global guidance and later stages handle finer representations. While finer-grained signals (e.g., point-wise or channel-wise errors) may capture local anomalies, the presence of distribution shifts between external and target datasets can make these fine-grained discrepancies prone to overfitting. This may introduce harmful biases into the downstream representation learning of the ACL module and compromise its generalization ability.

To validate our design choice, we implemented and compared the following discrepancy feature variants; results are included in Appendix E (Discrepancy Estimator Study). As shown in the table below, while the channel-wise MSE vector and point-wise error sequence alternatives capture more local detail, they yield inferior performance compared to our scalar sequence-level discrepancy. We attribute this to their sensitivity to non-generalizable local noise patterns from external data, which can introduce instability in early-stage training and hinder the learning of transferable features.

Moreover, we emphasize that this discrepancy signal is only used in the first stage as a weak, auxiliary cue. It is later refined and complemented by the multi-level Adaptive Contrastive Learner (ACL), which captures detailed temporal and spatial structures. Therefore, our choice reflects a principled balance between stability, generality, and progressive refinement, and we find it to be both effective and practical across diverse tasks.

Discrepancy Feature | AD AUROC | TDBrain AUROC | PTB AUROC
Sequence-level MSE scalar (ours) | 98.03 | 97.63 | 95.56
Channel-wise MSE vector | 97.36 | 97.58 | 95.44
Point-wise error sequence | 96.71 | 96.86 | 95.37
Cluster-wise distance | 96.28 | 96.44 | 94.95
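As a concrete illustration of the variants compared in the table above, here is a minimal sketch of how such discrepancy features could be computed from one window and its reconstruction. Function and key names are ours, not the paper's; the actual DAAC implementation may differ:

```python
import numpy as np

def discrepancy_variants(x, x_hat):
    """Illustrative discrepancy features for a window x and its
    reconstruction x_hat, both of shape (T, C) = (time, channels)."""
    err = (x - x_hat) ** 2
    return {
        "sequence_mse": float(err.mean()),    # scalar, sequence-level (ours-style)
        "channel_mse": err.mean(axis=0),      # (C,) channel-wise vector
        "pointwise_error": err.mean(axis=1),  # (T,) point-wise sequence
    }

# Synthetic window with a small reconstruction error added
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 16))
x_hat = x + 0.1 * rng.standard_normal((256, 16))
feats = discrepancy_variants(x, x_hat)
assert feats["channel_mse"].shape == (16,)
assert feats["pointwise_error"].shape == (256,)
```

The scalar variant discards local structure by design, which is exactly the robustness-versus-granularity tradeoff argued for above.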

Q2: Loss complexity and its hyperparameter stability

A2: We appreciate this constructive comment and have added a detailed sensitivity analysis in Appendix G.2. To clarify, the 1:1:1:1 weights for the four hierarchical loss components follow prior work [2], which established this as a stable default. Our analysis in the added Appendix D2 further supports this configuration's robustness. For the view-level loss L_V, we conducted additional experiments by varying its weight. Due to space limitations, we can only include part of the results here; more results are shown in the added Table D1. The key findings are:

  • Effectiveness: Including L_V significantly improves performance, confirming its necessity.
  • Saturation: Performance gains diminish as the weight increases, plateauing around a weight of 2.
  • Robustness: A weight of 2 provides consistently strong results across all datasets.

In summary, our proposed loss design, including the weight for L_V, is both effective and robust. We note that joint optimization of all weights is a promising direction for future work.

Dataset | Setting | Weights | Accuracy | Precision | Recall
AD | 100 | 1,1,1,1,0 | 84.50±4.46 | 88.31±2.42 | 82.95±5.39
AD | 100 | 1,1,1,1,1 | 88.50±4.20 | 90.20±3.60 | 87.30±4.90
AD | 100 | 1,1,1,1,2 | 91.16±4.14 | 92.10±3.66 | 90.50±4.51
AD | 100 | 1,1,1,1,3 | 91.02±3.15 | 92.06±3.54 | 90.58±5.11
AD | 10 | 1,1,1,1,0 | 91.43±3.12 | 92.52±2.36 | 90.71±3.56
AD | 10 | 1,1,1,1,1 | 92.30±2.20 | 92.80±2.10 | 92.10±2.50
AD | 10 | 1,1,1,1,2 | 92.77±2.35 | 92.92±2.43 | 92.55±2.39
AD | 10 | 1,1,1,1,3 | 92.86±2.44 | 92.33±3.61 | 92.63±2.01
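For concreteness, the weighting scheme discussed above (1:1:1:1 for the four hierarchical terms, plus a separately weighted view-level term L_V) can be sketched as follows. The function name and the toy per-term loss values are illustrative assumptions, not taken from the paper's code:

```python
def daac_total_loss(l_subject, l_trial, l_epoch, l_temporal, l_view,
                    view_weight=2.0):
    """Combine the hierarchical contrastive losses with 1:1:1:1 weights
    and a separately weighted view-level term (a weight of 2 is reported
    above as a robust choice)."""
    return l_subject + l_trial + l_epoch + l_temporal + view_weight * l_view

# The "1,1,1,1,2" setting from the table, with toy per-term values:
total = daac_total_loss(0.5, 0.4, 0.3, 0.2, 0.1, view_weight=2.0)  # ≈ 1.6
```

Setting `view_weight=0` recovers the "1,1,1,1,0" ablation row, i.e., dropping L_V entirely.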

Q3: About the risk of distribution mismatch between external data and target data

A3: Thank you for this thoughtful and important question. We agree that relying on external normal data assumes a certain level of domain alignment. In practice, however, our external and target datasets exhibit notable differences, including variations in population demographics and acquisition protocols. For example, in the AD task, the external dataset ([1]) has an average subject age of 63.6 years, while the target dataset ([2]) has an average age of 72.5 years. When training the Discrepancy Estimator (DE) using this external dataset, we indeed observe signs of negative transfer—specifically, in Table 2, the AUROC of COMET+DE drops from 94.44 to 93.97. Despite this, DAAC still outperforms both COMET+ACL and COMET+DE in the same setting, as shown in Table 2. This suggests that the Adaptive Contrastive Learner (ACL) effectively mitigates the potential negative impact introduced by domain-shifted external data, and in some cases, even corrects for suboptimal guidance from the DE. This robustness highlights the strength of DAAC’s multi-stage design in handling real-world domain variability. We have added the related discussion in Appendix C3.

Q4: The proposed method was tested on only 2 medical modalities, EEG and ECG. What are the next target modalities in the medical domain? How about outside of the medical domain?

A4: Thank you for this constructive question. Although our current experiments are limited to EEG and ECG modalities, DAAC is designed as a general and extensible framework for learning representations from hierarchical electrophysiological signals, and is not restricted to specific data types. In future work, we plan to extend DAAC to additional modalities that share similar structural properties, including but not limited to: EMG, ECoG, MEG, fNIRS, PPG, respiratory signals, and GSR. These modalities are widely used in clinical diagnostics, cognitive neuroscience, and human-computer interaction, and typically exhibit clear subject, trial, epoch, and temporal organization, which aligns well with DAAC's multi-level contrastive learning strategy. We believe that applying DAAC to these broader signal types will not only validate the method’s generalizability, but also greatly enhance its practical utility across diverse real-world applications. While our current work focuses on medical time-series, we believe DAAC can also be applied to domains such as fault diagnosis, predictive maintenance, and battery life prediction. These tasks often involve hierarchical time-series and have abundant healthy data but limited labeled faults—conditions that align well with DAAC’s design, particularly the use of external normal data for discrepancy estimation. We see this as a promising future direction.

Q5: The authors state that “We explicitly discuss limitations in Section 6”, but the main text ends with “5 Conclusion”. Where is the section 6?

A5: Thank you for pointing this out. This was a formatting oversight in our submission. The limitations are indeed discussed in the final part of Section 5 (Conclusion). We have corrected the section reference and will ensure that this is clearly marked in the revised version.

Your insights have been instrumental in helping us improve the clarity and rigor of our paper. We have carefully addressed each concern and believe that the revised version better reflects the contributions and robustness of our approach.

Sincerely,

The Authors

References

[1] Miltiadous, Andreas, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Nikolaos Grigoriadis, Dimitrios G. Tsalikakis, Pantelis Angelidis, et al. "A dataset of EEG recordings from: Alzheimer's disease, Frontotemporal dementia and Healthy subjects." OpenNeuro 1 (2023): 88.

[2] Escudero, J., Abásolo, D., Hornero, R., Espino, P., and López, M. "Analysis of electroencephalograms in Alzheimer's disease patients with multiscale entropy." Physiological Measurement 27(11) (2006): 1091.

Comment

Thank you very much for your thoughtful rebuttal discussion. Now I have a clearer vision for the paper. Although the proposed method has only been verified with a limited dataset, I believe that the idea behind it will be of great interest to the general audience at NeurIPS.

Review (Rating: 3)

The manuscript introduces DAAC, a learnable multi-view contrastive framework that leverages an external set of normal (i.e., healthy) samples and an adaptive contrastive strategy to strengthen representation learning. Extensive experiments on three clinically heterogeneous datasets—Alzheimer's disease (AD), Parkinson's disease (PD), and myocardial infarction (MI)—demonstrate that DAAC consistently outperforms competing methods, even when only 10 % of the target data are labelled.

Strengths and Weaknesses

Strengths

  1. Broad empirical validation: Experiments span three distinct modalities and disease types, and the proposed method improves average performance in every setting.

  2. Label-efficient representation learning: DAAC obtains contrastive embeddings without requiring additional labels, facilitating deployment in domains where annotation is costly.

  3. Comprehensive baselines: The study benchmarks against several strong methods, providing a clear view of relative gains.

Weaknesses

  1. Scale of the external normal set

    The paper repeatedly refers to a “large-scale” normal pool, yet its concrete size and the sensitivity of downstream performance to that size remain unclear. Given that gathering sizeable healthy cohorts is not always trivial in medical contexts, an ablation that varies normal-set size (e.g., 1 k → 5 k → 10 k subjects) and quantifies its effect on target-task metrics would greatly clarify practical requirements.

  2. Assumption on reconstruction error (L 139 - 140)

    The text states that the reconstruction error E is “smaller for healthy samples … and larger for abnormal samples.” This hypothesis should be verified experimentally—e.g., by plotting the error distributions for normal vs. abnormal cases or by reporting AUROC when E is used as an anomaly detector.

  3. Architectural transparency

    Details of the encoder/decoder designs (layer counts, filter sizes, activation functions, parameter counts) and the reasoning behind their selection are missing. Providing a table or schematic—ideally with a brief discussion of alternatives considered—would improve reproducibility.

Questions

  1. Operationalising “large-scale” normal set

    Please define the absolute size (number of subjects / images) and class balance of the external normal pool used in each experiment. In addition, supply an ablation (e.g., 10 %, 25 %, 50 %, 100 % of your current pool) showing how downstream performance varies with normal-set size.

    Criteria: If ablation shows (i) the minimal size needed to reach ≥95 % of the reported gain, and (ii) performance degrades gracefully when data are scarce, I will view scalability concerns as addressed and raise my score. Missing or inconclusive evidence would lower confidence.

  2. Empirical verification of the reconstruction-error assumption (L139-140)

    Provide quantitative evidence (e.g., AUROC, distribution plots) that healthy samples indeed yield lower reconstruction error E than abnormal samples across all datasets.

    Criteria: In all three data sets, the distribution of normal and abnormal reconstruction errors is shown, supporting what is claimed in L139-140.

  3. Architectural transparency and rationale

    Supply a concise table (layers, filter sizes, activation functions, parameter count) for both encoder and decoder, plus a short justification for key design choices (e.g., depth, latent dimension).

    Criteria: Providing full specs and reasoning (or result of preliminary experiments to determine the architecture) would resolve this weakness and may improve my reproducibility criterion.


Minor points.

  • The left-hand subplot is labeled “surpervised.” Please correct this to “supervised.”
  • Formatting of Table 2.: What purpose does the boldface serve? Since an asterisk (*) already denotes the proposed method, using bold to mark top performance may confuse readers—particularly when multiple entries appear in bold.

Limitations

No. Although the manuscript outlines several methodological constraints, it omits two practical limitations that may affect real-world adoption:

  • Data-collection burden for large normal cohorts. DAAC relies on a sizeable external pool of healthy cases, yet assembling such cohorts can be logistically demanding and financially expensive. Quantifying the sensitivity of performance to normal-set size or discussing data-sharing partnerships would clarify feasibility.

  • Privacy considerations for normal data. Even “healthy” clinical data can contain identifiable information. A brief discussion of de-identification, consent, and compliance with regulations (e.g., HIPAA, GDPR) would strengthen the societal-impact analysis.

Addressing these points would provide a more balanced view of the framework's limitations and ethical implications.

Final Justification

The authors are to be commended for performing additional experiments, such as ablation studies, which contribute positively to the empirical assessment. That said, the explanation of the criteria underlying model selection is incomplete. While part of the concern has been addressed, prompting me to raise my overall rating, I still judge the work to fall short of the acceptance threshold.

Formatting Issues

No

Author Response

Dear Reviewer,

We extend our sincere appreciation for your time and expertise in reviewing our work. Our detailed responses to your comments are presented below.

Q1: Minor issues – typo in ‘surpervised’; unclear bold formatting in Table 2.

A1: Thank you for pointing these out. We have corrected the typo "surpervised" to "supervised" in the revised figure. We have also clarified the table formatting: boldface now consistently indicates the best performance per task, and underline denotes the second best. Meanwhile, we have updated the captions of Table 1 and Table 2 accordingly to avoid term confusion, which greatly improves clarity. We appreciate your careful reading and helpful suggestions.

Q2: Please define the absolute size (number of subjects / images) and class balance of the external normal pool used in each experiment. In addition, supply an ablation (e.g., 10 %, 25 %, 50 %, 100 % of your current pool) showing how downstream performance varies with normal-set size.

A2: We appreciate the reviewer’s concern regarding the feasibility and value of using a large-scale healthy population pool. To address this, we conducted ablation experiments on all three datasets (AD, PTB, TDBrain), varying the proportion of healthy individuals used for pretraining (5%, 25%, 50%, and 100%), as shown in Table E3. For each proportion, we further evaluated performance under different amounts of labeled training data (5%, 25%, 50%, 100%). Due to space constraints, we only present the results for the AD and PTB datasets here, but consistent trends were observed across all datasets. We have updated Appendix E3 to clarify this point and provide additional results. Key findings:

  • Across all datasets and settings, increasing the size of the healthy population consistently leads to better model performance.
  • The improvement persists even in low-resource settings, where contrastive pretraining helps compensate for limited supervision.
  • The performance trend is stable and monotonic, suggesting that more external healthy data continues to benefit the model without introducing instability.
  • These results confirm that the proposed approach can adapt to different scales of external data and highlight the practical value of incorporating healthy population information, even when collected in limited amounts. This validates the scalability and deployment feasibility of our method, and demonstrates that the use of external healthy individuals is both effective and justifiable.
Dataset | Label Fraction (%) | External Ratio | Accuracy | Precision | Recall | F1 Score | AUROC | AUPRC
AD | 100 | 100% | 93.23±5.25 | 94.01±3.98 | 92.71±5.86 | 92.97±5.65 | 98.03±1.71 | 98.03±1.75
AD | 100 | 50% | 92.21±3.89 | 93.48±3.94 | 92.22±4.61 | 92.70±4.13 | 97.10±3.21 | 97.96±3.28
AD | 100 | 25% | 91.45±4.21 | 93.10±4.55 | 91.73±4.33 | 92.04±4.14 | 96.84±3.95 | 96.67±4.22
AD | 100 | 5% | 91.23±4.01 | 92.15±3.06 | 90.70±4.01 | 91.88±4.33 | 96.34±2.22 | 96.13±2.23
AD | 10 | 100% | 94.67±1.84 | 94.93±1.76 | 94.41±1.99 | 94.58±1.89 | 98.10±1.20 | 98.19±1.15
AD | 10 | 50% | 93.24±2.86 | 94.01±2.91 | 93.76±2.63 | 93.22±2.67 | 98.01±1.72 | 97.94±1.88
AD | 10 | 25% | 92.93±3.19 | 93.56±3.25 | 93.24±3.60 | 93.17±3.14 | 97.87±2.10 | 97.66±2.41
AD | 10 | 5% | 92.87±2.35 | 93.02±2.43 | 92.89±2.37 | 93.42±2.87 | 97.50±1.45 | 97.49±1.57
PTB | 100 | 100% | 93.65±2.35 | 93.28±1.86 | 85.39±3.84 | 87.64±3.27 | 95.56±1.57 | 90.28±3.49
PTB | 100 | 50% | 92.60±2.12 | 92.89±2.63 | 85.30±4.36 | 85.71±3.44 | 95.30±2.34 | 90.20±3.47
PTB | 100 | 25% | 92.12±3.15 | 92.55±2.90 | 84.54±5.19 | 85.69±4.31 | 94.27±2.83 | 90.16±3.98
PTB | 100 | 5% | 91.50±2.03 | 91.89±2.83 | 84.40±4.09 | 85.67±3.71 | 94.23±1.94 | 90.05±3.20
PTB | 10 | 100% | 91.35±2.77 | 91.56±3.12 | 85.23±2.65 | 85.61±3.56 | 96.30±3.22 | 97.87±4.02
PTB | 10 | 50% | 91.05±3.10 | 91.19±2.96 | 84.91±3.92 | 85.55±3.62 | 96.26±3.42 | 96.88±3.55
PTB | 10 | 25% | 90.93±3.48 | 90.61±3.27 | 84.52±4.34 | 85.48±3.89 | 96.07±3.88 | 96.03±4.12
PTB | 10 | 5% | 90.80±2.93 | 90.06±2.47 | 83.80±4.86 | 85.43±3.27 | 95.91±3.51 | 95.80±3.39

Q3: Provide quantitative evidence (e.g., AUROC, distribution plots) that healthy samples indeed yield lower reconstruction error E than abnormal samples across all datasets.

A3: We appreciate the reviewer’s suggestion and have conducted additional analysis to empirically validate the assumption. Because the submission policy disallows figures in the rebuttal, we provide a textual description of the key results and refer the reviewer to Figure F1 (added to the revised manuscript) in Appendix F.2 for visual confirmation.

We performed Kernel Density Estimation (KDE) on the reconstruction errors of both normal and abnormal samples, as shown in Figure F1. The normal samples exhibit a sharp peak centered around a low reconstruction error (~0.2), with very narrow spread, indicating consistent low-error reconstructions. In contrast, the abnormal samples follow a positively skewed distribution, with reconstruction errors more broadly distributed and a long tail extending toward higher values (~1.0). This empirical result confirms that the model consistently assigns higher reconstruction errors to abnormal samples.

Furthermore, using the reconstruction error as the anomaly score yields an AUROC of 0.974 on the AD dataset, quantitatively supporting the model’s discriminative capability between normal and abnormal samples.
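To illustrate how a reconstruction error can serve as an anomaly score, here is a small, self-contained sketch using a rank-based AUROC (the Mann-Whitney formulation). The function name and error values are toy assumptions mimicking the distributions described above, not the actual data behind the reported 0.974:

```python
def auroc_from_errors(normal_errors, abnormal_errors):
    """AUROC of 'reconstruction error as anomaly score': the probability
    that a randomly chosen abnormal sample has a higher error than a
    randomly chosen normal one (ties count as 0.5)."""
    wins = sum(1.0 if a > n else 0.5 if a == n else 0.0
               for a in abnormal_errors for n in normal_errors)
    return wins / (len(abnormal_errors) * len(normal_errors))

# Toy values: normal errors tightly peaked near 0.2; abnormal errors
# positively skewed, with a long tail toward ~1.0.
normal = [0.18, 0.19, 0.20, 0.21, 0.22]
abnormal = [0.21, 0.25, 0.40, 0.80, 1.00]
print(auroc_from_errors(normal, abnormal))  # → 0.94
```

Perfectly separated error distributions would give 1.0; the single overlapping value at 0.21 is what pulls the toy score below that.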

Q4: Supply a concise table (layers, filter sizes, activation functions, parameter count) for both encoder and decoder, plus a short justification for key design choices (e.g., depth, latent dimension).

A4: We appreciate the reviewer’s feedback regarding the architectural transparency of our model. To address this, we have included detailed architectural specifications in Table G1, which now clearly outlines the encoder type, dimensional configurations (input/output/hidden/channel/head), activation function, and parameter count. These additions ensure that all design choices (layer counts, attention structure, and parameter sizes) are explicitly stated.

Component | Setting
Encoder Type | DaulMultiHeadTSEncoder
Input Dimensions | Task-specific (e.g., 1 or 2 for AD)
Output Dimensions | 320
Hidden Dimensions | 64
Encoder Depth | 10
Number of Heads | 2
Head Dimension | 160
Channel Dimension | 320
Activation Function | ReLU (implicit via submodules)
Parameter Count (AD) | 946,368

Q5: The privacy considerations involved even when using “healthy” clinical data.

A5: We agree that even healthy clinical data may raise privacy concerns. As noted in our experimental section, all datasets used in this study are publicly available and had received institutional review board (IRB) approval prior to release, which ensures appropriate de-identification and regulatory compliance (e.g., HIPAA, GDPR).

Reply to the concerns on significance and clarity

We really appreciate your valuable insights. These comments allow us to further improve the quality of our work through revision. We hope the responses have fully addressed your questions and provide strong support for the significance of our method. With all these designs, our approach demonstrates significant performance improvements across multiple datasets (AD, PTB, TDBrain), even in low-resource settings. For example, on the AD dataset, even with only 5% of external healthy data, the model significantly outperforms the baseline without DE pretraining. This validates the effectiveness and practical adaptability of our framework.

In addition, we take your feedback regarding clarity seriously. Accordingly, we have carefully revised our paper, including clarifying the method definitions (see Table 1), architectural details (see Table G1), and experimental setup (see Appendices E and F). A native speaker has also been asked to further improve the narrative of the manuscript. We hope these efforts alleviate your clarity concerns. Once more, thank you very much for your time and suggestions.

Best regards,

The Authors

Comment

Thank you for your thoughtful and detailed rebuttal, and for your efforts to enhance the clarity and completeness of the manuscript. I appreciate the updates regarding the ablation studies, reconstruction-error analysis, architectural transparency, and privacy discussion.

On Q2 (Normal-set size), I find the ablation results at multiple scales (5%, 25%, 50%, 100%) to be sufficiently informative. These results demonstrate that performance improves consistently as the size of the external normal set increases, and that the method remains effective even when only a small portion of normal data is available. This provides strong support for the scalability and practical adaptability of the proposed approach, particularly in low-resource clinical settings.

For Q3 (Reconstruction error), the KDE analysis and AUROC score offer convincing support for the hypothesis stated in the original manuscript. I appreciate that these were added, and that clear evidence was included in the revised version.

Regarding Q4 (Architecture), I welcome the added specification in Table G1. However, the rationale behind the chosen architecture—why this Transformer configuration was used over alternatives—remains underexplored. I encourage the authors to provide at least a brief justification or empirical motivation in the final version, especially to help future researchers understand design decisions and assess generalizability.

On Q5 (Privacy), while the current datasets are publicly available and IRB-approved, I believe the paper would benefit from acknowledging potential privacy risks that may arise in real-world clinical deployments. Even “healthy” data can carry risks of re-identification or linkage, particularly in small or demographically homogeneous cohorts. Briefly noting this limitation would strengthen the discussion.

Lastly, I appreciate the revisions made to improve clarity and presentation, including the use of a native speaker for proofreading. These efforts are evident and improve the overall quality of the manuscript.

In summary, while some open questions remain, I believe the authors have provided thoughtful and well-reasoned responses within the scope of what is feasible during the rebuttal phase. With minor clarifications and additional contextualization, the manuscript would be further improved.

Comment

Dear Reviewer LgXM,

Thank you for your constructive and encouraging feedback following our rebuttal. We are particularly grateful for your acknowledgment of the "thoughtful and well-reasoned responses" and that our revisions have "improved the overall quality of the manuscript." Your assessment of our scalability analysis (Q2) as "sufficiently informative" and "strong support for practical adaptability" is especially appreciated, as is your validation of our reconstruction-error verification (Q3) as providing "convincing support."

We are pleased that our efforts to enhance clarity and completeness have been well-received, and that the transparency improvements have addressed initial concerns about reproducibility. The finding that our method demonstrates "consistent performance improvements" and remains "effective even with limited normal data" reinforces the practical value we aimed to demonstrate.

Regarding your remaining suggestions:

  1. Architecture rationale: We will incorporate a brief empirical justification for our Transformer configuration in the camera-ready version, including preliminary experiments that guided our design choices.

  2. Privacy considerations: We will add a discussion acknowledging potential re-identification risks in real-world deployments, particularly for small or demographically homogeneous cohorts, as you suggest.

Given that our responses were "within the scope of what is feasible during the rebuttal phase" and that "minor clarifications" would further improve the manuscript, we will revise the manuscript as described above. We believe the final version will achieve the quality expected for this venue.

Thank you again for your thorough engagement with our work and for guiding us toward a stronger manuscript.

Best regards,

Authors of Submission #10704

Review (Rating: 4)

This paper proposes DAAC (Discrepancy-Aware Adaptive Contrastive learning), a novel framework designed to address the challenges of limited labeled data and single-center bias in medical time-series analysis. It has two core components: a Discrepancy Estimator and an Adaptive Contrastive Learner. Extensive experiments on three clinical datasets (Alzheimer’s disease, Parkinson’s disease, and myocardial infarction) demonstrate that DAAC achieves superior generalization and diagnostic performance, particularly under low-label regimes.

Strengths and Weaknesses

Strengths: The paper focuses on a very important domain. The experiments demonstrating fine-tuning for downstream classification tasks are comprehensively presented. The proposed method shows strong performance in downstream classification tasks, even with limited labeled data.

Weaknesses: There are many terms referring to different kinds and sources of data, including normal data/external data/target data/original data, which is quite confusing and makes the functionality of the Discrepancy Estimator unclear. The experimental setup is also unclear and somewhat confusing. Many abbreviations are never expanded, especially in the ablations, which makes the effectiveness of each part hard to interpret.

Questions

From my understanding, the discrepancy is just an augmented feature of the original data, so why do you choose discrepancy as this additional feature? What about other metrics, such as the distance from a data point to the cluster of normal/disease data? There should be some intuition. Which part do you think contributes the most to the final improvement? Can you define the kinds and sources of data, including normal data/external data/target data/original data?

Limitations

As above

Final Justification

I recommend 4. The authors have resolved my concerns.

Formatting Issues

N/A

Author Response

Dear Reviewer,

We sincerely appreciate your thoughtful review and helpful suggestions. We have addressed all comments and provided detailed responses below.

Q1: The paper uses many terms referring to data types (e.g., normal/external/target/original data) and includes unexplained abbreviations (especially in ablation studies), leading to confusion about the Discrepancy Estimator’s functionality and the experimental setup.

A1: Thank you for your suggestion. We have revised the model names for consistency and clarity, and provided clear definitions for each model in the captions of Table 1 and Table 2. Also, we additionally provide explanations for all key terms about data types in Appendix C. In the medical field, datasets from different hospitals often exhibit a certain degree of distributional shift. Models trained on data from one hospital may demonstrate unstable performance when applied to others. To enhance model generalization, we propose leveraging DE (Discrepancy Estimator) to incorporate data from other hospitals. Although distributional differences exist, we argue that the sequential patterns of normal data (i.e., data from healthy patients) can still provide valuable guidance for diagnosis in the target hospital. In this study:

  • The dataset from the current hospital is referred to as the internal dataset (or target dataset), which participates in the second and third stages of model training.
  • Datasets from other hospitals, used for guidance, are termed external datasets; these come from healthy patients and contribute to the first stage of training.

Q2: The discrepancy is just an augmented feature of the original data, so why do you choose discrepancy as this additional feature? What about other metrics, such as the distance from a data point to the cluster of normal/disease data?

A2: Thank you for raising this thoughtful question. We agree that the discrepancy feature serves as an auxiliary signal to enrich the original data representation. Our choice to use the sequence-level reconstruction error stems from the need for a robust signal that generalizes well under the potential distribution shift from the external dataset. Specifically, this design:

  • Avoids overfitting to local noise or domain-specific patterns, which can occur when using fine-grained or structure-sensitive alternatives (e.g., point-wise or cluster-based distances).
  • Aligns with our coarse-to-fine training strategy, where this global signal (i.e., the discrepancy) guides the early stage and is progressively refined by the Adaptive Contrastive Learner (ACL) in later stages.

To validate this choice, we further conducted ablation experiments (see Appendix F) comparing our method with alternative discrepancy features, including:

  • a channel-wise MSE vector (16-dim),
  • a point-wise reconstruction error sequence (256-dim),
  • a cluster-wise latent distance to normal data.

As shown in Table F1, while these alternatives incorporate more structural or local cues, they consistently underperform compared to our scalar discrepancy. We attribute this to their higher sensitivity to noise and domain shift, which can degrade early-stage learning. In summary, our discrepancy design is not arbitrary: it reflects a careful tradeoff between stability, informativeness, and generalizability, and is empirically validated across diverse datasets and settings.
| Discrepancy Feature | AD AUROC | TDBrain AUROC | PTB AUROC |
| --- | --- | --- | --- |
| Sequence-level MSE scalar (ours) | 98.03 | 97.63 | 95.56 |
| Channel-wise MSE vector | 97.36 | 97.58 | 95.44 |
| Point-wise error sequence | 96.71 | 96.86 | 95.37 |
| Cluster-wise distance | 96.28 | 96.44 | 94.95 |
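As a rough illustration of the variants compared above, the following sketch shows how each discrepancy feature could be computed from a sample and its reconstruction. This is our own minimal stand-in, not the paper's implementation: `x` and `x_hat` denote a hypothetical input window and its encoder-decoder reconstruction, both shaped (channels, timesteps).

```python
def sequence_level_mse(x, x_hat):
    """Single scalar: MSE over all channels and timesteps (the paper's choice)."""
    n = len(x) * len(x[0])
    return sum((a - b) ** 2
               for row, row_hat in zip(x, x_hat)
               for a, b in zip(row, row_hat)) / n

def channel_wise_mse(x, x_hat):
    """One MSE value per channel (e.g., a 16-dim vector for 16 channels)."""
    return [sum((a - b) ** 2 for a, b in zip(row, row_hat)) / len(row)
            for row, row_hat in zip(x, x_hat)]

def point_wise_error(x, x_hat):
    """Squared error at each timestep, averaged over channels (e.g., 256-dim)."""
    t, c = len(x[0]), len(x)
    return [sum((x[ch][i] - x_hat[ch][i]) ** 2 for ch in range(c)) / c
            for i in range(t)]
```

The scalar from `sequence_level_mse` would then be appended to the original features as the auxiliary discrepancy signal, while the vector-valued alternatives carry more local structure but, per the ablation, are more sensitive to noise.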

Q3: Which part do you think contribute the most to the final improvement?

A3: Thank you for the valuable question. As shown in Table 2, the Adaptive Contrastive Learner (ACL) provides the most consistent and significant improvements across all datasets under the full supervision setting (100% labeled data). However, under limited-label conditions (e.g., 10% labeled data), the Discrepancy Estimator (DE) becomes increasingly important. Therefore, the performance gain of DAAC comes from the synergistic integration of both components, rather than relying on a single dominant module. Specifically, DAAC’s effectiveness is amplified through:

  • Coarse anomaly guidance from DE, which helps ACL focus on more informative signals.
  • Multi-head adaptive view construction in ACL, enabling more diverse and robust contrastive pairs for better representation learning.

We emphasize that the largest performance gains emerge when DE and ACL are progressively combined in a coarse-to-fine manner, as demonstrated by the full DAAC outperforming partial variants such as COMET+DE and COMET+ACL.

We are deeply grateful for your constructive feedback. We hope that our detailed responses above address your questions and suggestions.

Best regards,

The Authors

审稿意见
3

This paper proposes a framework, Discrepancy-Aware Adaptive Contrastive learning (DAAC), for medical time-series classification, aiming to address challenges of limited labeled data and single-center bias. The complete framework involves pre-training a Discrepancy Estimator, using it to augment the target dataset, training the Adaptive Contrastive Learner in a self-supervised manner, and finally, fine-tuning for downstream diagnostic tasks. The authors demonstrate the effectiveness of DAAC on three clinical datasets (Alzheimer's, Parkinson's, and myocardial infarction), showing that it outperforms existing methods, especially in low-label scenarios.

优缺点分析

Strengths

-The method is validated on three distinct clinical datasets (Alzheimer's, Parkinson's, and Myocardial Infarction), demonstrating its applicability across different medical time-series modalities (EEG, ECG).

-The experiments are robust, testing the model's performance in both high-label (100%) and, crucially, low-label (10%) scenarios, which directly supports the paper's core motivation.

-The inclusion of comprehensive ablation studies provides strong evidence for the contribution of each component of the proposed loss function. Furthermore, a mutual information analysis validates the utility of the proposed discrepancy feature, showing it is more informative than any of the original features.

-The concepts of inter-view and intra-view contrastive learning are well-motivated and supported by compelling UMAP visualizations (Figure 3) that show the learned representations have the desired properties of being separable between views and discriminative between subjects.

-The proposed method of leveraging large, external "normal" datasets to mitigate overfitting and single-center bias is a promising direction for improving model generalization.

-The core contribution of the Discrepancy Estimator—using the reconstruction error from a GAN-style model trained on external normal data as an explicit feature—is an original and clever way to encode and transfer knowledge about the "healthy" state.

Weaknesses

-The dataset choices seem slightly arbitrary, especially considering that the medical time-series domain has mature datasets like MIMIC with tens of thousands of associated papers.

-The overall loss function is a weighted combination of multiple terms, and the authors state they used a fixed weighting scheme without tuning. While this may show generality, the paper lacks an analysis of the model's sensitivity to these hyperparameters. It is unclear how performance might change with different weightings, which could be important for applying the method to new datasets.

-The proposed three-stage training process appears computationally expensive, requiring the pre-training of two separate models before final fine-tuning. The paper does not provide details on training times or overall computational cost, which is a minor weakness for reproducibility and assessing practical deployment feasibility.

-The authors acknowledge that using a single sequence-level reconstruction error as the discrepancy feature is "somewhat coarse". This simplification might discard finer-grained temporal information about a sample's abnormality.

-The naming of models in the results tables (e.g., "ACL*", "COMET DE*", "DAAC*") could be clearer. While the text implies their relationships, an explicit definition in the table captions would remove ambiguity for the reader trying to map the components of the ablation study to the final model.

-A key component, the Discrepancy Estimator, is contingent on the availability of a large, external dataset of purely normal subjects. This may not always be a feasible assumption, particularly for rare diseases or due to data-sharing restrictions. The paper does not explore the method's behavior if such a dataset is unavailable or small.

-The method assumes the external normal data and the target data "share similar characteristics". It does not address the potential risk of negative transfer if there is a significant domain shift between the source of the external data and the target data, which could cause the discrepancy feature to be misleading.

-The work builds heavily upon the hierarchical contrastive learning structure (subject, trial, epoch, temporal) from a recent paper, COMET. The paper is transparent about this, but it means the novelty lies in the two additions—the discrepancy estimator and the MHA-based adaptive views—rather than the entire hierarchical framework.

问题

  1. Regarding the sensitivity of the contrastive loss weights: The overall loss function is a weighted sum of five components with a fixed weight ratio of 1:1:1:1:2. You state that these weights were not tuned to demonstrate the method's generality. However, this choice warrants more justification. Could you please provide some intuition or an ablation study on the effect of these weights? For example, what is the reasoning for weighting the view-level loss (L_V) twice as much as the hierarchical losses? A sensitivity analysis on one of the datasets, showing how performance varies with different weightings, would significantly strengthen the paper by demonstrating the robustness of this design choice. This would increase my confidence that the method can be readily applied to new datasets without extensive tuning.

  2. Regarding the risk of negative transfer from external data: The Discrepancy Estimator relies on an external dataset of normal subjects to learn a representation of "health." This is a powerful idea but hinges on the assumption that the external and target datasets "share similar characteristics". Could you please discuss the potential risk of negative transfer if there is a significant domain shift between the external and target data (e.g., different sensor hardware, recording protocols, or patient demographics)? A discussion of this limitation, or ideally an experiment showing the method's performance when the external data source is intentionally mismatched (e.g., from a different public dataset), would be really useful about the practical deployment of this method.

  3. The design of the discrepancy feature: The choice to compute a single sequence-level reconstruction error as the discrepancy feature is justified by arguing it is more robust to noise than point-wise errors. While intuitive, this compresses a potentially rich error signal into a single scalar. Could you provide further justification for this design choice? For instance, did you experiment with alternative representations, such as a vector of channel-wise errors or a smoothed time-series of the reconstruction error itself?

  4. Were other more established datasets like MIMIC considered in the training and evaluation of this method?

局限性

The authors have made a commendable effort to be upfront about the limitations of their work. In the conclusion, they explicitly state that the discrepancy estimation is "suboptimal" and that the representation of discrepancy as a single value is "somewhat coarse". Furthermore, in their own checklist, they acknowledge that they did not provide detailed information about computational resources needed for reproducibility.

A key point to address would be the fairness and equity implications of the Discrepancy Estimator. Since the model's performance relies on an external dataset of "normal" subjects, it is critical to discuss the potential for bias if this external dataset is not demographically representative (e.g., in terms of age, sex, or ethnicity). For instance, the model could perform differently for underrepresented groups, leading to diagnostic disparities. Acknowledging this potential and discussing how to select or audit external datasets for fairness would significantly strengthen the paper's treatment of its societal impact.

最终评判理由

My primary concern regarding the method's limited applicability remains.

格式问题

N/A

作者回复

Dear Reviewer,

Thank you for your thoughtful and detailed comments. We have carefully addressed each of your concerns below and revised the manuscript accordingly.

Q1: Concern regarding whether more established datasets, such as MIMIC, were considered for evaluation.

A1: Thank you for this question. While MIMIC is a valuable dataset for EHR studies [1], our model, DAAC, is specifically designed for multi-level electrophysiological time-series (e.g., EEG, ECG) with inherent hierarchical structures (subject, trial, epoch, temporal). The selected AD, PTB, and TDBrain datasets exhibit these structures, providing a suitable testbed for DAAC. Their complexity and use in established benchmarks like COMET [2] ensure a proper and sufficient evaluation of our method's effectiveness (Appendix A.1).

Q2: Regarding the sensitivity of the contrastive loss weights.

A2: We appreciate this constructive comment and have added a detailed sensitivity analysis in Appendix G.2. To clarify, the 1:1:1:1 weights for the four hierarchical loss components follow prior work [2], which established this as a stable default. Our analysis in the added Appendix D2 further supports this configuration's robustness. For the view-level loss L_V, we conducted additional experiments varying its weight. Due to space limits we can only include part of the results here; more results are shown in the added Table D1. The key findings are:

  • Effectiveness: Including L_V significantly improves performance, confirming its necessity.
  • Saturation: Performance gains diminish as the weight increases, plateauing around a weight of 2.
  • Robustness: A weight of 2 provides consistently strong results across all datasets.

In summary, our proposed loss design, including the weight for L_V, is both effective and robust. We note that joint optimization of all weights is a promising direction for future work, which has been mentioned in the revision.

| Dataset | Setting | Weights | Accuracy | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| AD | 100 | 1,1,1,1,0 | 84.50±4.46 | 88.31±2.42 | 82.95±5.39 |
| AD | 100 | 1,1,1,1,1 | 88.50±4.20 | 90.20±3.60 | 87.30±4.90 |
| AD | 100 | 1,1,1,1,2 | 91.16±4.14 | 92.10±3.66 | 90.50±4.51 |
| AD | 100 | 1,1,1,1,3 | 91.02±3.15 | 92.06±3.54 | 90.58±5.11 |
| AD | 10 | 1,1,1,1,0 | 91.43±3.12 | 92.52±2.36 | 90.71±3.56 |
| AD | 10 | 1,1,1,1,1 | 92.30±2.20 | 92.80±2.10 | 92.10±2.50 |
| AD | 10 | 1,1,1,1,2 | 92.77±2.35 | 92.92±2.43 | 92.55±2.39 |
| AD | 10 | 1,1,1,1,3 | 92.86±2.44 | 92.33±3.61 | 92.63±2.01 |
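The fixed 1:1:1:1:2 weighting described above amounts to a simple weighted sum of the five contrastive components. The sketch below is our own illustration: the keys P/R/S/O/V mirror the ablation labels for the four hierarchical losses plus the view-level loss L_V, and the dictionary layout is an assumption, not the paper's code.

```python
# Fixed weights: 1 for each hierarchical loss (P, R, S, O), 2 for the
# view-level loss V, matching the 1:1:1:1:2 ratio in the table above.
WEIGHTS = {"P": 1.0, "R": 1.0, "S": 1.0, "O": 1.0, "V": 2.0}

def total_contrastive_loss(losses, weights=WEIGHTS):
    """Weighted sum of the five contrastive loss components."""
    return sum(weights[k] * losses[k] for k in weights)
```

Setting the "V" weight to 0 recovers the ablation row without L_V, and raising it beyond 2 corresponds to the saturating rows in the table.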

Q3: Details of training process.

A3: Additional training details are now in Appendix G.1. All experiments were run on an NVIDIA RTX 4090 GPU. Our training pipeline is efficient, and the final model is lightweight for deployment. We will release the full code upon acceptance.

| Dataset | Stage 1 Time | Stage 2 Time | Stage 3 Time | Deployment VRAM |
| --- | --- | --- | --- | --- |
| AD | 1.2h | 3.2h | 0.2h | 4.0 GB |
| PTB | 2.0h | 6.5h | 1.8h | 6.2 GB |
| TDBrain | 1.8h | 5.5h | 0.5h | 5.3 GB |

Q4: Using a single sequence-level reconstruction error as the discrepancy feature might be too coarse.

A4: We intentionally use a sequence-level error to ensure robustness during the initial training phase, as this coarse-grained signal prevents overfitting to noise or center-specific patterns from external data. While finer-grained measures like point-wise errors can be sensitive to physiological fluctuations and sensor artifacts [3], a sequence-level discrepancy captures global deviations more reliably. Our experiments (Appendix F) confirm that alternatives like channel-wise or point-wise errors yield inferior performance due to their sensitivity to local noise. This coarse feature acts as an auxiliary signal, and we designed the downstream Adaptive Contrastive Learner to subsequently extract fine-grained representations across multiple levels, ensuring detailed information is fully utilized.

| Discrepancy Feature | AD AUROC | TDBrain AUROC | PTB AUROC |
| --- | --- | --- | --- |
| Sequence-level MSE scalar (ours) | 98.03 | 97.63 | 95.56 |
| Channel-wise MSE vector | 97.36 | 97.58 | 95.44 |
| Point-wise error sequence | 96.71 | 96.86 | 95.37 |
| Cluster-wise distance | 96.28 | 96.44 | 94.95 |

Q5: Model names in the results tables are unclear.

A5: Thank you for the suggestion. We have revised the model names for consistency and provided clear definitions in the captions of Table 1 and Table 2.

Q6: The Discrepancy Estimator relies on large external normal datasets, which may be infeasible for rare diseases.

A6: This is an important point. (1) For the common diseases in our study (Alzheimer’s, Parkinson’s, etc.), collecting normal data is generally feasible and low-cost. (2) For rare diseases, we agree acquiring large-scale data is difficult. To assess this, we conducted an ablation study and found that even with only 5% of the healthy data, our model still achieves performance gains. This suggests the Discrepancy Estimator remains beneficial even in resource-constrained scenarios.

Q7: Could you discuss the potential risk of negative transfer from domain shift?

A7: We agree this is a key consideration. Our external and target datasets do exhibit domain shifts (e.g., average subject age of 63.6 vs. 72.5 years in the AD task [4, 5]). We observed signs of this negative transfer when applying a baseline model, where the AUROC dropped from 94.44 to 93.97 (Table 2). Despite this, DAAC still outperformed all baselines in the same setting. This demonstrates that our Adaptive Contrastive Learner (ACL) effectively mitigates the negative impact from domain-shifted data, showcasing the robustness of DAAC's multi-stage design. We have added this discussion to Appendix C3.

Q8: The novelty seems limited as the work builds heavily upon COMET.

A8: We acknowledge that we build upon COMET's framework. However, the core innovation of DAAC is the synergistic integration of the Discrepancy Estimator (DE) and the Adaptive Contrastive Learner (ACL) into a three-phase, curriculum-inspired framework. This design follows a coarse-to-fine progression:

  • Phase 1: DE learns a coarse, abnormality-aware signal from external data.
  • Phase 2: ACL uses this signal to dynamically discover and contrast fine-grained views.
  • Phase 3: The encoder is fine-tuned for the downstream task.

The synergy of these components is vital. Our ablation studies (Table 2) show that simply adding DE or ACL to COMET yields accuracies of 84.82% and 91.16% respectively, while the integrated DAAC framework achieves 93.23%. This demonstrates that the framework itself is a meaningful advancement.
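The three-phase hand-off can be made concrete with a toy sketch. Every component below is a deliberately simplistic numeric stand-in of our own (the actual DE is a GAN-enhanced encoder-decoder and the ACL is attention-based); the point is only how Phase 1's discrepancy signal flows into Phases 2 and 3.

```python
def train_discrepancy_estimator(external_normal):
    """Phase 1 stand-in: model 'normality' from external healthy data.
    The returned scorer plays the role of the reconstruction-error scalar."""
    mean = sum(external_normal) / len(external_normal)
    return lambda x: (x - mean) ** 2

def train_contrastive_encoder(augmented):
    """Phase 2 stand-in: self-supervised encoder over (sample, discrepancy)
    pairs; here just an identity mapping to a feature tuple."""
    return lambda x, d: (x, d)

def fine_tune(encoder, de, augmented, labels):
    """Phase 3 stand-in: pick a discrepancy threshold from labeled data
    (labels: 0 = normal, 1 = abnormal)."""
    normal_scores = [d for (_, d), y in zip(augmented, labels) if y == 0]
    abnormal_scores = [d for (_, d), y in zip(augmented, labels) if y == 1]
    thr = (max(normal_scores) + min(abnormal_scores)) / 2
    return lambda x: int(encoder(x, de(x))[1] > thr)

def run_pipeline(external_normal, target_data, labels):
    de = train_discrepancy_estimator(external_normal)   # Phase 1 (external data)
    augmented = [(x, de(x)) for x in target_data]       # attach discrepancy
    encoder = train_contrastive_encoder(augmented)      # Phase 2 (target data)
    return fine_tune(encoder, de, augmented, labels)    # Phase 3 (fine-tuning)
```

The sketch also makes the stated limitation visible: if `external_normal` is distribution-shifted relative to the target data, the Phase 1 scorer (and hence the discrepancy feature) is biased, which is the negative-transfer risk discussed in A7.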

Thank you again for your patience and invaluable feedback.

Best regards,

The Authors

References

[1] Johnson, A. E. W., et al. (2021). MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data.

[2] Wang, Y., et al. (2024). Contrast everything: A hierarchical contrastive framework for medical time-series. NeurIPS.

[3] Liu, Z., et al. (2023). Self-supervised contrastive learning for medical time series: A systematic review. Sensors.

[4] Miltiadous, A., et al. (2023). A dataset of EEG recordings from: Alzheimer’s disease, Frontotemporal dementia and Healthy subjects. OpenNeuro.

[5] Escudero, J., et al. (2006). Analysis of electroencephalograms in Alzheimer's disease patients with multiscale entropy. Physiological Measurement.

评论

Thanks for your reply! It solved some of my concerns. However, I am still not sure why MIMIC is not a proper dataset; doesn't it include ECG data? Thanks for the additional experiments on the combinations of different weights. However, I still believe a thorough ablation study on this part is important, as well as a discussion of the parameter selection.

评论

Dear Reviewer,

Thank you for your thoughtful feedback.

Previously, we examined the original MIMIC data and observed that patients often suffer from multiple conditions or comorbidities [1], which makes it difficult to isolate a sufficiently large and clean cohort for binary classification. Therefore, we initially did not include MIMIC in our experimental setup.

However, we have recently identified a study [2] that applies specific preprocessing steps to the MIMIC-IV-ECG dataset (target dataset) [3] and the ECG-ViEW II dataset (external dataset) [4] to enable prediction tasks for alcoholic liver disease. Based on this insight, we are conducting experiments on the MIMIC-IV-ECG and ECG-ViEW II datasets.

The full fine-tuning results on the MIMIC-IV-ECG dataset (using 100% labeled data) are presented in the table below:

| Models | Accuracy | Precision | Recall | F1 score | AUROC | AUPRC |
| --- | --- | --- | --- | --- | --- | --- |
| COMET | 80.02 ± 1.98 | 81.67 ± 1.72 | 78.14 ± 3.68 | 80.45 ± 3.22 | 89.05 ± 1.56 | 86.40 ± 2.82 |
| COMET+DE | 80.65 ± 2.06 | 82.01 ± 2.37 | 79.49 ± 4.59 | 81.67 ± 4.07 | 89.88 ± 1.94 | 87.30 ± 3.20 |
| COMET+ACL | 81.30 ± 2.03 | 82.89 ± 2.83 | 80.10 ± 4.09 | 82.73 ± 3.82 | 90.66 ± 2.27 | 88.02 ± 3.05 |
| DAAC | 82.50 ± 2.35 | 83.88 ± 1.86 | 81.39 ± 3.84 | 84.64 ± 3.27 | 91.56 ± 1.57 | 88.98 ± 3.49 |

The results demonstrate that our method outperforms COMET baseline even on the newly added MIMIC dataset. Due to time constraints, more detailed experimental tables and discussions will be updated in the forthcoming version of our revised paper.

Additionally, we appreciate your recognition of the additional experiments on different loss weight combinations. As detailed in the revised Appendix D.2 (Loss Weight Sensitivity Analysis) and Table D1, we conducted a comprehensive analysis and ultimately adopted the 1:1:1:1:2 configuration based on two key observations: the effectiveness of L_V, and performance saturation as the weight on L_V increases. For more detailed information about this experiment, please also refer to our response to Q2 in the rebuttal to Reviewer RMcN.

Regarding your request for more thorough ablation studies, we would like to highlight that we have included a detailed analysis in Appendix E.2 (Ablation Study of Contrastive Blocks). In this section, we systematically ablated the five contrastive components (P, R, S, O, V). The results show consistent performance improvements as each component is added, with the full configuration (P+R+S+O+V) achieving the best performance. This clearly demonstrates the effectiveness and necessity of the complete architecture. We sincerely appreciate your insightful suggestions and have further clarified these points in the revised manuscript.

Best regards,

The Authors

References:

[1] https://physionet.org/content/mimic-iv-demo/2.2/hosp/d_icd_diagnoses.csv.gz

[2] Alcaraz, J.M.L., Haverkamp, W. and Strodthoff, N., 2025. Electrocardiogram-based diagnosis of liver diseases: an externally validated and explainable machine learning approach. EClinicalMedicine, 84.

[3] Gow, B., Pollard, T., Nathanson, L. A., Johnson, A., Moody, B., Fernandes, C., ... & Horng, S. (2023). Mimic-iv-ecg: Diagnostic electrocardiogram matched subset. Type: dataset, 6, 13-14.

[4] Kim, Y. G., Shin, D., Park, M. Y., Lee, S., Jeon, M. S., Yoon, D., & Park, R. W. (2017). ECG-ViEW II, a freely accessible electrocardiogram database. PloS one, 12(4), e0176222.

评论

Thanks for your reply! I will adapt my score accordingly.

评论

Thank you for your detailed and thorough rebuttal. I appreciate the effort you have taken to address my concerns, particularly with the new experiments on hyperparameter sensitivity and the analysis of the discrepancy feature.

However, I will be retaining my original score. While your responses clarify many technical points, my primary concern regarding the method's limited applicability remains. Your rebuttal confirms that the framework is specifically designed for multi-level, hierarchically structured time-series like EEG and ECG. This specialization, while effective for the chosen datasets, leaves its broader utility for other, less-structured types of medical time-series data as an open and critical question. Consequently, I am not yet convinced of the work's significant impact beyond this specific niche.

评论

Dear Reviewer yeTd,

Thank you very much for your detailed and thoughtful feedback. We truly appreciate the time and effort you have taken to carefully review our work.

Regarding your concern on the method’s applicability, we fully acknowledge that our current framework is tailored for multi-level, hierarchically structured physiological bioelectrical signal data such as EEG and ECG. This design choice is deliberate, as such data is highly representative in many clinical contexts and presents unique modeling challenges. While our experiments focus on datasets with this structure, we believe the core idea—particularly the integration of the Discrepancy Estimator and Adaptive Contrastive Learner in a coarse-to-fine manner—could inspire adaptations to broader types of medical time-series. Exploring these generalizations is indeed an exciting direction for our future work.

Once again, we sincerely thank you for your thoughtful engagement and the time you have devoted to improving the quality of our paper. Have a lovely day.

Best regards,

The Authors

最终决定

This paper proposes a hierarchical contrastive learning framework for specific kinds of medical time series. The experimental results and the proposed approach's ability to handle limited labels are key strengths. I've taken a look at the reviews and the author/reviewer discussion comments for this paper. The final ratings and justifications are a little bit scattered (in terms of ratings: 1 accept, 1 borderline accept, 2 borderline reject). From what I can tell, the bulk of reviewer concerns were addressed by the authors. Reviewer yeTd's major lingering concern seems to be about limited applicability (if I'm not mistaken, the issue seems to be that the proposed approach is only for time series with specific structure such as EEG and ECG): I'm not as bothered by this especially as there are many researchers focused on EEG and ECG time series (I understand that there are a bunch of other time series that are very different in structure, such as what comes from many wearable sensors or, separately, EHR data like MIMIC), so that this work could be beneficial to the community of ML (and ML-adjacent) researchers focused on EEG or ECG. Regarding reviewer LgXM's main lingering concern that the authors did not explain the specific choice of transformer architecture, I find this to be a valid albeit relatively minor concern, and I am okay with the authors' response that they'll discuss other things they tried for model selection in the camera ready. For these reasons, I am recommending acceptance for this paper.