KSP: Kolmogorov-Smirnov metric-based Post-Hoc Calibration for Survival Analysis
We propose a KS-based calibration method for survival models that avoids discretization and improves calibration while preserving predictive accuracy across real-world datasets and models.
Abstract
Reviews and Discussion
The authors propose a post-processing method (KS metric-based post-processing, or KSP for short) based on the Kolmogorov-Smirnov statistic for improving calibration in survival models. Similar to Platt scaling for calibrated classification, KSP applies a linear transformation to the logits of the predicted survival time CDF with an exponential adjustment to the transformed CDF. The three hyperparameters specifying this transformation can be chosen via gradient descent to minimize an empirical version of the KS metric on a validation set. The authors then conduct extensive experiments across 6 base models and 10 datasets, comparing KSP with other post-processing and in-processing methods for survival calibration. They find that KSP outperforms competitor methods in most scenarios without sacrificing discriminative performance, as measured by the C-index.
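A minimal sketch of this kind of three-parameter, Platt-style post-hoc map together with a plug-in KS objective is shown below. The exact parameterization (an affine map on the logit of the predicted CDF plus an exponential tilt with trainable scalars a, b, c), the censoring handling, and all numeric values are illustrative assumptions, not the paper's definition.

```python
import torch

# Illustrative three-parameter post-hoc map applied to predicted CDF values
# F in (0, 1): affine transform on the logit scale followed by an exponential
# tilt. Monotone in F when a > 0; KSP's exact form may differ.
def transform_cdf(F, a, b, c):
    logit = torch.log(F) - torch.log1p(-F)   # logit of the predicted CDF
    G = torch.sigmoid(a * logit + b)         # Platt-style rescaling
    return G ** torch.exp(c)                 # exponential adjustment, stays in (0, 1)

# Simplified plug-in KS-style calibration loss: predicted CDF values at the
# observed times should look uniform, so penalize the largest gap between
# their empirical CDF and the identity. Censoring is ignored here for brevity.
def ks_calibration_loss(F_transformed):
    u, _ = torch.sort(F_transformed)
    n = u.shape[0]
    grid = torch.arange(1, n + 1, dtype=u.dtype) / n
    return torch.max(torch.abs(u - grid))

# Toy usage: fit a, b, c on "validation" CDF values by gradient descent.
torch.manual_seed(0)
F_val = torch.rand(500).clamp(1e-4, 1 - 1e-4)   # stand-in for model outputs
a = torch.ones(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
c = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([a, b, c], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ks_calibration_loss(transform_cdf(F_val, a, b, c))
    loss.backward()
    opt.step()
```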
Strengths and Weaknesses
Strengths
The paper is generally well-written and easy to follow.
The topic is relevant to the NeurIPS community and the contribution is solid. As the authors discuss, uncertainty quantification is important for the safe and effective application of ML techniques in high-stakes scenarios like healthcare, a domain in which survival analysis is particularly relevant. As the proposed method can be combined with any survival model which estimates a survival function, it is widely applicable.
The experimental validation is very extensive. The authors considered 60 total test scenarios (6 base models x 10 datasets) which cover a variety of model types and dataset sizes. The results are generally in favor of their method, and most of the time they obtain statistically significant improvements across a variety of calibration metrics as compared to the baseline methods. In scenarios where KSP underperforms, they give insightful discussion as to why this may be the case.
In addition to improved calibration, the method may also be less computationally expensive than other competitor methods for survival calibration.
I would also like to add that I reviewed an earlier version of this paper, and it is clear that the authors have made a substantial effort to address the weaknesses in the previous version.
Weaknesses
While KSP does frequently rank first in terms of the various calibration metrics as compared to the baselines, the absolute improvement is frequently not very large. For instance, in Table 12, the improvement of KSP over the second-best baseline is usually on the order of 10^-3 or 10^-4. Thus, the significance of the contribution (in the practical sense, not the statistical sense) may be somewhat limited.
There is a line of related work on proper scoring rules for survival analysis which is only briefly referenced in the paper (Yanagisawa, 2023 in the Related work section). The relationship is that proper scoring rules are loss functions which are minimized only by the ground truth survival function (at the population level), so implicitly the use of such a loss function will lead to calibration. It would strengthen the paper to include a model trained with such a loss function as one of the "uncalibrated" baselines to see how much gain is to be had from explicit calibration techniques.
Questions
Are there some situations where KSP provides a larger absolute improvement in calibration as compared to the baselines?
What is a typical value of B (i.e., number of iterations for KSP to converge)?
How does KSP compare to training a survival model with a proper scoring rule?
Why should the KM estimator be considered an empirical lower bound on calibration error (lines 223-224)?
Limitations
Some limitations are discussed, mainly in the Experiments section (Section 5).
Final Justification
I appreciate the authors' attempts to address my concerns, and the additional experiments were helpful. Unfortunately, it seems that the results on models trained using proper scoring rules are quite mixed. The positive correlation between the number of iterations B and the calibration error is also odd; one would hope that using more computational resources on the method would improve performance, but the opposite effect is observed. I also appreciate the clarification on the reason for considering the KM estimator as a lower bound on the calibration, but this actually raises another question, namely, why would one not use the KM estimator instead of any of the proposed methods if it is considered optimal?
Overall, it seems that KSP can indeed improve the calibration of pretrained survival models with little additional overhead, meaning that it may be useful in some settings, but I will refrain from enthusiastic acceptance due to the issues mentioned above.
Formatting Concerns
No major concerns.
We thank the reviewer for their thoughtful and constructive comments. We respond to each point below and will make all necessary clarifications and corrections in the final version.
While KSP does frequently rank first in terms of the various calibration metrics as compared to the baselines, the absolute improvement is frequently not very large. For instance, in Table 12, the improvement of KSP over the second-best baseline is usually on the order of 10^-3 or 10^-4. Thus, the significance of the contribution (in the practical sense, not the statistical sense) may be somewhat limited.
Thank you for the thoughtful comment. While it is true that the absolute magnitude of improvement in calibration error may appear small (e.g., on the order of 10^-3 or 10^-4), we believe these differences can still be meaningful for several reasons.
First, calibration error metrics are typically bounded between 0 and 1, and in well-trained models, absolute errors tend to be quite small. As shown in Table 12 of the main text, the CRPS model, when calibrated using KSP, achieves more than an 80% reduction in S-cal(20) and a 43% reduction in KS-cal, compared to calibration with CSD-iPOT. While the absolute improvements may seem modest, they are substantial in relative terms and reflect meaningful gains. Even modest improvements — especially in the low-error regime — can signify a meaningful reduction in systematic miscalibration. In safety-critical applications such as clinical risk prediction or reliability analysis, small but consistent calibration gains can be translated into increased trust in model decisions. Although improvements may appear modest for models with an already low calibration error, we believe that this is primarily a matter of scale.
Second, as seen in our experiments across diverse datasets and models, KSP consistently ranks among the top performers and often achieves the best results across multiple metrics. We believe this consistency, rather than a few large gains in specific cases, is a key strength of KSP.
Lastly, we emphasize that KSP achieves these improvements with minimal computational overhead and without relying on binning or sampling. This practical efficiency further supports the utility of our method, even if the absolute gains may sometimes appear modest. Nonetheless, we agree that future work could further investigate settings in which calibration improvements lead to tangible benefits in downstream decision-making.
There is a line of related work on proper scoring rules for survival analysis which is only briefly referenced in the paper (Yanagisawa, 2023 in the Related work section). The relationship is that proper scoring rules are loss functions which are minimized only by the ground truth survival function (at the population level), so implicitly the use of such a loss function will lead to calibration. It would strengthen the paper to include a model trained with such a loss function as one of the "uncalibrated" baselines to see how much gain is to be had from explicit calibration techniques.
We appreciate your suggestion. Both the use of proper scoring rule-based loss functions and our proposed post-processing method share the common goal of approximating the true survival function. Therefore, we agree that combining them could potentially yield a synergistic effect.
We are currently conducting experiments in this direction. However, based on our findings, the models trained with proper scoring rule-based loss already exhibit low calibration error, so applying our post-processing method tends to offer only marginal gains. We will continue our experiments, and if we identify any cases where the post-processing method provides meaningful improvement, we will share the results during the rebuttal period.
Are there some situations where KSP provides a larger absolute improvement in calibration as compared to the baselines?
Larger absolute improvements are especially evident for the CRPS model, which tends to show the largest calibration errors across most scenarios. As the CRPS model is not based on a discretized structure, KSP's explicit focus on minimizing calibration error appears to be particularly effective in this case.
What is a typical value of B (i.e., number of iterations for KSP to converge)?
The table below shows the number of iterations for each dataset and model, averaged over 30 repetitions.
\begin{array}{l|cccccc} \hline \textsf{Dataset} & \textsf{DeepSurv} & \textsf{MTLR} & \textsf{Parametric} & \textsf{CRPS} & \textsf{DeepHit} & \textsf{AFT} \\ \hline \textsf{WHAS} & 329 & 300 & 989 & 409 & 188 & 426 \\ \textsf{METABRIC} & 338 & 488 & 523 & 378 & 423 & 599 \\ \textsf{GBSG} & 388 & 673 & 468 & 528 & 606 & 508 \\ \textsf{NACD} & 640 & 646 & 389 & 612 & 383 & 378 \\ \textsf{NB-SEQ} & 272 & 269 & 685 & 958 & 330 & 1087 \\ \textsf{SUPPORT} & 306 & 635 & 1107 & 609 & 649 & 314 \\ \textsf{MIMIC-III} & 1310 & 1310 & 698 & 1352 & 395 & 1660 \\ \textsf{SEER-liver} & 517 & 161 & 1288 & 1484 & 131 & 1362 \\ \textsf{SEER-stomach} & 431 & 169 & 1317 & 1578 & 184 & 1249 \\ \textsf{SEER-lung} & 430 & 114 & 1231 & 1199 & 112 & 1344 \\ \hline \end{array}
As discussed in the paper, we observed a tendency for the calibration error to increase as B increases. When computing the Spearman rank correlation between B and KS-cal, the result was 0.234.
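For reference, such a rank correlation is typically computed as in the sketch below; the arrays are placeholders, not the actual per-scenario values from the experiments.

```python
from scipy.stats import spearmanr

# Placeholder inputs: the per-scenario iteration counts B and the matching
# KS-cal values would be substituted here.
iterations_B = [300, 450, 600, 900, 1100, 1400]
ks_cal = [0.030, 0.034, 0.033, 0.041, 0.039, 0.047]   # illustrative values only

rho, p_value = spearmanr(iterations_B, ks_cal)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```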
Why should the KM estimator be considered an empirical lower bound on calibration error (lines 223-224)?
As claimed in the CSD and CSD-iPOT papers, the KM estimator is considered an optimally calibrated estimator. Although it does not always represent a lower bound, we believe it serves as a suitable reference point. We acknowledge that we only mentioned this without explicitly citing it, and we will add the corresponding references.
We would like to share representative results from three datasets categorized by size (Small, Medium, and Large). The loss function used to train the baseline model was the 'Cen-log' loss described in the work by [1], which we selected based on its strong calibration performance reported in the paper.
As previously mentioned, there were cases where the baseline calibration error was already low, but this was not always the case. The examples we present here are three representative cases, selected to illustrate typical patterns observed across datasets of different sizes.
As argued in the main paper, KSP tends to outperform CSD and CSD-iPOT in terms of both D-cal(20) and KS-cal. We will include a proper scoring rule-based baseline in the revised manuscript to reflect this experimental addition.
\begin{array}{l|cccccc} \hline \textsf{Method} & \textsf{C-index} & \textsf{S-cal(20)} & \textsf{D-cal(20)} & \textsf{KS-cal} & \textsf{KM-cal} & \textsf{IBS} \\ \hline \textsf{Non-calibrated} & 0.74919 & 0.000301 & 0.001571 & 0.042645 & 0.004205 & 0.150402 \\ \textsf{KSP} & 0.74921 & 0.000294 & 0.001642 & 0.035677 & 0.004165 & 0.150555 \\ \textsf{CSD} & 0.74910 & 0.000538 & 0.002498 & 0.046158 & 0.002250 & 0.146773 \\ \textsf{CSD-iPOT} & 0.74908 & 0.000296 & 0.002095 & 0.038488 & 0.004347 & 0.150301 \\ \hline \end{array}
\begin{array}{l|cccccc} \hline \textsf{Method} & \textsf{C-index} & \textsf{S-cal(20)} & \textsf{D-cal(20)} & \textsf{KS-cal} & \textsf{KM-cal} & \textsf{IBS} \\ \hline \textsf{Non-calibrated} & 0.60409 & 0.000780 & 0.003335 & 0.154549 & 0.010493 & 0.207236 \\ \textsf{KSP} & 0.60328 & 0.000359 & 0.002069 & 0.035167 & 0.006972 & 0.204080 \\ \textsf{CSD} & 0.60364 & 0.000181 & 0.002097 & 0.031493 & 0.000968 & 0.197354 \\ \textsf{CSD-iPOT} & 0.60324 & 0.000352 & 0.003450 & 0.035793 & 0.007513 & 0.203774 \\ \hline \end{array}
\begin{array}{l|cccccc} \hline \textsf{Method} & \textsf{C-index} & \textsf{S-cal(20)} & \textsf{D-cal(20)} & \textsf{KS-cal} & \textsf{KM-cal} & \textsf{IBS} \\ \hline \textsf{Non-calibrated} & 0.63482 & 0.000351 & 0.000558 & 0.037989 & 0.002182 & 0.142298 \\ \textsf{KSP} & 0.63479 & 0.000305 & 0.000591 & 0.039045 & 0.001896 & 0.142285 \\ \textsf{CSD} & 0.63493 & 0.000177 & 0.004724 & 0.048546 & 0.000799 & 0.140735 \\ \textsf{CSD-iPOT} & 0.63493 & 0.000174 & 0.005254 & 0.051738 & 0.002790 & 0.142597 \\ \hline \end{array}
[1] Yanagisawa, H. (2023). Proper scoring rules for survival analysis.
Thanks to the authors for their response and for the additional results using proper scoring rules. Unfortunately, it seems that the results on models trained using proper scoring rules are quite mixed. The positive correlation between the number of iterations B and the calibration error is also odd; one would hope that using more computational resources on the method would improve performance, but the opposite effect is observed. I also appreciate the clarification on the reason for considering the KM estimator as a lower bound on the calibration, but this actually raises another question, namely, why would one not use the KM estimator instead of any of the proposed methods if it is considered optimal? Overall, I am inclined to maintain my score.
Thank you for your thoughtful follow-up and for engaging with our additional results. Below we address each of your points in turn:
While the results with proper scoring rules may appear mixed across datasets, we would like to highlight a consistent pattern we observed. When the baseline calibration error is large, KSP consistently reduces the error more effectively than CSD or CSD-iPOT (or to a similar level as those two methods). In contrast, when the baseline calibration error is already small, KSP tends to maintain that low level, whereas CSD and CSD-iPOT sometimes increase the calibration error, especially in D-cal(20).
This suggests that KSP adapts well to the calibration level of the base model, providing the necessary correction without over-adjustment. We note that KSP tends to preserve local discrepancies even while making relatively minimal changes to the original distribution. This robustness is one of the key strengths of KSP.
We appreciate your observation regarding the unexpected trend associated with increasing the number of iterations B. One possible explanation is that, when the baseline model has a higher calibration error, a larger learning rate might be necessary for effective convergence. In our current experiments, we kept the learning rate fixed across all experiments (though we used the Adam optimizer), which may have limited the benefit of additional iterations. We believe there is room for improvement through adaptive learning-rate tuning, and we will consider this in future work.
The KM estimator is known to be both KM-calibrated and D-calibrated under certain assumptions (exchangeability, conditionally independent censoring, and strict monotonicity of the KM estimate). Although it lacks any discrimination ability, it can be considered optimal from a calibration standpoint.
Although it would be theoretically possible that another nonparametric estimator could achieve better calibration, the KM estimator remains a widely used and computationally simple method, making it a natural reference point for desirable calibration levels.
In this sense, the KM estimator provides a useful benchmark. If a post-processing method can achieve calibration performance close to the KM estimator while preserving discrimination, it demonstrates a compelling advantage. This is precisely the goal of methods like KSP.
This theoretical result is detailed in Appendix B of [2].
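For completeness, the standard Kaplan-Meier (KM) estimator referred to in this discussion is
\begin{eqnarray*}
\hat{S}_{\mathrm{KM}}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right),
\end{eqnarray*}
where the $t_i$ are the distinct observed event times, $d_i$ is the number of events at $t_i$, and $n_i$ is the number of subjects at risk just before $t_i$. Because it estimates the marginal event-time distribution directly, its predicted probabilities are well calibrated in the sense above, but it assigns every subject the same curve and therefore has no discriminative ability, which is the trade-off described here.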
We hope that our clarifications might provide additional context worth reconsidering.
[2] Qi, S. A., Yu, Y., & Greiner, R. (2024). Conformalized survival distributions: A generic post-process to increase calibration.
We sincerely appreciate your constructive feedback and will revise the manuscript accordingly.
A new calibration method is presented, which is based on the Kolmogorov-Smirnov statistic. It aims to avoid computational limitations and the reliance on binning of some earlier methods.
Strengths and Weaknesses
- As all approaches considered are basically heuristics, it would have been good to try and understand which method performs best in what situation. Currently, some benchmark datasets are used, but carefully designed synthetic data could provide much better insight into the inner behavior of the various methods.
- The suggested method overcomes limitations of earlier methods due to computational complexity and discretization. It is, however, unclear how those earlier methods would perform, in principle, if we could have (close to) unlimited compute and the binning can be coarsened arbitrarily. Are the reported gains primarily the result of computational improvements and the continuous formulation or is there more?
- I think the part of the paper leading up to Theorem 3.1 isn't clear to me, which probably makes me underappreciate the claim in Theorem 3.1. For one, I don't readily see that "Eqn. (3) is equivalent to (...)". Also, the text says "We begin with the simplified setting without covariates," but then the general result still follows. The paragraph at line 133 offers some observations and links to other methods, but it is unclear to me what I should take away. In the actual formulation of the theorem, (and maybe this is a silly question but) what does it mean for "calibration to hold"? Ultimately, the theorem is an asymptotic result and doesn't necessarily say anything about the finite sample behavior of KS-cal.
- There are some statements that sound appealing, but remain unsubstantiated, as far as I can see. Some examples:
  - Abstract: "Existing approaches (...) often rely on heuristic binning or nonparametric estimators, which undermine their adaptability to continuous-time settings and complex model outputs." (Besides, KS is also a nonparametric estimator.)
  - Page 1: "but their reliance on fixed sampling schemes or predefined percentiles may limit adaptability."
  - Page 9: "KSP offers a theoretically grounded framework that avoids the limitations of bin-based or sampling-dependent approaches."
- I find related work sections "after the fact" not very helpful. Apparently, these are works that did not inspire the current contribution, nor are they important to put the work in context nor are they put in a different context based on the results in this contribution. I believe this whole section should be taken out, or whatever is really relevant should be moved to the front.
- A critical discussion of the pros and cons of this new method in the context of earlier methods and results is lacking. In addition, none of the related works are discussed in the new context provided by the proposed method.
Questions
- How exactly do the theoretical results presented provide guarantees for KSP's performance? It is unclear how the theoretical results presented carry over to the actual finite-sample behavior of the method.
- When can we expect the current method to work best and under what settings will one or more of the other methods probably outperform the current one?
Limitations
Yes.
Formatting Concerns
None
We thank the reviewer for their thoughtful and constructive comments. We respond to each point below and will make all necessary clarifications and corrections in the final version.
As all approaches considered are basically heuristics, it would have been good to try and understand which method performs best in what situation. Currently, some benchmark datasets are used, but carefully designed synthetic data could provide much better insight into the inner behavior of the various methods.
The SUPPORT dataset contains more than 100 tied events at multiple time points, with a particularly severe accumulation of ties at early time points. In the GBSG dataset, approximately 19% of the observations are censored at the final recorded time. Due to the presence of heavy ties and skewness in both datasets, the performance of KSP appears inferior to that of quantile-based approaches, such as CSD and CSD-iPOT. We believe that the performance of KSP could be improved by incorporating tie-breaking strategies or reformulating the method to account for ties explicitly.
In addition, we observe that KSP tends to perform more effectively when applied to datasets with fewer ties, to models that are not discretized, and in settings where the calibration error is relatively large. As illustrated in several calibration plots, KSP also appears to capture tail probabilities better. This may be attributed to its objective of minimizing the maximum discrepancy, which encourages uniformly improved calibration across the entire range of probabilities, regardless of the interval. To gain a clearer understanding of these behaviors, it may be valuable to conduct additional controlled experiments on synthetic data that allow us to systematically vary key factors such as the degree of censoring or the presence of ties. We will further elaborate on this issue in the conclusion.
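As a concrete starting point for such controlled experiments, the sketch below generates synthetic right-censored data with a tunable censoring level and an optional rounding step to induce ties. The distributional choices (exponential event and censoring times, two Gaussian covariates) and all parameter values are illustrative assumptions, not the setups used in the paper.

```python
import numpy as np

def make_synthetic_survival(n=2000, censor_scale=1.5, tie_grid=None, seed=0):
    """Generate right-censored survival data with controllable censoring and ties.

    censor_scale : larger values -> less censoring (exponential censoring times).
    tie_grid     : if set (e.g. 0.5), times are rounded to this grid, which
                   creates heavy ties similar to those discussed for SUPPORT.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))                      # two covariates
    rate = np.exp(0.7 * x[:, 0] - 0.4 * x[:, 1])     # covariate-dependent hazard
    t_event = rng.exponential(1.0 / rate)            # true event times
    t_cens = rng.exponential(censor_scale, size=n)   # censoring times
    time = np.minimum(t_event, t_cens)
    event = (t_event <= t_cens).astype(int)          # 1 = observed, 0 = censored
    if tie_grid is not None:
        time = np.maximum(tie_grid, np.round(time / tie_grid) * tie_grid)
    return x, time, event

# Example: vary censoring and ties roughly independently.
x, time, event = make_synthetic_survival(censor_scale=0.8, tie_grid=0.5)
print("censoring rate:", 1 - event.mean(), "unique times:", np.unique(time).size)
```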
The suggested method overcomes limitations of earlier methods due to computational complexity and discretization. It is, however, unclear how those earlier methods would perform, in principle, if we could have (close to) unlimited compute and the binning can be coarsened arbitrarily. Are the reported gains primarily the result of computational improvements and the continuous formulation or is there more?
We believe there are three primary reasons for the superiority of KSP.
One of the main advantages of KSP lies in its continuous formulation, which allows it to achieve calibration without relying on predefined intervals. This formulation makes it fundamentally different from binning-based approaches and enables more consistent calibration behavior across the probability space.
In addition, KSP offers significantly lower computational cost, especially on large-scale datasets. Even with unlimited computing resources, KSP remains far more efficient due to its formulation, which avoids costly sampling or discretization procedures. By approximating the supremum of the calibration error, KSP also has the potential to promote uniform calibration across the entire probability range.
Another important strength of KSP lies in its explicit optimization of the calibration error. This makes the method more interpretable and accessible for practitioners, as it directly aligns with the calibration objective. In contrast, conformal prediction-based approaches rely on implicit calibration driven by theoretical guarantees under assumptions such as exchangeability. When these assumptions do not hold in practice, the reliability of the calibration guarantee may deteriorate.
I think the part of the paper leading up to Theorem 3.1 isn't clear to me, which probably makes me underappreciate the claim in Theorem 3.1. For one, I don't readily see that "Eqn. (3) is equivalent to (...)". Additionally, the text states, "We begin with the simplified setting without covariates," yet the general result still applies. The paragraph at line 133 offers some observations and links to other methods, but it is unclear to me what I should take away. In the actual formulation of the theorem, (and maybe this is a silly question but) what does it mean for "calibration to hold"? Ultimately, the theorem is an asymptotic result and doesn't necessarily say anything about the finite sample behavior of KS-cal.
Sorry for the confusion. Eqn. (2) is a sample mean in which each summand has the same distribution as the random quantity inside the brackets of Eqn. (3); therefore, the expectation of (2) is equal to that of the bracketed quantity in Eqn. (3). Also, on line 133, we merely describe the shape of the estimator, which has jumps at uncensored points and is linearly interpolated at censored points. This mixed form is the main characteristic that avoids any dependence on fixed bins.
Furthermore, we will provide more detailed explanations for the theorem. Due to space limitations, we merged two theorems into one, which may cause some confusion: the first concerns the case without covariates (i.e., independent and identically distributed observations), and the second the case with covariates (i.e., independent but not identically distributed observations). Without covariates, the predicted CDF values at the observed times following the c.d.f. of the uniform distribution implies calibration. When covariates are considered, calibration is defined as in (3), with an additional expectation over the covariates. The theorem states that KS-cal goes to 0 if and only if calibration holds. If the space limit of the paper is relaxed, we can state the two theorems separately; the definitions of calibration with and without covariates are closely related but not identical.
In addition, the finite-sample behavior can be characterized by letting $a_N$ be a sequence depending on $N$, since we used a large-deviation bound such as Bernstein's inequality in the proof. More rigorously, with probability at least $1 - \delta_N$, there exists $a_N$ such that
\begin{eqnarray*}
\vert \tilde{F}(x) - x \vert < a_N / \sqrt{N},
\end{eqnarray*}
where $a_N$ and $\delta_N$ are large and small, respectively.
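To make the mixed-type estimator described above concrete, the sketch below builds one such empirical CDF of predicted probabilities, with point masses at uncensored observations and mass spread linearly over the remaining probability range for censored ones (one common treatment of censoring in D-calibration-style metrics), and evaluates the maximal deviation from the uniform CDF. The exact construction and weighting used in the paper may differ.

```python
import numpy as np

def ks_calibration(F_obs, event, grid_size=1000):
    """KS-type calibration error of predicted CDF values.

    F_obs : predicted CDF evaluated at each subject's observed time.
    event : 1 if the event was observed, 0 if censored.
    Uncensored subjects contribute a jump at F_obs; censored subjects spread
    their mass linearly over (F_obs, 1], giving the piecewise-linear part.
    """
    F_obs = np.asarray(F_obs, dtype=float)
    event = np.asarray(event, dtype=bool)
    xs = np.linspace(0.0, 1.0, grid_size)
    n = len(F_obs)
    # Uncensored contribution: indicator {F_obs <= x}.
    unc = (F_obs[event, None] <= xs[None, :]).sum(axis=0)
    # Censored contribution: (x - F_obs) / (1 - F_obs), clipped to [0, 1].
    Fc = F_obs[~event, None]
    cen = np.clip((xs[None, :] - Fc) / np.maximum(1.0 - Fc, 1e-12), 0.0, 1.0).sum(axis=0)
    F_tilde = (unc + cen) / n
    return np.max(np.abs(F_tilde - xs))

# Toy check: perfectly calibrated, uncensored predictions give a small statistic.
rng = np.random.default_rng(0)
u = rng.uniform(size=5000)
print(ks_calibration(u, np.ones_like(u)))   # close to 0 for large samples
```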
There are some statements that sound appealing, but remain unsubstantiated, as far as I can see. Some examples:
Abstract: "Existing approaches (...) often rely on heuristic binning or nonparametric estimators, which undermine their adaptability to continuous-time settings and complex model outputs." (Besides, KS is also a nonparametric estimator.)
Thank you for the comment. We agree that KS itself is a nonparametric estimator. Our point was to highlight that KM-based methods rely on an additional nonparametric estimator, namely the KM estimator, as part of their calibration procedure. Unlike our approach, which evaluates the discrepancy between $\tilde{F}(x)$ and the theoretical target $x$, KM-based methods assess calibration using the KM estimator. This reliance can introduce additional variance or instability in the calibration process. We will revise the sentence to clarify this distinction more explicitly.
Page 1: "but their reliance on fixed sampling schemes or predefined percentiles may limit adaptability."
In contrast to KSP, prior methods are designed based on predefined quantities such as the number of quantiles, which limits their flexibility and adaptability. We intended to emphasize this distinction, and we will make an effort to improve the clarity of this point in the revised version.
Page 9: "KSP offers a theoretically grounded framework that avoids the limitations of bin-based or sampling-dependent approaches."
We acknowledge that other methods also have solid theoretical foundations. We intended to highlight that KSP avoids the need for additional components, such as binning in X-cal or KM sampling in CSD. We apologize that the original phrasing may have caused confusion, and we will revise the sentence to reflect our intended meaning better.
I find related work sections "after the fact" not very helpful.
In the “Related Work” section, our intention was to highlight notable general studies on evaluating survival predictions, as well as specific works on in-processing and post-processing methods for survival prediction. We agree with the reviewer that some of these references may not be directly relevant to our current work. As suggested, we will move the “Related Work” section to an earlier part of the paper and remove those references that are not closely related to the present study.
. A critical discussion in which the pros and cons of this new method in the context of earlier methods and results is lacking. In addition, none of the related works are discussed in the new context provided by the proposed method.
The merit of our algorithm lies in its use of a continuous metric, such as KS, which offers a stronger foundation than bin-wise approaches. Despite its theoretical rigor, the computational burden remains low, and empirical results demonstrate clear improvements in calibration. One limitation is that the post-processing transformation may have a somewhat lower representational capacity. However, in practice, our results consistently validate the effectiveness of our method, suggesting potential for further enhancement in future work. We also apologize for the earlier confusion in articulating the main motivation behind KSP. It is indeed inspired by the works of [1] and [2], and we will revise the manuscript to highlight this point.
[1] Fernández, T., Gretton, A.. A maximum-mean-discrepancy goodness-of-fit test for censored data.
[2] Gupta, K. et al. Calibration of Neural Networks using Splines.
The formula of ... can be replaced by ...
This paper proposes a post-processing method to improve calibration in survival models. Existing approaches often rely on heuristic binning, nonparametric estimators, or incur significant computational cost. Inspired by Platt scaling, the proposed method introduces a transformation function optimized via backpropagation to minimize a Kolmogorov-Smirnov (KS) version of D-calibration. Empirical results demonstrate that the proposed method outperforms existing techniques in terms of both calibration performance and computational efficiency.
Strengths and Weaknesses
Strengths:
- The motivation is clearly presented.
- The empirical section is extensive and includes useful ablation studies.
- The paper is generally well-written and accessible.
Weaknesses:
- The paper claims that the KS test has been underutilized in survival calibration, which is misleading. Numerous adaptations of the KS test for right-censored data already exist and are covered in standard survival analysis textbooks and literature [1–3]. Furthermore, the proposed use of a KS-based loss function resembles techniques based on Cox–Snell residuals and their calibration assessments [4, 5]. A more thorough discussion of these related methods and their differences from the proposed approach is essential.
- While the motivation and background are well-explained, the main methodological section (Section 4) is surprisingly brief. Although simplicity can be a virtue, the proposed method currently appears heuristic and lacks theoretical depth. Several technical questions remain unanswered:
  - Can the transformation function in line 5 of the algorithm approximate arbitrary survival functions (i.e., is it identifiable)?
  - Is the objective function (when combined with the KS loss) convex or well-behaved during optimization?
  - Why are exactly three trainable parameters sufficient, and under what conditions? Table 25 shows that 3-parameter is better than 0/1/2, but the justification needs to be made more rigorous for 4+ parameters.
  - Is there any formal guarantee on calibration, as provided in recent conformal-based approaches?
- The evaluation uses 20 bins, while the competing methods (e.g., CSD and CSD-iPOT) were originally designed or evaluated using 10 bins. This discrepancy could influence the relative performance. A fair comparison using consistent bin counts is necessary.
- Proposition 4.1 assumes the non-crossing property of survival curves, which limits its applicability. The statement that “this condition is satisfied by many commonly used models” is inaccurate. In practice, only CoxPH and WeibullAFT (and their NN-extensions) satisfy this condition. This limitation should be discussed more.
- Figures 3 and 15–24 use histograms to compare calibration performance, but these are difficult to interpret and obscure the actual differences between methods. For instance, in Figure 3’s leftmost panel, the CSD method appears slightly better than the proposed one. A QQ plot might better illustrate deviations from perfect calibration. Also, the small size of panels in Figure 4 makes error bars and trends hard to see.
- The transformation function is described as “monotonic,” but this should be refined to “strictly monotonic.” A merely monotonic function may contain flat regions, leading to tied CDF values and potentially compromising evaluation metrics such as the time-dependent concordance index. Since all functions evaluated are strictly monotonic, this is a minor but important clarification.
[1] Chapter 7, Survival Analysis Techniques for Censored and Truncated Data. Klein and Moeschberger
[2] Modified Kolmogorov-Smirnov Test Procedures with Application to Arbitrarily Right-Censored Data. Biometrics, 1980
[3] Two-Sample Tests of Cramér--von Mises- and Kolmogorov--Smirnov-Type for Randomly Censored Data. International Statistical Review. 1984.
[4] An evaluation of the Cox-Snell residuals. Elin Ansin. Master Thesis
[5] Randomized Survival Probability Residual for Assessing Parametric Survival Models. Tingxuan Wu. Master Thesis
Questions
- Why is Park et al.’s in-processing method not included in the comparison?
- What is the purpose of the left panel in Figure 2, which aggregates performance across all datasets and models? It shows little difference among the four methods, including the uncalibrated one. Does this imply that in most settings, the original models are already well-calibrated, and post-processing yields only marginal improvements?
- The grid used for tuning the regularization parameter is large: {1, 10, 100, 1000}. With such strong regularization, wouldn’t the parameters shrink toward 0s, effectively yielding little adjustment from the original CDF? Again, this raises the question of whether the original models are already sufficiently calibrated.
Limitations
No limitation is discussed in the paper.
Final Justification
Thanks to the author for their rebuttal. My concerns are mostly resolved.
Formatting Concerns
N/A
We thank the reviewer for their thoughtful and constructive comments. We respond to each point below and will make all necessary clarifications and corrections in the final version.
The paper claims that the KS test has been underutilized in survival calibration, which is misleading. Numerous adaptations of the KS test for right-censored data already exist and are covered in standard survival analysis textbooks and literature [1–3]. Furthermore, the proposed use of a KS-based loss function resembles techniques based on Cox–Snell residuals and their calibration assessments [4, 5]. A more thorough discussion of these related methods and their differences from the proposed approach is essential.
Thank you for the insightful comment and for providing the five references. When we described the KS test as underutilized, what we meant was that in the context of survival analysis, there have been no existing methods specifically designed to reduce calibration error based on the KS metric. We recognize that our original wording may have caused confusion, and we apologize for any inconvenience this may have caused.
In fact, we have previously used Cox-Snell residuals in our experiments. However, since Cox-Snell residuals are based on the cumulative hazard function, they are often less straightforward to use than CDFs. In addition, for censored observations, Cox-Snell residuals typically rely on surrogate adjustments (such as adding a constant to the residual), which motivated us to explore alternative formulations. We appreciate the references provided and will ensure that we cite and discuss them in our revised manuscript.
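For context, in generic notation (with hats denoting fitted quantities, which is an assumption of this note rather than the paper's notation), the Cox-Snell residual of subject $i$ is the fitted cumulative hazard at the observed time; under a correctly specified model it is unit exponential for uncensored observations, which is equivalent to the fitted CDF value being uniform:
\begin{eqnarray*}
r_i = \hat{H}(t_i \mid x_i), \qquad r_i \sim \mathrm{Exp}(1) \;\Longleftrightarrow\; 1 - e^{-r_i} = \hat{F}(t_i \mid x_i) \sim \mathrm{Unif}(0, 1).
\end{eqnarray*}
For censored subjects the observed residual is only a lower bound, which is why the surrogate adjustments mentioned above are needed and why a CDF-based formulation can be more convenient.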
While the motivation and background are well-explained, the main methodological section (Section 4) is surprisingly brief. Although simplicity can be a virtue, the proposed method currently appears heuristic and lacks theoretical depth. Several technical questions remain unanswered:
Can the transformation function in line 5 of the algorithm approximate arbitrary survival functions (i.e., is it identifiable)?
KSP is a one-to-one transformation and thus identifiable, which gives it a clear structural advantage. While we acknowledge that this is still an early stage of research, the current formulation of KSP has demonstrated strong and consistent performance across various models. We believe this formulation offers a solid foundation, and we are optimistic about its ability to approximate a wide range of survival functions. Nevertheless, we recognize that further theoretical and empirical exploration will help us better understand its capacity and limitations.
Is the objective function (when combined with the KS loss) convex or well-behaved during optimization?
Although the KS loss is not convex, its behavior is mild under gradient-based optimization, and it converges reliably in practice. Compared to X-cal, KS loss is slightly faster and does not require a surrogate loss function, which simplifies implementation. One important consideration, however, is that using a very small batch size can lead to noticeable approximation error. Therefore, selecting an appropriate batch size is crucial for achieving stable performance.
Why are exactly three trainable parameters sufficient, and under what conditions? Table 25 shows that 3-parameter is better than 0/1/2, but the justification needs to be made more rigorous for 4+ parameters.
Thanks for your comment. It is indeed possible to introduce more than four parameters while preserving the time-dependent C-index. However, based on our preliminary experiments using deep neural networks having numerous parameters, we observed a tendency toward overfitting, indicating that careful design and regularization are needed. We will include additional experimental results related to this observation in the revised version.
Is there any formal guarantee on calibration, as provided in recent conformal-based approaches?
Yes. While we did not emphasize this in the paper, by letting $a_N$ be a sequence that depends on $N$, we can infer the convergence rate from the proofs in the appendix. It shows that $\tilde{F}$ converges (to the identity) at a rate of order $1/\sqrt{N}$, up to the factor $a_N$. We agree that highlighting this point in the paper would strengthen the presentation, and we will revise the manuscript accordingly.
The evaluation uses 20 bins, while the competing methods (e.g., CSD and CSD-iPOT) were originally designed or evaluated using 10 bins. This discrepancy could influence the relative performance. A fair comparison using consistent bin counts is necessary.
To evaluate the metrics that depend on bin locations, we used 10 equally sized bins: $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1]$. The table below reports only the D-calibration metrics, following the format of Table 1 in the main text, where the number of wins for each method is counted. Under this 10-bin setting, somewhat surprisingly, KSP still outperforms CSD in most cases, although with a reduced number of wins. It is worth noting that, with 10 bins, the bin boundaries exactly align with the quantile locations used during inference. This alignment tends to favor quantile-based methods such as CSD and may not provide an equally favorable setting for other methods.
A truly well-calibrated method should maintain consistent performance regardless of the specific binning scheme. Therefore, using bins that do not align with a model’s inference quantiles can provide a fairer basis for comparison. Since the X-cal paper adopted 20 bins of width 0.05, we followed the same configuration in our main experiments.
\begin{array}{l|ccccc} \hline \textsf{Method} & \textsf{S-cal(10)} & \textsf{D-cal(10)} & \textsf{S-cal(20)} & \textsf{D-cal(20)} & \textsf{KS-cal} \\ \hline \textsf{KSP} & \textsf{48 (45)} & \textsf{46 (43)} & \textsf{46 (45)} & \textsf{46 (43)} & \textsf{47 (45)} \\ \textsf{Non-calibrated} & 12 (8) & 14 (11) & 13 (7) & 14 (6) & 13 (5) \\ \textsf{Tie} & 0 & 0 & 1 & 0 & 0 \\ \hline \textsf{KSP} & \textsf{36 (28)} & \textsf{32 (29)} & \textsf{36 (29)} & \textsf{48 (45)} & \textsf{51 (42)} \\ \textsf{CSD} & 24 (19) & 28 (26) & 24 (19) & 12 (10) & 9 (8) \\ \textsf{Tie} & 0 & 0 & 0 & 0 & 0 \\ \hline \end{array}
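For readers who want to reproduce this kind of sensitivity check, below is a minimal sketch of a bin-based calibration statistic with a configurable number of bins. It ignores censoring for brevity, whereas the S-cal and D-cal variants used in the experiments handle censored observations explicitly, so it is only an illustration of the binning dependence, not the paper's metric.

```python
import numpy as np

def binned_calibration_error(F_obs, n_bins=10):
    """Sum of squared deviations of bin frequencies from the uniform target.

    F_obs : predicted CDF values at observed event times (censoring ignored here).
    A perfectly calibrated model puts a fraction 1/n_bins of the values in each
    bin, so the statistic shrinks toward 0; its value generally changes with
    n_bins, which is the sensitivity discussed above.
    """
    F_obs = np.asarray(F_obs, dtype=float)
    counts, _ = np.histogram(F_obs, bins=n_bins, range=(0.0, 1.0))
    props = counts / len(F_obs)
    return float(np.sum((props - 1.0 / n_bins) ** 2))

rng = np.random.default_rng(0)
u = rng.uniform(size=2000)
print(binned_calibration_error(u, n_bins=10), binned_calibration_error(u, n_bins=20))
```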
Proposition 4.1 assumes the non-crossing property of survival curves, which limits its applicability. The statement that “this condition is satisfied by many commonly used models” is inaccurate. In practice, only CoxPH and WeibullAFT (and their NN-extensions) satisfy this condition. This limitation should be discussed more.
We acknowledge that the statement may have been overstated. Among the models we used, only two strictly satisfy the required condition. However, as shown in our experimental results, the time-independent C-index is nearly preserved even for the other models. Regardless of the discrimination measure used, KSP does not significantly degrade discrimination power. We agree that this point deserves further discussion and will revise the corresponding section to reflect this nuance better.
Figures 3 and 15–24 use histograms to compare calibration performance, but these are difficult to interpret and obscure the actual differences between methods. For instance, in Figure 3’s leftmost panel, the CSD method appears slightly better than the proposed one. A QQ plot might better illustrate deviations from perfect calibration. Also, the small size of panels in Figure 4 makes error bars and trends hard to see.
Thank you for the suggestion. We will revise the figure accordingly.
The transformation function is described as “monotonic,” but this should be refined to “strictly monotonic.”
Thank you for your constructive comments. We will revise the wording to "strictly monotonic" for clarity and accuracy.
Why is Park et al.’s in-processing method not included in the comparison?
When comparing in-processing methods, we selected representative approaches from each line of work. The method proposed by Park et al. is a variant of X-cal, so we did not include it separately. While their method may outperform X-cal in some settings, the fundamental trade-off between discrimination and calibration inherent to in-processing approaches still applies, and we therefore did not consider it essential to include in our comparison.
What is the purpose of the left panel in Figure 2, which aggregates performance across all datasets and models? It shows little difference among the four methods, including the uncalibrated one. Does this imply that in most settings, the original models are already well-calibrated, and post-processing yields only marginal improvements?
Since the figure integrates results from all 60 cases, some overlapping occurs in the bar plot. This is because both low and high calibration error scenarios are included. For models that already exhibit low calibration error, post-processing tends to make minimal difference. However, for models with higher calibration error, such as CRPS, the reduction is clearly substantial. What we aimed to highlight is the median value. KSP consistently reduces calibration error to near-best levels across all cases. We included this figure to emphasize that point, and we will revise the plot to improve clarity and interpretation.
The grid used for tuning the regularization parameter is large: {1, 10, 100, 1000}.
Using a regularization parameter of up to 1000 did not pose a significant issue in our experience. Even with the largest value of 1000, the models trained properly, although they may require more training epochs. We followed a similar range as reported in the X-cal paper, which also experimented with values of up to 1000.
Thanks to the author for their rebuttal. My concerns are mostly resolved. However, I still have a small question.
It is indeed possible to introduce more than four parameters while preserving the time-dependent C-index. However, based on our preliminary experiments using deep neural networks having numerous parameters, we observed a tendency toward overfitting, indicating that careful design and regularization are needed. We will include additional experimental results related to this observation in the revised version.
If overfitting is observed when you choose >4 parameters, can we add regularization to the parameters (e.g., l1 or l2)? I'm looking forward to seeing these results in the revised manuscript.
That being said, I'm happy about the rebuttal and will raise my score.
Thank you for your thoughtful suggestion. We concur that incorporating regularization is a good approach to address the overfitting. We will revise the manuscript accordingly to include this extension. If time permits, we will also share preliminary results prior to the end of the rebuttal period.
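As a minimal illustration of the kind of penalty being discussed (a sketch only; the revised manuscript's exact formulation may differ), one could add an l2 term over the post-hoc parameters to the KS-type loss, or equivalently use the optimizer's built-in weight decay:

```python
import torch

def penalized_loss(calibration_loss, params, l2=1e-3):
    # l2 penalty on the post-hoc parameters to discourage overfitting when
    # more than a handful of parameters are used.
    return calibration_loss + l2 * sum((p ** 2).sum() for p in params)

# Toy usage with three scalar parameters; weight decay in Adam
# (torch.optim.Adam(params, weight_decay=...)) is an equivalent alternative.
params = [torch.zeros(1, requires_grad=True) for _ in range(3)]
base_loss = torch.tensor(0.05)   # stand-in for a KS-type loss value
loss = penalized_loss(base_loss, params)
loss.backward()
```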
We sincerely appreciate your constructive feedback and will revise the manuscript accordingly.
More precisely, it is ... that converges in ..., in the answer concerning the theoretical aspects.
The paper introduces KSP (Kolmogorov-Smirnov post-hoc calibration), a novel method for calibrating survival models using the Kolmogorov-Smirnov metric. The authors address an important challenge in survival analysis by providing a robust, flexible, and theoretically grounded approach that enhances calibration without significantly compromising predictive performance.
Strengths and Weaknesses
Some of the key strengths of this paper are:
- The use of the well-established KS metric ensures a solid theoretical basis for the method, making it credible and reliable. Further, KSP demonstrates effectiveness across various survival models and datasets, showcasing its versatility.
- The computational efficiency of KSP, requiring fewer iterations compared to other methods like CSD or CSD-iPOT, makes it practical for handling large datasets and complex models.
- The method maintains high predictive accuracy while improving calibration, which is crucial in high-stakes domains such as healthcare and infrastructure.
Some of the aspects where this can be improved upon are:
- There are instances where KSP performs less well than other methods across specific dataset conditions or small sample sizes. It is important to analyze these scenarios thoroughly to identify "operating regions" of the metric.
- Continuing on the above, the method may not capture calibration error as effectively in very small sample sizes, which could be a limitation in certain scenarios.
Overall, the paper presents an innovative solution to a significant problem in survival analysis, offering a method that is both theoretically sound and empirically validated. It has the potential to significantly impact the development and deployment of survival models in real-world applications. The authors are encouraged to explore further enhancements and validations to address its limitations and expand its applicability.
Questions
- Why does KSP sometimes underperform compared to other methods in specific datasets? What steps can be taken to enhance its performance across these scenarios?
- How can KSP be adapted to effectively handle cases with extremely small sample sizes or high censoring rates? Is this feasible?
- What future work is planned to further validate and generalize the method across a broader range of survival analysis scenarios, including extreme cases and challenging data conditions?
Limitations
n/a
Final Justification
Overall, I am satisfied with the responses to my questions. I am curious to see how the other reviewers weigh the responses to their questions, but to be fair to my original review and to the authors' ability to respond to particular queries, I am keeping my current review and scores as they are.
Formatting Concerns
n/a
We thank the reviewer for their thoughtful and constructive comments. We respond to each point below and will make all necessary clarifications and corrections in the final version.
Why does KSP sometimes underperform compared to other methods in specific datasets? What steps can be taken to enhance its performance across these scenarios?
The SUPPORT dataset contains more than 100 tied events at multiple time points, with a particularly severe accumulation of ties at early time points. In the GBSG dataset, approximately 19% of the observations are censored at the final recorded time. Due to the presence of heavy ties and skewness in both datasets, the performance of KSP appears to be inferior to quantile-based approaches such as CSD and CSD-iPOT. We believe that the performance of KSP could be improved by incorporating tie-breaking strategies or reformulating the method to explicitly account for ties. We will further elaborate on this issue in the conclusion.
How can KSP be adapted to effectively handle cases with extremely small sample sizes or high censoring rates? Is this feasible?
One possible direction for improvement is to enhance the nonlinearity of the KSP formulation using deep neural networks, or alternatively, to simplify its complexity. Incorporating appropriate weighting schemes may also help. However, challenges such as small sample size or high censoring rates are not specific to KSP. These are fundamental difficulties that are generally hard to overcome, regardless of the method used.
\begin{array}{l|cccccc} \hline Method & C-index & S-cal(20) & D-cal(20) & KS-cal & KM-cal & IBS \\ \hline Non-calibrated & 0.73462 & 0.009487 & 0.029233 & 0.139288 & 0.037962 & 0.204758 \\ KSP & 0.75688 & 0.002120 & 0.021412 & 0.096729 & 0.016621 & 0.182144 \\ \hline \end{array}
We applied KSP to the pbc dataset from the R survival package, which contains 250, 84, and 84 samples for training, validation, and testing, respectively, with a censoring rate of 61%. As shown in the table, the KSP method clearly reduces calibration error. Notably, it also improves the C-index, indicating that KSP can enhance performance in small-sample settings where baseline models often struggle. However, we acknowledge that the improvement may not yet be sufficient to satisfy practical or research expectations fully. Addressing this limitation remains an important direction for future work.
What future work is planned to further validate and generalize the method across a broader range of survival analysis scenarios, including extreme cases and challenging data conditions?
As discussed in the two points above, a promising direction for future work is to revise the KSP formulation to better accommodate datasets with a large number of tied events or extreme characteristics such as small sample sizes and high censoring rates. This may involve drawing inspiration from existing methods, such as approaches for handling ties like the Efron approximation, or weighting ideas that underlie the log-rank test, while developing solutions specifically designed for calibration-focused objectives.
Thank you for your detailed response!
We sincerely appreciate your constructive feedback and will revise the manuscript accordingly.
Dear reviewers,
We’ve put a lot of effort into addressing the comments and would really appreciate it if you could take a moment to review our response. If you have any thoughts or feedback, we’d be grateful to hear them.
This paper proposes a post-hoc calibration approach for survival analysis based on the Kolmogorov-Smirnov metric. This proposed approach is well-motivated and theoretically sound, and the experiments were extensive. During the course of the author/reviewer discussion, the authors addressed the weaknesses well with helpful clarifications and additional experiments. Of the 4 reviewers, 3 ended up favoring acceptance (1 accept, 2 borderline accept), and a dissenting reviewer voted for borderline rejection (note however that the dissenting reviewer stopped being responsive and never acknowledged the author rebuttal; I found that their concerns were sufficiently addressed by the authors). Thus, I am recommending acceptance for this paper.