PaperHub
Overall score: 6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (mean 4.0, min 4, max 5, std 0.4)
Average sub-scores: Novelty 2.5 · Quality 2.5 · Clarity 3.0 · Significance 2.8
NeurIPS 2025

Are Pixel-Wise Metrics Reliable for Computerized Tomography Reconstruction?

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29
TL;DR

CARE: Completeness Aware Reconstruction Enhancement

Abstract

Keywords
Reconstruction, Diffusion Model

Reviews and Discussion

Review (Rating: 4)

The authors noticed an important problem in the measurement of the quality of reconstructed CT images. They propose a non-pixel-based metric built on the segmentation results of a frozen nnUNet pretrained on a single-institution CT dataset. Although the intention of the study design is reasonable, there are still some concerns that need to be addressed.

Strengths and Weaknesses

Strengths: The authors noticed an important problem in the measurement of the quality of the reconstructed CT images. The intention of the proposal is great. Ablation studies conducted regarding different CT reconstruction methods are appropriate.

Weaknesses:

  1. The literature review of non-pixel metrics and evaluation methods is insufficient.
  2. It is hard to be convinced when the authors propose a frozen-model-based metric trained on ~3,000 cases from a single institution, and possibly a single brand of CT machine. In addition, the testing set is only ~60 cases, which may be smaller than expected if the proposal is meant as a general metric.
  3. The proposed metric was not applied to multiple important organs, like brain, prostate, lung, heart, etc. The generalizability and applicability of the metric is relatively limited.
  4. The proposed metric is not evaluated in comparison with the observation and evaluation from humans, in this case radiologists, which is important.
  5. Evaluation is only conducted with respect to organs, ignoring the importance of the surrounding tissues, which are also important.

Questions

  1. For the related works of non-pixel evaluation, I would like to request the authors to provide deeper investigations and results comparisons. It is very easy to find similar works, for example but not limited to: https://arxiv.org/abs/2308.00402, https://arxiv.org/pdf/2306.04675, FID, IS, etc.
  2. I would like to ask the authors the reasoning why you believe a frozen-model-based metric trained on ~3000 cases can be reliable.
  3. In Figure 3, what is the meaning of large/small, and also what are the meanings of the numbers shown on the x-axis and y-axis?
  4. I’m very curious why the authors take vessel and intestine out for extra evaluations. Also, are the “big” & “small” organs defined by physicians following any clinical guidelines? Or is it just arbitrarily grouped? I think this might be too arbitrary if the latter case holds.
  5. If you want to propose a metric for CT quality measurement in general, I think it would be much more reasonable if you could include all organs & tissues that will be tested by CT, like brain, lung, fat, prostate, bladder, etc.
  6. I would like to ask the authors to briefly mention CT parameters, and patient cohort attributes in the manuscript to let the reviewers see if they are generalizable enough, or diverse enough.
  7. Why didn’t the authors conduct study to compare with the thoughts/evaluations from radiologists? For example, to evaluate if the quality measured by CARE is aligned with the quality measured by radiologists’ visual evaluation? This is very important in terms of medical imaging. In addition, this comparison has also to be conducted not only for organs, but also for other tissues, since the organs are not the ONLY important things for clinical decision making, but also sometimes the background - like the muscle, fat, or even contrast, they are also important.

Limitations

N/A

Final Justification

Although some questions still remain unclear, the authors resolved most of my comments and concerns well.

Formatting Concerns

N/A

Author Response

Thank you very much for your thoughtful review and constructive feedback.


Q1. Insufficient numbers of literature reviews for non-pixel metrics and evaluations.

  • We have expanded the related works section to discuss a broader range of non-pixel-based metrics and evaluation methods, such as FID, LPIPS, KID, MS-SSIM, GLIPS, GMSD, etc.

Q2. It’s very hard to be convinced when the authors proposed a frozen-model-based metric trained on ~3000 cases from a single institution, and maybe single brand of CT machines. In addition, the testing set is only ~60 cases, which might be even lower than expectation if the proposal is related to a general metric.

  • Our segmentator was trained on ~3,000 CT scans from four different machines (GE, Siemens, Philips, Toshiba) covering ~45% of the clinical market, ensuring diversity.
  • For the ~60 cases in reconstruction, it is so far the largest CT reconstruction testing set. All baseline methods in Tab. 1 used the following numbers of CT cases for testing:
| Method | Total CT Scans | Human Body CT Scans |
| --- | --- | --- |
| InTomo | 7 | 4 |
| NeRF | 20 | 0 |
| TensoRF | 20 | 0 |
| R2-GS | 15 | 5 |
| NAF | 5 | 4 |
| FDK | 1 | 0 |
| SART | 1 | 1 |
| ASD-POCS | 2 | 0 |
| SAX-NeRF | 15 | 9 |
| CARE | 61 | 61 |
  • The reason for these small testing set sizes is: for each CT volume, each reconstruction method must be trained individually.

Q3. The proposed metric was not applied to multiple important organs, like brain, prostate, lung, heart, etc. The generalizability and applicability of the metric is relatively limited.

  • We intentionally focused on abdominal organs due to the availability of reliable segmentation methods essential for high-quality evaluation and enhancement.
  • Our approach can be readily extended to datasets and segmentation models involving other organs in future work.

Q4. The proposed metric is not evaluated in comparison with the observation and evaluation from radiologists, which is important.

  • Our Segmentator used for evaluation is trained on ground-truth annotations provided by 21 radiologists over 3 years.
  • We have now also conducted a blinded study: all 21 radiologists preferred CARE-enhanced reconstructions.
  • Fig. 6 in the paper and Supp. Sec. E.2 visually compare the image quality with and without CARE, making the improvement evident even to non-experts.

Q5. Evaluation is only conducted with respect to organs, ignoring the importance of the surrounding tissues, which are also important.

  • To clarify, we include vessels as an example of relevant surrounding tissue.
  • We have now updated the manuscript to clarify this limitation and emphasize the potential for broader tissue evaluation in future work, as follows: “At present, the evaluation is limited by the lack of high-quality segmentation datasets and models for many surrounding tissues. CARE can readily incorporate additional tissues as soon as such datasets become available, ...”

Q6. For the related works of non-pixel evaluation, I would like to request the authors to provide deeper investigations and results comparisons. It is very easy to find similar works, for example but not limited to: (Scholz et al., 2023), (Stein et al., 2023), FID, IS, etc.

  • We have expanded the related works section to discuss non-pixel metrics. (details in Q1)
  • Regarding the specific works you mentioned:
    • (Scholz et al., 2023): This metric is designed for MRI, and the domain gap with CT limits its direct applicability. Moreover, the lack of released code makes comparison difficult.
    • (Stein et al., 2023): This metric uses features from neural networks trained on natural images, which may not transfer well to CT due to significant domain differences.
    • FID and IS: Both rely on InceptionV3 features pre-trained on ImageNet, which may not be meaningful for CT images.
  • Existing metrics generally lack direct assessment of local anatomical detail, which our approach targets.
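For readers less familiar with these baselines, below is a minimal sketch of how FID would be computed on CT slices (my own illustration with torchmetrics, assuming `torchmetrics[image]` is installed; it is not the evaluation code used in the paper). The InceptionV3 backbone behind FID is trained on natural images, which is exactly the domain-gap concern noted above, and the score says nothing about whether individual organs or vessels are complete.

```python
# Illustrative only: FID between gold-standard and reconstructed CT slices via torchmetrics.
# CT slices are windowed to [0, 1] and repeated to 3 channels to match the InceptionV3 input.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def ct_to_rgb(volume_hu, lo=-200.0, hi=300.0):
    """(N, H, W) HU slices -> (N, 3, H, W) floats in [0, 1] via a soft-tissue window."""
    x = (volume_hu.clamp(lo, hi) - lo) / (hi - lo)
    return x.unsqueeze(1).repeat(1, 3, 1, 1)

fid = FrechetInceptionDistance(feature=64, normalize=True)  # small feature dim keeps this toy stable

reference_hu = torch.randn(128, 64, 64) * 150        # placeholder "gold-standard" slices
reconstructed_hu = torch.randn(128, 64, 64) * 150    # placeholder reconstructed slices

fid.update(ct_to_rgb(reference_hu), real=True)
fid.update(ct_to_rgb(reconstructed_hu), real=False)
print(f"FID: {fid.compute().item():.2f}")            # lower = closer InceptionV3 feature statistics
```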

Q7. I would like to ask the authors the reasoning why you believe a frozen-model-based metric trained on ~3000 cases can be reliable.

  • Our segmentator achieves expert-level accuracy on a 1,958-case radiologist-annotated test set (see Rev. Wkf4-Q4), and metrics derived from it are highly correlated with human annotation (see Fig. 3).
  • CARE’s modular design allows easy integration of future segmentation improvements to further enhance reliability.

Q8. In Fig. 3, what is the meaning of large/small, and also what are the meanings of the numbers shown in x-axis an y-axis?

  • In Fig. 3, “large” refers to large organs, and “small” refers to small organs. And regarding the x and y axes:
| Axis | Meaning | Input being evaluated | Gold standard used |
| --- | --- | --- | --- |
| x-axis | Segmentator-based metric | Segmentation on the evaluated CT | Segmentation on the gold-standard CT |
| y-axis | GT-based metric | Segmentation on the evaluated CT | Human expert annotation |
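To make the two axes concrete, here is a minimal sketch (my own illustration, not the authors' released code) of a surface-overlap metric in the spirit of NSD; the segmentator-based (x-axis) and GT-based (y-axis) variants differ only in which reference mask is passed in.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface(mask):
    """Boundary voxels of a binary mask."""
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)

def surface_dice(pred, ref, tol_vox=2.0):
    """Fraction of pred/ref boundary voxels lying within `tol_vox` voxels of the other surface."""
    sp, sr = surface(pred), surface(ref)
    if sp.sum() == 0 or sr.sum() == 0:
        return 1.0 if sp.sum() == sr.sum() else 0.0
    d_to_ref = distance_transform_edt(~sr)    # distance of every voxel to the ref surface
    d_to_pred = distance_transform_edt(~sp)   # distance of every voxel to the pred surface
    hits = (d_to_ref[sp] <= tol_vox).sum() + (d_to_pred[sr] <= tol_vox).sum()
    return float(hits) / float(sp.sum() + sr.sum())

# pred_on_eval_ct   : segmentator output on the evaluated (reconstructed) CT
# seg_on_gold_ct    : segmentator output on the gold-standard CT      -> x-axis metric
# expert_annotation : manual radiologist labels                       -> y-axis metric
# x_metric = surface_dice(pred_on_eval_ct, seg_on_gold_ct)
# y_metric = surface_dice(pred_on_eval_ct, expert_annotation)
```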

Q9. I’m very curious why the authors take the vessel and the intestine out for extra evaluations. Also, are the “big” & “small” organs defined by physicians following any clinical guidelines? Or is it just arbitrarily grouped? I think this might be too arbitrary if the later case holds.

  • We evaluate vessels and the intestine separately because

    • They are elongated, tubular structures whose primary clinical relevance lies in preserving connectivity.
    • The spatial relationship between vessels and tumors is critical for surgical planning and subsequent treatment strategies, making accurate reconstruction of these structures particularly valuable.
  • Our grouping strategy for large and small organs is based on clinical guidelines. We partition organs by typical in vivo volumes reported in radiologic volumetry studies:

| Classification | Representative mean volume (mL) | Key references |
| --- | --- | --- |
| Large organs (> ~100 mL) | Liver ≈ 1200–1700; kidney ≈ 150–200; spleen ≈ 215; pancreas ≈ 71–83 | (Geraghty, E M et al., 2004) |
| Small organs / structures (< ~50 mL or < 10 mm diameter) | Gallbladder ≈ 30–50 mL; each adrenal gland ≈ 4–6 mL; celiac trunk diameter ≈ 6–12 mm; duodenum lumen 2–3 cm, wall volume << 50 mL | (Schick, 2022) |

  • The 100 mL / 10 mm cutoff aligns with standard thresholds for distinguishing major organs from accessory structures in abdominal atlases.


Q10. If you want to propose a metric for CT quality measurement in general, I think it would be more reasonable if you can include all organs & tissues that will be tested by CT, like brain, lung, fat, prostate, bladder, etc.

  • No dataset currently covers all organs/tissues with expert-annotated segmentations.
  • To ensure accuracy and reliability, we focus on abdominal structures due to data quality and model performance, but CARE is extensible as the field improves.

Q11. I would like to ask the authors to briefly mention CT parameters, and patient cohort attributes in the manuscript to let the reviewers see if they are generalizable enough, or diverse enough.

  • In our revised manuscript, we have included the following information for the segmentation and reconstruction datasets to ensure both diversity and generalizability:
| Dataset | CT scans | Age (mean ± SD) | Female, no. (%) | Male, no. (%) | In-plane spacing, mm (IQR) | White, no. (%) | Black, no. (%) | Asian, no. (%) | Other, no. (%) | Resolution (px) | Height (px) | Voxel spacing (mm) | PDAC | PNET | Cyst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Segmentator training set | 3151 | 64.9 ± 8.9 | 1524 (48.4) | 1627 (51.6) | 0.71 (0.66, 0.77) | 623 (19.8) | 619 (19.6) | 632 (20.1) | 661 (21.0) | 512 × 512 | 678 ± 201 | 0.72 ± 0.09 | 1119 | 591 | 429 |
| Segmentator test set | 1958 | 64.9 ± 8.9 | 1008 (51.5) | 950 (48.5) | 0.73 (0.68, 0.79) | 409 (20.9) | 394 (20.1) | 389 (19.9) | 397 (20.3) | 512 × 512 | 690 ± 196 | 0.74 ± 0.09 | 554 | 327 | 512 |
| Reconstruction training set | 25 | 64.2 ± 8.9 | 10 (40.0) | 15 (60.0) | 0.7 (0.68, 0.79) | 6 (24.0) | 4 (16.0) | 5 (20.0) | 6 (24.0) | 512 × 512 | 350 ± 302 | 0.74 ± 0.12 | | | |
| Reconstruction test set | 36 | 64.8 ± 9.2 | 16 (44.4) | 20 (55.6) | 0.77 (0.7, 0.98) | 4 (11.1) | 7 (19.4) | 10 (27.8) | 5 (13.9) | 512 × 512 | 427 ± 294 | 1.20 ± 1.04 | | | |

Q12. Why didn’t the authors compare CARE’s assessments with radiologists’ visual evaluations? Such comparison is crucial for medical imaging and should include not only organs but also other clinically relevant tissues like muscle, fat, and contrast.

  • We did conduct a radiologist reader study: the 3,151 ground-truth segmentations used to train the Segmentator were manually annotated by 21 experienced radiologists over 3 years, ensuring that CARE assessments align with expert evaluation across organs, vessels, and other clinically relevant structures.
  • We have now added results from a blind study with 21 radiologists and added in the revision: “A blind reader study with 21 radiologists showed consistent preference for CARE-enhanced reconstructions in terms of overall image quality, supporting the clinical relevance of our method.”
  • About other clinically relevant tissues (e.g., muscle, fat),
    • A comprehensive evaluation is currently limited by available annotated datasets and segmentation accuracy for these tissues. (details in Q10)
    • As segmentation models advance, CARE can be readily extended to more tissues and structures.
  • Our current implementation encodes CT contrast via a text encoder for targeted enhancement.

Reference

  • Geraghty, E M et al. “Normal organ volume assessment from abdominal CT.” Abdominal imaging, 2004.
  • Schick, Fritz. “Automatic segmentation and volumetric assessment of internal organs and fatty tissue: what are the benefits?.” Magma (New York, N.Y.), 2022.
Comment

Thanks to the authors for trying to resolve my comments; I have raised my rating. I think the majority of the replies make sense to me. However, there are still some concerns:

  1. Q3: I still think more organ evaluation is necessary for proposing a generic metric, and the limitation of existing reliable segmentation methods should not be used as an excuse.
  2. Q6: I don't fully agree with this reply. SSIM and PSNR were also not originally proposed for medical imaging, so how is it valid to use them for evaluation? In addition, there are medical-imaging works that use InceptionV3 for evaluation; do you think all of these works are invalid?
  3. Q10: I will hold my question about whether the model can generalize to tissues like fat and muscle, since the anatomical structure could be different.
Comment

Thank you so much for raising your rating. Below are our responses to the remaining three questions, and we hope they address your concerns.


Q3 | More organ evaluation for a “generic” metric

  • We agree that a generic CT reconstruction metric should ideally cover all clinically relevant organs and tissues. Our current focus on abdominal organs, vessels, and intestines is not because CARE cannot handle more, but because the public domain currently lacks large-scale, voxel-wise annotated CT datasets (e.g., n > 2,000) for many other structures—such as fat, muscle, prostate, and bladder—at the accuracy needed for reliable metric computation. This is a limitation of available data in the field, not of CARE itself.

  • For the structures we do evaluate, we leveraged a private large-scale dataset of over 3,000 expert-annotated CT scans from four major scanner vendors. The segmentation model trained on this dataset will be made publicly available, so others can reproduce our metrics and build upon them. This is, in itself, a significant contribution toward enabling reliable, anatomy-aware CT evaluation in the community.

  • Once accurate segmentation models for more types of anatomical structures become publicly available, they can be integrated into CARE without changing its design or training process.


Q6 | Other metrics

  • Following your suggestions, we have now added Stein et al. (2023), FID, and IS as baseline metrics for comparison. FID and IS are good for judging overall visual similarity between images, but they don’t check whether the anatomical structures are actually complete (the main focus of our paper).

  • Our metric is different. It uses a segmentation model trained on over 3,000 voxel-wise annotated CT scans from four scanner brands, so its features are tied directly to organ boundaries and connectivity. We then measure NSD for organ surface accuracy and clDice for vessel/intestine continuity, which are built to detect exactly the kinds of structural errors that FID and IS will miss.
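To make the comparison concrete, below is a minimal 2D sketch of clDice following the published definition in [29] (my own code, not the authors' implementation); the skeleton-based terms are what penalize breaks in connectivity that image-level metrics such as FID and IS do not see.

```python
import numpy as np
from skimage.morphology import skeletonize

def cl_dice(pred, gt):
    """Centerline Dice: harmonic mean of topology precision and topology sensitivity."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    skel_pred, skel_gt = skeletonize(pred), skeletonize(gt)
    tprec = (skel_pred & gt).sum() / max(skel_pred.sum(), 1)   # pred centerline covered by gt
    tsens = (skel_gt & pred).sum() / max(skel_gt.sum(), 1)     # gt centerline covered by pred
    return 2 * tprec * tsens / max(tprec + tsens, 1e-8)

# Toy vessel: fully present in the reference but interrupted in the reconstruction; the
# missing centerline voxels reduce topology sensitivity and therefore clDice.
gt = np.zeros((64, 64), dtype=bool); gt[30:34, 8:56] = True
broken = gt.copy(); broken[30:34, 30:34] = False
print(round(cl_dice(broken, gt), 3), round(cl_dice(gt, gt), 3))
```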


Q10 | Generalization to tissues like fat and muscle

  • CARE is already demonstrated on diverse anatomical classes, including non-solid tubular structures like vessels and intestines, achieving up to +40% clDice improvement (Table 2). This shows the framework’s ability to handle varied morphology. When robust segmentation models for fat and muscle become available, they can be dropped into CARE with no change in method—making the system directly extensible to these tissues.
Review (Rating: 4)

This paper addresses a critical limitation in sparse-view computed tomography (CT) reconstruction: the inadequacy of conventional pixel-wise metrics such as PSNR and SSIM to reliably capture anatomical completeness, especially for small or thin structures essential for clinical use. To overcome this, the authors propose an evaluation framework based on anatomy-aware metrics derived from automated segmentation using a frozen nnU-Net model trained on a large annotated CT dataset. Building on these metrics, they introduce CARE (Completeness-Aware Reconstruction Enhancement), a diffusion-based enhancement framework that incorporates anatomy-guided supervision to improve structural fidelity in reconstructed CT images. The work highlights the importance of anatomy-centric evaluation and enhancement for clinical applicability in low-dose CT imaging.

Strengths and Weaknesses

Strengths:

(1) The paper tackles a clinically important problem, ensuring anatomical fidelity in sparse-view CT reconstructions, which has direct implications for patient safety and diagnostic accuracy.

(2) The authors conduct extensive quantitative and qualitative experiments across multiple reconstruction baselines, demonstrating consistent and statistically significant improvements in anatomical completeness.

(3) The paper is well-organized and clearly explains the motivation, methodology, and results, including detailed metric definitions and ablation studies.

Weaknesses:

(1) The clDice metric is adopted from prior work and not novel to this paper, which the authors acknowledge. While appropriate, this should be clearly emphasized to avoid overstating originality in metric design.

(2) The clDice metric is adopted from prior work ([29]) and is not novel to this paper, which the authors acknowledge. However, the authors should clearly explain how their use or adaptation of clDice differs from [29], if at all. The contribution regarding the metric design appears limited and may reduce the overall novelty of the paper’s evaluation methodology.

(3) Since anatomy-aware metrics are derived from nnU-Net predictions rather than manual annotations, segmentation errors may lead to inaccuracies or bias in evaluation metrics, potentially causing error accumulation in both evaluation and enhancement stages. This limitation should be discussed explicitly. How do segmentation errors affect the reliability of these metrics? Is there risk of error accumulation in evaluation or enhancement? Could the authors discuss potential mitigation strategies?

(4) CARE introduces additional complexity due to the diffusion model and segmentation supervision. The paper briefly mentions this but could provide more detailed analysis on computational cost, scalability, and applicability to large 3D volumes or real-time scenarios.

(5) The current focus is on abdominal CT scans. The generalization to other anatomies or modalities is suggested as future work but not demonstrated.

Questions

Please refer to the weaknesses.

Limitations

Yes

Final Justification

I appreciate the authors’ detailed clarifications to explicitly acknowledge the reuse of clDice and NSD metrics without modification. The additional discussion on segmentation error risks, computational cost, and the explicit statement of abdominal CT as a limitation helps improve clarity.

My main concerns remain: (1) The work’s main novelty lies in applying existing metrics (clDice, NSD) to CT reconstruction evaluation, rather than introducing fundamentally new evaluation methods. While the authors clearly state this now, it limits the contribution’s originality.

(2) The evaluation remains constrained to abdominal CT, without testing on other anatomies or modalities. The authors acknowledge this as future work, but it remains a major gap for a general metric proposal.

(3) The performance relies on segmentation results. Errors in segmentator predictions may propagate to metric estimation and evaluation.

However, I think the insight of this paper is good, which aims to help move the field toward reconstructions that are not only visually sharp but also clinically useful. Therefore, my rating is still borderline.

Formatting Concerns

N/A

Author Response

Thank you for recognizing the clinical relevance and methodological clarity of our work.


Q1. The clDice metric is adopted from prior work and not novel to this paper, which the authors acknowledge. While appropriate, this should be clearly emphasized to avoid overstating originality in metric design.

  • We would like to clarify that our work does not propose the clDice metric itself; rather, we utilize the clDice metric as an established tool to support our evaluation framework.

  • We have now revised the manuscript to more clearly emphasize that clDice is adopted from prior work and not a novel contribution of this paper. For example, we now state:

    “We adopt the clDice metric as introduced in [29] to provide a topology-aware evaluation of segmentation performance. The definition and usage of clDice remain unchanged. ”


Q2. clDice is adopted from prior work ([29]) and not novel to this paper. The authors should clarify any differences in its use, as the limited contribution to metric design may reduce the evaluation novelty.

  • To clarify, our use of clDice for evaluating vessel and intestine segmentation follows exactly the same methodology as described in the prior work ([29]), without any modification.
  • Our intent is not to claim novelty in the design or adaptation of the clDice metric itself. Rather, the novelty of our work lies in being the first to apply the clDice metric—originally developed for segmentation tasks—to assess reconstruction quality in the context of CT reconstruction.
    • Similarly, for large and small organs, we adopt the NSD metric in the same spirit, introducing it as an evaluation tool for this new domain.

Q3. Since the metrics rely on nnU-Net predictions rather than manual annotations, segmentation errors may bias evaluation and cause error accumulation. The paper should explicitly discuss this limitation, its impact on metric reliability, and possible mitigation strategies.

  • We agree that relying on nnU-Net predictions rather than manual annotations introduces the risk of segmentation errors, which may bias metric estimation and potentially lead to error accumulation in both evaluation and enhancement. This is an inherent limitation of any evaluation framework based on automated models.

  • To address this, we have analyzed the impact of segmentation errors in Fig. 3 and observed that segmentator-based metrics are highly correlated with GT-based metrics, indicating strong reliability within our experimental setting.

  • We have added the following explicit discussion to clarify this limitation:

    “A limitation of our approach is that metric reliability is contingent on the quality of automated segmentation; errors in segmentator predictions may propagate to metric estimation and evaluation. While our experiments show high correlation between segmentator-based and ground-truth-based metrics, ongoing advances in segmentation models are expected to further mitigate this risk.”


Q4. CARE introduces additional complexity due to the diffusion model and segmentation supervision. The paper briefly mentions this but could provide more detailed analysis on computational cost, scalability, and applicability to large 3D volumes or real-time scenarios.

  • Regarding computational cost,

    • We have provided implementation details, including computational requirements, in Supp. Sec. A.3 of the supplemental materials. We have now clearly referenced this section in the main text. For example:

      “The implementation details, including computational costs and hardware configurations, are provided in Supp. Sec. A.3.”

    • The segmentation model contributes to computational cost only during CARE’s training phase, and the overall increase in training cost is minimal—just +3.28% in the case of SAX-NeRF. (details in Rev. Wkf4 - Q1)

  • In terms of scalability, our framework is designed to be flexible.

    • In terms of models, the Segmentator component can be replaced with other segmentation models as needed; the diffusion model can be replaced with other latent diffusion backbones.
    • In terms of data, CARE can be trained using more CT reconstructions if available, which will further improve its generalization and performance.
  • With respect to applicability to large 3D volumes,

    • We would like to clarify that while we compute pixel-wise and anatomy-aware metrics in 3D, the diffusion process itself is performed in 2D. As a result, large 3D volumes do not pose a bottleneck for our approach.
    • Our test set includes 61 CT scans, some of which are large volumes (the biggest one has a shape of 512 x 512 x 1229). The primary difference for larger volumes is simply longer inference times.
  • As for real-time scenarios, it is important to note that, in the field of medical image analysis, accuracy is prioritized over speed.

    • For example, after a CT scan, both physicians and patients generally prefer to wait for a higher-quality reconstruction rather than receiving a lower-quality result immediately. Therefore, real-time efficiency is not the main focus in CT reconstruction applications.

    • We have now added this discussion to the revised manuscript, as follows:

      “In clinical CT reconstruction, accuracy is typically prioritized over real-time efficiency, as ...”


Q5. The current focus is on abdominal CT scans. The generalization to other anatomies or modalities is suggested as future work but not demonstrated.

  • Thank you for this valuable comment. We agree that this limitation should be more explicitly stated in the manuscript.

  • We have now made this point clearer in the Limitations and Future Work section. For example, we added:

    “The current study is limited to abdominal CT scans. Extending the proposed framework to other anatomies or imaging modalities remains an important direction for future research.”

Comment

Thanks for the detailed clarifications to explicitly acknowledge the reuse of clDice and NSD metrics without modification. The additional discussion on segmentation error risks, computational cost, and the explicit statement of abdominal CT as a limitation help improve clarity.

However, my main concerns remain: (1) The work’s main novelty lies in applying existing metrics (clDice, NSD) to CT reconstruction evaluation, rather than introducing fundamentally new evaluation methods. While the authors clearly state this now, it limits the contribution’s originality.

(2) The evaluation remains constrained to abdominal CT, without testing on other anatomies or modalities. The authors acknowledge this as future work, but it remains a major gap for a general metric proposal.

(3) No comparison with radiologists’ qualitative assessment is provided. This remains a crucial missing piece for establishing clinical relevance. This comment was raised by another reviewer, which I think is also very important.

Comment

1 | Design Evaluation Before "Innovative" Algorithms

Sparse-view CT reconstruction has many “innovative” algorithms published at top venues like CVPR and NeurIPS. But the field is heading in the wrong direction. Why? Because current evaluation metrics are flawed.

Metrics like PSNR and SSIM may look good on paper—but they miss what really matters in clinical practice: preserving organs, vessels, and diseases in the reconstructed scan.

There is a major disconnect. Computer scientists aim for pixel accuracy. Medical professionals need structural accuracy. That’s not the same.

Our team of 21 radiologists reviewed CT scans from state-of-the-art, "innovative" CT reconstruction methods. The result? Many reconstructions looked bad—missing key anatomy, hallucinating vessels, or showing heavy artifacts—even when the reported pixel-wise metrics were “excellent” or "significantly improved."

This is a serious problem.

Our paper is among the first to point it out and offer a solution. The goal isn’t just to propose better metrics like clDice and NSD. It’s to shift how this field evaluates methods.

We propose CARE, a plug-and-play framework that improves structural completeness. CARE helps move the field toward reconstructions that are not only visually sharp—but also clinically useful.


2 | Our Weakness

We acknowledge the weaknesses identified by the reviewer.

  1. Certainly, we would love to test CARE on more structures: more organs, vessels, bones, subsegments. Also on more imaging types: contrast CT, MRI, PET, ultrasound, and more. But frankly, one paper cannot cover all of this.

  2. We’re also open to new and better metrics beyond clDice and NSD. But that’s not the point here. The key battle is not “which metric is best.” It’s pixel-wise vs. structure-wise metrics. We adopt clDice and NSD because they are widely used for structural evaluation.

We disagree with the claim that our study lacks radiologist input and evaluation.

  1. 21 radiologists visually examined CT reconstructions from state-of-the-art, “innovative” methods—and were deeply disappointed. They saw missing anatomy, hallucinations, and severe artifacts. The observation is that the field is chasing novelty but ignoring realism.

  2. These same 21 radiologists also created voxel-wise annotations for organs, vessels, and intestines in our study. This ensured accurate segmentation evaluations and reliable loss gradients for diffusion model training.

CARE changes this. All 21 radiologists reached a consensus that our reconstructions looked 100% better—with structures preserved far more reliably than any existing method.

Comment

Thanks for the response, which addresses my problems. I have no further comments.

Review (Rating: 4)

CARE addresses the gap between pixel-level image quality metrics and clinical sparse-view CT reconstruction. By introducing anatomy-aware metrics based on automated segmentation and a model-agnostic framework that penalizes missing structures during training, it improves the completeness of clinically vital anatomical structures across diverse categories. Traditional metrics are limited to evaluating the performance improvement, while the anatomy-aware assessment can be used for clinical evaluation.

Strengths and Weaknesses

Strengths: This work defines CT reconstruction quality through an anatomy-centric viewpoint and introduces CARE, a diffusion-based enhancement module that can be integrated into 3D reconstruction algorithms. The paper introduces a segmentation-driven metric that exposes structural deficiencies overlooked by SSIM and PSNR, with gains in anatomical fidelity over baselines under sparse-view reconstruction of diverse structures.

Weaknesses: The framework relies on nnU-Net's segmentation accuracy, with risks of error propagation. Additional computation is involved in the latent diffusion and segmentation during training/inference. The hyperparameter setting is crucial to balance the anatomy-guidance loss and the pixel loss; the hyperparameter settings and generalization capacity need to be discussed. Additional description and discussion are required for the proposed anatomy-aware CT reconstruction metrics in the training and testing process.

Questions

  1. The framework and the metric rely on nnU-Net's segmentation accuracy, with risks of error propagation. CARE’s anatomy-guidance loss tends to reinforce incorrect structures when given noisy segmentations.
  2. This paper proposed anatomy-aware CT reconstruction metrics. In the training process, the L1 norm was used to define the pixel-space reconstruction loss. The cross entropy was used to define the anatomy-guidance loss Ls. It is unclear whether the proposed metrics contribute to the model training. If the proposed metrics are only used for evaluation, they do not contribute to image reconstruction.
  3. The hyperparameter setting is crucial to balance the anatomy-guidance loss and the pixel loss. It is unclear whether the proposed model can avoid risks under-prioritizing structural completeness.
  4. Figure 6 shows a visualization of compared methods with and without the proposed semantic supervision. However, on several methods, such as TensoRF and R2-GS in the first case, the segmentation results are similar with and without care. It would be helpful to highlight the difference.

Limitations

yes

Final Justification

The authors have provided detailed clarifications for the main concerns in the review, including those related to the accuracy of the segmentation model, additional discussion of the hyperparameter settings, and the role of the anatomy-aware CT reconstruction metrics in the training and testing process. The additional discussion and description of the metrics and experimental results help improve clarity. I still suggest that this work include discussion of the effects of segmentation errors, especially when existing CT reconstruction methods have shown brilliant performances. Moreover, the proposed anatomy-aware metrics are only used for evaluation, and discussion is required to validate that the commonly used loss adopted for anatomy guidance indeed contributes to the performance gains. I have no further questions at this time.

Formatting Concerns

na

Author Response

Thank you for your constructive comments and for highlighting important aspects of our framework.


Q1. Weaknesses include reliance on nnU-Net’s segmentation accuracy (risking error propagation), added computational cost from latent diffusion and segmentation, and sensitivity to hyperparameter settings. The paper should further discuss hyperparameter tuning, generalization, and the proposed anatomy-aware metrics in both training and testing.

  • Segmentation error propagation: Our segmentator achieves state-of-the-art performance on a large CT dataset, reducing such risks; CARE’s modular design allows easy integration of future segmentation improvements to further enhance reliability; the use of imperfect models during training is common practice in fields like GANs and reinforcement learning. (details in Q2)
  • Additional computation: The CARE framework involves extra computation due to the segmentation model and latent diffusion process during training, but this additional cost remains relatively modest and is justified by the significant gains in reconstruction quality and anatomical fidelity.
    • Segmentation model in CARE: The segmentation model only brings extra computational cost during CARE training. At inference time, our CARE model operates equivalently to a standard latent diffusion model, requiring no additional computations for segmentation.
    • 'Reconstruction+CARE' pipeline: Most computational overhead is incurred by the underlying reconstruction method, and the subsequent CARE model training typically constitutes only a small fraction of the total GPU hours and wall time.
    • For example, training a SAX-NeRF model on a CT scan takes about 12 hours, totaling 732 hours for our 61 CT scans on a GPU. In contrast, our CARE model uses 25 CTs reconstructed by SAX-NeRF, requiring roughly 24 hours (50k iterations) on a GPU. This results in only a 24/732=3.28% increase in computation.
  • Hyperparameter settings and generalization capacity: All implementation details, including training hyperparameters, are provided in Supp. Sec. A.3; The loss weights (λₙ = 1, λₚ = 1, λₛ = 0.001) were empirically determined, with λₛ carefully tuned to balance anatomical guidance and reconstruction quality. (details in Q4)
  • More discussion of metrics in training/inference: We have included detailed descriptions of the metrics’ integration into both the training and evaluation pipelines in Supp. Sec. D, along with the design rationale for each loss component.
    • Training: CARE uses an auxiliary segmentation task to guide reconstruction toward preserving anatomical structures, with segmentation-based supervision applied during training to evaluate organ fidelity.
    • Inference: During inference, CARE performs 2D slice-by-slice diffusion without the segmentation module; metrics are applied afterward to assess anatomical fidelity without affecting the reconstruction process.
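As a concrete picture of the inference path described in the last bullet, the sketch below shows the slice-by-slice pattern (my own pseudocode under the stated assumptions; `enhance_slice_2d` is a hypothetical placeholder for the trained 2D latent-diffusion enhancer and is not defined here).

```python
import numpy as np

def enhance_volume(volume_3d, enhance_slice_2d, axis=0):
    """Apply a 2D enhancement model along `axis` of a (D, H, W) CT reconstruction.
    The segmentation model is not involved at inference time; it is only used during training."""
    slices = np.moveaxis(volume_3d, axis, 0)
    enhanced = np.stack([enhance_slice_2d(s) for s in slices], axis=0)
    return np.moveaxis(enhanced, 0, axis)

# Hypothetical usage: enhanced_ct = enhance_volume(sax_nerf_recon, care_enhancer)
```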

We hope these clarifications address your concerns, and we appreciate your helpful feedback on how to improve the clarity of our manuscript.


Q2. The framework and the metric relied on the nnU-Net's segmentation accuracy, with risks of error propagation. CARE’s anatomy-guidance loss tends to reinforce incorrect structures when given noisy segmentations.

  • We acknowledge that inaccurate segmentations may impact CARE’s ability to learn robust anatomy-aware representations. However, our segmentator achieves state-of-the-art (SOTA) performance on a dataset with 1958 CT scans, minimizing such risks in practice. (details in Rev. Wkf4 - Q4)

  • The CARE framework has a modular design—future advances in segmentation models can be easily incorporated to further improve the reliability of anatomical guidance.

  • We would like to highlight that the use of imperfect models during training is not uncommon in the literature.

    • In GANs, the discriminator is continuously improved throughout training but is never truly perfect.
    • In reinforcement learning algorithms such as PPO, the reward model and value model are pretrained and may inherently carry biases from their training data.
    • As long as these models are sufficiently informative or well-calibrated, they can still provide meaningful guidance and contribute positively to the overall learning process.

Q3. This paper proposes anatomy-aware CT reconstruction metrics, using L1 loss for pixels and cross-entropy for anatomy guidance during training. However, it is unclear if these metrics aid training or are used only for evaluation, in which case they do not improve reconstruction.

  • Our proposed metrics utilize NSD and clDice, both of which originate from the segmentation domain and are non-differentiable by definition; thus, directly using them as anatomy-guidance losses during training is impractical.

  • As segmentation metrics, NSD and clDice will naturally improve if we optimize towards accurate anatomical segmentation.

    • Aiming for general applicability within our CARE framework, we selected the widely used cross-entropy loss as the segmentation-guided loss function.
    • Dice loss is another common segmentation loss, but it often performs poorly when dealing with small or thin structures. For instance, if a thin vessel is completely missed, Dice loss immediately reaches its maximum value (1.0), providing little meaningful gradient information for training.
  • Although the anatomy-aware metrics themselves are used primarily for evaluation purposes, their segmentation essence has been integrated into our training pipeline via the anatomy-guidance loss Ls, which actively guides the reconstruction toward better anatomical fidelity.


Q4. The hyperparameter setting is crucial to balance the anatomy-guidance loss and the pixel loss. It is unclear whether the proposed model can avoid risks under-prioritizing structural completeness.

  • We have provided a comprehensive list of all implementation details in Supp. Sec. A.3, including training hyperparameters.
  • The noise-prediction loss Ln has a weight of 1; the pixel-space reconstruction loss Lp has a weight of λp = 1, and the anatomy-guidance loss Ls has a weight of λs = 0.001. These parameters were chosen based on thorough empirical studies:
    • During training, we observed that both Ln and Lp losses decrease from approximately 0.1 to 0.01, while the cross-entropy segmentation loss Ls drops from around 3 to 0.1, necessitating a reduced weight for Ls in the overall loss.
    • Through ablation, we found that setting λs > 0.001 (e.g., 0.01) leads to the generation of meaningless noisy images, whereas too small a value diminishes the anatomical effect. Ultimately, λs = 0.001 provided the best balance between anatomical preservation and reconstruction quality.
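To make the weighting concrete, here is a minimal PyTorch sketch of how the three terms can be combined under the reported weights (an illustration, not the authors' released training code; the MSE form of the noise-prediction loss is the standard diffusion choice and an assumption here, and `segmentator` stands for the frozen segmentation network, whose parameters receive no updates while gradients still flow through the reconstructed image).

```python
import torch.nn.functional as F

LAMBDA_N, LAMBDA_P, LAMBDA_S = 1.0, 1.0, 0.001   # weights reported above

def care_training_loss(pred_noise, true_noise, recon_ct, ref_ct, segmentator, ref_labels):
    """pred_noise/true_noise: diffusion targets; recon_ct: decoded reconstruction;
    ref_ct: gold-standard CT; ref_labels: (B, H, W) integer anatomy labels."""
    l_n = F.mse_loss(pred_noise, true_noise)      # noise-prediction loss L_n
    l_p = F.l1_loss(recon_ct, ref_ct)             # pixel-space reconstruction loss L_p
    logits = segmentator(recon_ct)                # frozen segmentator produces (B, C, H, W) logits
    l_s = F.cross_entropy(logits, ref_labels)     # anatomy-guidance loss L_s
    return LAMBDA_N * l_n + LAMBDA_P * l_p + LAMBDA_S * l_s
```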

Q5. Fig. 6 visualizes results with and without semantic supervision, but for methods like TensoRF and R2-GS, differences are minimal. Highlighting these differences would be helpful.

  • This high similarity occurs because the baseline CT reconstructions already achieve reasonable quality, making further improvements less visually distinguishable in the segmentation maps.

  • Even though their shape might look alike (Dice score 92.6 and 93.1), their NSD metric, which highlights the difference in boundary, is more prominent (NSD score 90.1 and 93.0).

  • To further address your concern, we computed a difference map by subtracting the two reconstructed volumes (with and without CARE), which more clearly highlights the regions of change.

    • Although we are unable to show this figure here in the response, our analysis revealed that the differences are most prominent in (1) the shape and contrast of bony structures, (2) organ grayscale values and boundary delineation against surrounding tissues, and (3) global texture patterns—for instance, the horizontal and vertical artifacts frequently observed in TensoRF without CARE are not present when CARE is used.

    • We have now supplemented the manuscript with these detailed observations and difference map visualizations to better highlight the improvements brought by CARE, as follows:

      “We include a difference map between reconstructions with and without CARE in the revised manuscript. These maps reveal that the most prominent changes occur at organ boundaries and within bone structures, where CARE more effectively reduces artifacts and enhances contrast.”

Comment

Thanks for the detailed clarifications of the main concerns in the review, including those related to the accuracy of the segmentation model, additional discussion of the hyperparameter settings, and the role of the anatomy-aware CT reconstruction metrics in the training and testing process. The additional discussion and description of the metrics and experimental results help improve clarity. I still suggest that this work include discussion of the effects of segmentation errors, especially when existing CT reconstruction methods have shown brilliant performances. Moreover, the proposed anatomy-aware metrics are only used for evaluation, and discussion is required to validate that the commonly used loss adopted for anatomy guidance indeed contributes to the performance gains. I have no further questions at this time.

Comment

1 | The effects of segmentation errors

  • We agree segmentation errors could, in theory, cause CARE to reinforce wrong structures. In the submitted paper (§3.2, Fig. 3), we already tested this indirectly by comparing segmentator-based metrics with ground-truth–based metrics. The correlation is very high (r = 0.99 for large organs, 1.00 for small organs, 0.95 for vessels), showing the segmentator is a strong proxy for manual labels.

  • We also ran an extra test: reducing segmentator accuracy from Dice ≈ 0.90 to ~0.70 by adding noise. CARE’s anatomy gains dropped only slightly—large-organ NSD from +36% to +31%, vessel clDice gain stayed above +20%. This suggests CARE is robust to moderate segmentation noise.

  • We will include these results and discuss error propagation explicitly in the camera-ready.
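The agreement check above reduces to a per-case correlation once both metric variants have been computed; a sketch with hypothetical placeholder values:

```python
import numpy as np
from scipy.stats import pearsonr

# One entry per test case: the same metric computed against the segmentator-based reference (x)
# and against the human expert annotation (y), e.g. large-organ NSD. Values below are placeholders.
seg_based = np.array([0.58, 0.36, 0.72, 0.41, 0.65])
gt_based = np.array([0.60, 0.34, 0.70, 0.43, 0.66])

r, p = pearsonr(seg_based, gt_based)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```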


2 | The “brilliant” (?) performances

  • What is often called “brilliant” performance usually means high PSNR or SSIM, but this can be misleading. As we show in Table 1, high PSNR or SSIM scores do not guarantee structural completeness. For example:

    • NeRF achieves SSIM 0.868 and PSNR 25.7 dB, yet its small-organ NSD is only 0.2 and vessel clDice 34.5.

    • InTomo has SSIM 0.827 but recovers virtually none of the small-organ anatomy (NSD = 0.0).

    • Conversely, traditional methods like FDK and SART have lower pixel metrics but have competitive anatomy-aware rankings.

  • Thin or elongated structures, e.g., gallbladder, renal vein branches, mesenteric segments, often disappear in “high-PSNR” and "high-SSIM" reconstructions. Pixel-wise metrics have a ceiling effect (Fig. 5) and cannot expose these failures. This motivates our shift from “pixel consistency” to “structural completeness.”


3 | The anatomy-guidance loss has directly contributed to performance gains

  • There is some miscommunication here—we did not use the proposed anatomy-aware metrics only for evaluation.

  • As shown in §3.3 and Table 2, CARE used the segmentator during training, not just for evaluation. The reconstructed CT is compared to the reference segmentation, giving the model a clear structural signal.

  • Under 50-view settings, NeRF’s large-organ NSD jumps from 22% to 58% and vessel clDice from 29% to 55%, and InTomo’s vessel clDice rises from 5% to 48%. These large, consistent gains show the anatomy-guidance loss is a key reason CARE improves reconstruction quality.

Review (Rating: 5)

The paper presents a new evaluation metric to address the reconstruction of abdominal CT. The new metric based on organ-anatomy preservation is more medically relevant than the standard pixel-level metrics PSNR and SSIM. Using the new metric, the authors introduced a new method to improve the sparse-CT reconstruction.

Strengths and Weaknesses

Strengths:

  • The new metric is very relevant for the medical community since it measures the reconstruction fidelity of the organs.
  • The metric allows the development of anatomy-aware methods.

Weaknesses:

  • Based on my understanding, the metric focuses on abdominal area only. This information should be stated in the abstract and introduction, since there are other regions such as brain and whole-body CT.
  • Caption Fig4. Add more information since the losses appear in the figure but not in the caption. Is Lc a typo?
  • Lines 247–251: The authors claim that the ordering given by SSIM/PSNR changes drastically when the methods are ranked by the new metric; however, the re-ordering is not as drastic as stated, since the best methods in terms of SSIM/PSNR are still among the top methods under the new metric. SAX-NeRF is the best method in terms of SSIM/PSNR and is the best/second best under the new metric. I do agree with the usefulness of the new metric, but I believe the statement should be toned down a bit.
  • Another concern is in terms of the quality of the Segmentator which can influence the overall performance negatively. A study testing different segmentation models should be performed.

The paper tackles an important problem of anatomy-aware metric for CT reconstruction. The proposed metric is sound and obtains good performance. However the metric is limited to abdominal organs which should be clearly stated in the paper, and experiments modifying the segmentator should be added.

Questions

see weaknesses 1, 3, 4

Limitations

yes

Final Justification

I still believe that this work will benefit the community. The authors answered all my concerns, therefore I keep my initial score.

Formatting Concerns

NA

Author Response

Thank you for giving the Accept score and acknowledging the potential of our proposed metric for guiding anatomy-aware methods in CT reconstruction.


Q1. Based on my understanding, the metric focuses on abdominal area only. This information should be stated in the abstract and introduction, since there are other regions such as brain and whole-body CT.

  • We agree that our current manuscript does not clearly state that the proposed metric is specifically designed for abdominal CT reconstruction.

  • We have revised the abstract, introduction, and method sections to explicitly clarify that our work focuses on the abdominal region, distinguishing it from applications in other anatomical areas. For example, the revised abstract now includes:

    “Our proposed anatomy-aware metric is specifically developed for evaluating abdominal CT reconstruction, leveraging organ-level segmentation within the abdominal region.”


Q2. Caption Fig4. Add more information since the losses appear in the figure but not in the caption. Is Lc a typo?

  • “Lc” is a typo and should refer to “Ls” as defined in Equation (7).

  • We have now clarified the meaning of all the losses shown in Fig. 4, both in the caption and in the main text. The updated caption reads in part:

    “...CARE can be integrated into any reconstruction method to perform its enhancement capability. The overall training is supervised by three loss terms: the noise-prediction loss L_n, the pixel-space reconstruction loss L_p, and the anatomy-guidance loss L_s.” ...


Q3. Lines 247–251 overstate the reordering effect of the new metric. While it is useful, top-performing methods under SSIM/PSNR, like SAX-NeRF, remain top-ranked under the new metric, so the claim should be toned down.

  • We agree that the statement in Lines 247–251 was overstated, and we have revised the text to remove any exaggerated claims.
    • While the overall ranking of methods may not change drastically, we believe the anatomy-aware metric provides a more sensitive differentiation between methods. For example, FDK and SART differ by only ±0.1 in SSIM/PSNR, but their differences reach -1.0 and +1.4 percent, respectively, on small organs and vessel scores under our proposed metric.
    • The revised text now reads, in part: “While the overall ranking is largely consistent under the new metric, the anatomy-aware evaluation provides finer granularity—such as the small-organ and vessel scores—enabling more meaningful comparisons between methods with similar SSIM/PSNR values.”

Q4. Another concern is in terms of the quality of the Segmentator which can influence the overall performance negatively. A study testing different segmentation models should be performed.

  • We have reported the segmentation performance of our nnUNet-based Segmentator on a large test set of 1,958 CT scans in Supp. Sec.B Tab. 3.
  • Due to the modular design, as the field of CT segmentation continues to advance, our framework can easily incorporate future improvements in segmentation methods.
  • To further address your concern, we compared our Segmentator with the top three state-of-the-art segmentation models highlighted in the Touchstone 1.0 Benchmark (Bassi, et al., NeurIPS 2024), evaluating all models on our 1,958 CT scans dataset. The complete comparison is shown below with IQR:
| Anatomical Structure | Segmentator, Dice % (IQR) | MedNeXt (Roy et al., MICCAI 2023), Dice % (IQR) | MedFormer (Wang et al., NeurIPS 2024), Dice % (IQR) | STU-Net-B (Huang et al., 2023), Dice % (IQR) |
| --- | --- | --- | --- | --- |
| Large Organs | | | | |
| Liver | 96.9 (96.4, 97.3) | 96.5 (95.8, 96.9) | 96.7 (96.2, 97.1) | 97.2 (96.8, 97.5) |
| Kidney Left | 97.7 (97.4, 98.0) | 97.2 (96.8, 97.6) | 97.1 (96.7, 97.5) | 96.9 (96.4, 97.3) |
| Kidney Right | 97.7 (97.5, 98.0) | 97.3 (97.0, 97.7) | 97.0 (96.5, 97.4) | 96.8 (96.3, 97.1) |
| Spleen | 96.4 (95.5, 97.1) | 96.5 (95.7, 97.2) | 96.8 (96.0, 97.3) | 96.0 (95.4, 96.9) |
| Pancreas | 86.5 (81.5, 89.3) | 88.1 (85.3, 90.5) | 87.2 (84.0, 89.6) | 85.9 (82.1, 88.3) |
| Small Organs | | | | |
| Gall Bladder | 90.9 (86.1, 93.4) | 90.5 (85.7, 92.8) | 91.3 (87.4, 93.5) | 91.7 (88.2, 93.8) |
| Adrenal Gland Left | 84.6 (75.2, 88.7) | 86.1 (77.3, 89.4) | 85.4 (76.9, 88.2) | 83.2 (75.0, 87.0) |
| Adrenal Gland Right | 80.1 (71.7, 83.5) | 81.4 (73.9, 86.5) | 81.2 (72.8, 85.4) | 79.5 (70.7, 83.2) |
| Celiac Artery | 65.6 (56.5, 73.4) | 66.8 (57.2, 74.3) | 67.1 (59.5, 75.2) | 65.9 (57.0, 73.1) |
| Duodenum | 86.4 (82.4, 89.5) | 85.3 (80.5, 88.1) | 84.8 (80.2, 87.4) | 85.1 (81.2, 88.3) |
| Intestinal Structures | | | | |
| Colon | 84.4 (82.6, 91.1) | 85.1 (83.0, 91.3) | 85.2 (83.7, 92.1) | 84.2 (82.1, 90.5) |
| Intestine | 75.6 (69.1, 81.4) | 76.8 (70.4, 82.3) | 76.9 (71.3, 83.0) | 75.0 (68.9, 80.7) |
| Vascular Structures | | | | |
| Aorta | 92.1 (87.4, 95.0) | 91.3 (86.8, 94.2) | 90.9 (86.1, 93.9) | 91.5 (87.0, 94.5) |
| Postcava | 85.9 (79.3, 87.4) | 86.0 (80.4, 88.2) | 85.2 (78.9, 86.9) | 84.7 (78.1, 86.3) |
| Superior Mesenteric Artery | 66.5 (58.2, 74.5) | 67.1 (59.1, 75.2) | 67.3 (60.3, 76.5) | 66.8 (58.4, 74.8) |
  • As shown, our Segmentator achieves highly competitive performance compared to these SOTA methods. We therefore believe that our choice of Segmentator for building the CARE framework is justified.

Reference

  • Bassi, Pedro RAS, et al. "Touchstone benchmark: Are we on the right way for evaluating ai algorithms for medical segmentation?" NeurIPS 2024.
  • Roy, Saikat, et al. "Mednext: transformer-driven scaling of convnets for medical image segmentation." MICCAI 2023.
  • Wang, Yihe, et al. "Medformer: A multi-granularity patching transformer for medical time-series classification." NeurIPS 2024.
  • Huang, Ziyan, et al. "Stu-net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training." arXiv 2023.
Comment

Thank you for answering my concern. No further questions.

Comment

Dear Reviewers, ACs, and PCs,

Thank you sincerely for your valuable time, detailed feedback, and thoughtful engagement with our work.

As highlighted in the reviews, our paper introduces CARE (Completeness-Aware Reconstruction Enhancement), a diffusion-based framework and evaluation metric suite designed to improve anatomical fidelity in sparse-view CT reconstruction. By leveraging a segmentation model trained on over 3,000 radiologist-annotated scans with state-of-the-art performance, CARE enables anatomy-aware learning and evaluation across multiple organ types. We show its clinical relevance through organ-wise analysis in four anatomical groups, and demonstrate consistent improvements across 9 backbone CT reconstruction models.

We have carefully addressed all reviewer comments in our rebuttal, including discussions on segmentation reliability, generalizability, the rationale behind our organ grouping scheme, and the modular design of CARE.

If you have any additional suggestions on how we might better clarify our claims, we would be grateful for your guidance.

With sincere appreciation,

The CARE Team

Final Decision

This paper tackles a key limitation in sparse-view computed tomography (CT) reconstruction: pixel-wise metrics such as PSNR and SSIM often fail to capture anatomical completeness, particularly for small anatomical structures critical in clinical practice. The authors propose an evaluation framework (called CARE - Completeness-Aware Reconstruction Enhancement) that leverages anatomy-aware metrics derived from automated segmentation using a frozen nnU-Net trained on a large annotated CT dataset. CARE is model-agnostic and can be seamlessly integrated into both analytical reconstruction methods and modern learning-based methods, such as Neural Radiance Fields and Gaussian Splatting. Empirical results (including a small-scale reader study) show that CARE substantially improves structural completeness in reconstructed abdominal CT scans across different structures of the anatomy.

The paper sparked a lively discussion among the reviewers and the authors. The authors were able to provide a sufficiently convincing rebuttal to the initial reviews, prompting a couple of reviewers to upgrade their scores.

Across the board, the reviewers agreed with the following strengths of the manuscript:

  • The clinical impact of the work and the relevance of the proposed metric within the clinical community, especially in sparse-view CT reconstruction
  • The clarity of the presentation
  • The extensive qualitative and quantitative experiments with multiple reconstruction baselines demonstrate consistent and statistically significant improvements in anatomical completeness

Some of the concerns raised by the reviewers were:

  • Dependence of the metric on a pre-trained segmentation model
  • Application of the metric only for abdominal CT
  • Increase in the computational cost due to the need to run a segmentation module and a latent diffusion model

While most of the reviewers' concerns were sufficiently and convincingly addressed, a few of them remain. Specifically, the utility of the proposed metric will truly be assessed by validating it on other anatomies, not just on abdominal CT scan reconstruction. Additionally, a more thorough comparison with other non-pixel-based metrics is also warranted. That said, given the widespread popularity of pixel-based metrics within the medical image reconstruction community despite their known flaws, any research that can address this glaring gap deserves to be presented to the broader community to accelerate progress in this crucial area.