Weakly-Supervised Cortical Surfaces Reconstruction from Brain Ribbon Segmentations
Abstract
We propose a weakly supervised method to reconstruct multiple cortical surfaces from brain MRI using ribbon segmentation maps.
Reviews and Discussion
The submission presents a deep learning-based approach for cortical surface reconstruction (CSR) from brain MRI data using weak supervision derived from cortical brain segmentation maps. The claimed contributions are:
- Weak Supervision Paradigm: The authors introduce a new weakly supervised paradigm for reconstructing multiple cortical surfaces, significantly reducing the reliance on pseudo ground truth (pGT) surfaces generated by conventional CSR methods.
- New Loss Functions: Two novel loss functions are designed to optimize the surfaces towards the boundaries of the cortical ribbon segmentation maps. Regularization terms are also introduced to enforce surface uniformity and smoothness.
- Evaluation and Performance: The proposed method is extensively evaluated on two large-scale adult brain MRI datasets and one infant brain MRI dataset, demonstrating comparable or superior performance to existing supervised DL-based CSR methods.
Strengths
- The paper presents an approach to leverage weak supervision from segmentation maps instead of relying on pGT surfaces, which is a significant departure from traditional methods.
- The methodology and experimental setup are clearly described. The authors conduct evaluations on multiple datasets, assessing both efficacy and efficiency.
- The paper is well-structured, with clear descriptions of the problem, methodology, and results. The figures and tables effectively illustrate the performance and comparisons.
- The approach addresses a critical bottleneck in CSR by reducing the dependency on time-consuming and error-prone pGT surfaces, potentially broadening the applicability of CSR methods to more diverse datasets and clinical scenarios.
Weaknesses
Method
- It seems that this work combines [1] and [2], and thus has limited technical novelty. The architecture in Figure 1 and the cycle consistency loss (Eq. 5) are almost identical to CoCSR [1]. The boundary surface loss and inter-mesh normal consistency loss (Eqs. 3-4 and Figure 2) are very similar to the loss functions proposed by [2].
- Additionally, the customized edge length loss (Eq. 6) has also been proposed by [3]. Considering the large individual differences across human brains, how did the authors choose the area A without knowing the pGT cortical surfaces?
- It is confusing that the ribbon segmentations are used as both input and pGT. The authors claimed that the ribbon segmentations are inaccurate weak supervision, yet still generated the initial surface from the ribbon segmentations according to Figure 1.
- The velocity field defined in Eq. 1 is time-dependent. How did the authors learn non-stationary velocity fields through a 3D U-Net?
- In line 156, a continuous bijection with continuous inverse is called a homeomorphism. A diffeomorphism is defined as a smooth/differentiable bijection with a smooth/differentiable inverse.
- As shown in Figure 2 (b), the WM and pial surfaces clearly do not share the same normal directions in some regions. The inter-mesh normal consistency loss could therefore cause inaccurate surface reconstruction. Could the authors provide more insights into solving this problem?
Results
- The experimental results are unreliable and unconvincing. After careful comparison, it seems that the baseline results (CorticalFlow++, CortexODE, Vox2Cortex, DeepCSR) on the ADNI and OASIS datasets in Table 1 were copied directly from Table 2 in [1]. This leads to unfair comparisons.
- Furthermore, as reported in Table 1, SegCSR produced no more than 0.061% self-intersecting faces (SIFs), whereas the authors claimed in line 264 that there are ∼0.3% on average for both white and pial surfaces. This is confusing. Which result is correct?
- In line 263, the authors claimed that DeepCSR and U-Net produced a large number of SIFs without post-processing. However, the Marching Cubes algorithm only produces topological errors such as holes, not SIFs.
- The BCP dataset only includes 19 test subjects. Cross-validation should be conducted to ensure fair evaluation of the performance.
- The flow ODE was integrated using the forward Euler method with T=5 steps. Such a large step size could cause unstable ODE solutions and fail to prevent self-intersections. The value of the Lipschitz constant should be reported to examine the numerical stability of the ODE solver.
- The authors reported that SegCSR requires only 0.37s of runtime per brain hemisphere. However, SegCSR adopts a topology correction algorithm, which may take several seconds to a few minutes, to create an initial midthickness surface for each subject. This should be included in the total runtime. A breakdown of runtime should be reported and compared to SOTA baseline approaches.
[1] Zheng, H., Li, H. and Fan, Y. Coupled reconstruction of cortical surfaces by diffeomorphic mesh deformation. Advances in Neural Information Processing Systems, 2023.
[2] Ma, Q., Li, L., Robinson, E.C., Kainz, B. and Rueckert, D. Weakly Supervised Learning of Cortical Surface Reconstruction from Segmentations. arXiv preprint arXiv:2406.12650, 2024.
[3] Chen, X., Zhao, J., Liu, S., Ahmad, S. and Yap, P.T. SurfFlow: A Flow-Based Approach for Rapid and Accurate Cortical Surface Reconstruction from Infant Brain MRI. MICCAI, 2023.
Questions
- Can the authors elaborate on the key differences between their approach and [1,2,3], particularly in terms of methodology and experimental setup?
- How does the proposed boundary surface loss function improve upon the traditional bi-directional Chamfer loss used in existing methods?
- Can the authors provide more details on the computational efficiency and runtime comparisons with existing CSR pipelines?
Limitations
The authors have addressed some limitations, but further clarity on the following aspects would be beneficial:
- The efficacy of SegCSR is influenced by the quality of pGT segmentations. More discussion on how to handle low-quality segmentations would be helpful.
- The constraint on inter-mesh consistency of deformation might affect the anatomical fidelity of pial surfaces. Further exploration of this trade-off is necessary.
- The method could be tested on more diverse cohorts to demonstrate its efficacy across various imaging qualities and subject demographics.
C1: Limited novelty: this work combines [1] and [2]. Explain the differences in methodology and experimental setup from [1-3].
A1: Our SegCSR framework is model-agnostic, and we chose CoCSR [1] as the baseline because it is the SOTA and able to reconstruct multiple cortical surfaces simultaneously. SegCSR is weakly supervised by pGT segmentations, while [1] is a supervised method. Our other contributions include the loss functions and regularizations.
Seg2CS [2] is a concurrent study that was posted on arXiv in mid-June, while our paper was submitted to NeurIPS in mid-May. Their findings should not diminish the significance of our contributions. Technically, their boundary loss is similar to our Eqs. 2 & 3, their inflation loss forces the deformation along the normal direction of the initial WM surface, and it is unclear whether they reconstruct surfaces sequentially or simultaneously. In contrast, our method uses the midthickness surface to bridge the inner and outer surfaces and computes the inter-mesh normal consistency loss between the deformed WM and pial surfaces, which is stable in training and requires no delicate gradient computation as [2] does. Moreover, our method reconstructs multiple cortical surfaces simultaneously.
SurfFlow [3] deforms a sphere template to the WM and pial surfaces sequentially and validates on an infant dataset. It is a supervised flow-based method similar to CF++. Its main contributions are starting from the sphere template, a recurrent network design, and regularization terms.
In terms of experimental setting, [1] and [3] (and other supervised methods) use the pGT surfaces from conventional pipelines for training and testing. For [2] and ours, we utilized pGT segmentations for training and pGT surfaces from conventional pipelines for testing. However, [2] did not report the fully supervised results of their model, so the performance gap is unclear, whereas CoCSR can be seen as the upper bound of our method.
C2: The edge length loss (Eq. 6) is proposed by [3]. How is the area A chosen?
A2: Eq. 6 consists of two terms and only the edge length loss is inspired by [3]. The other term promotes the surfaces’ smoothness. We will highlight this in the revision.
The area A is estimated on the surfaces generated from the pGT segmentations by computing the average facet area of the WM and pial surfaces, respectively. Although there is no GT for A, this term only serves to force the triangular faces to be uniformly distributed, and we apply a smaller weight to it.
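For concreteness, a minimal sketch of how such an average facet area could be computed from a triangle mesh, assuming (vertices, faces) NumPy arrays; the function name and usage lines are illustrative, not the authors' actual code:

```python
import numpy as np

def mean_facet_area(vertices: np.ndarray, faces: np.ndarray) -> float:
    """Average triangle area for a mesh with (V, 3) vertices and (F, 3) face indices."""
    v0, v1, v2 = vertices[faces[:, 0]], vertices[faces[:, 1]], vertices[faces[:, 2]]
    # Triangle area is half the norm of the cross product of two edge vectors.
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    return float(areas.mean())

# A_wm = mean_facet_area(wm_verts, wm_faces)        # from the pGT-derived WM surface
# A_pial = mean_facet_area(pial_verts, pial_faces)  # from the pGT-derived pial surface
```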
C3: Confusing: ribbon segmentations are used as both input and pGT.
A3: The initialization is only an approximation. Closeness to the true surfaces and correctness of topology are the two major requirements, and this initialization satisfies both.
The cortical ribbon segmentations contain structural/semantic info about the cortical sheet and can act as an attention guide for extracting informative features around its boundaries, thus supplementing the feature extraction along with the raw image and SDFs. Our preliminary experiments validate its effectiveness, and we will add the results to the revision.
C4: The velocity field in Eq. 1 is time-dependent. How can non-stationary velocity fields be learned through a 3D U-Net?
A4: Eq. 1 describes the SVF framework if the velocity field v is constant over time; if v is time-varying, Eq. 1 becomes the LDDMM model. In this paper, the integration is computed by adding sampled velocities step by step, which can be seen as sampling from a series of SVFs over the corresponding time intervals.
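As a rough illustration of this step-wise scheme, the sketch below integrates a flow ODE with forward Euler in PyTorch, trilinearly sampling a velocity volume at the current vertex positions at each step. All names are hypothetical, and details such as grid_sample's (x, y, z) coordinate ordering would need to match the actual implementation:

```python
import torch
import torch.nn.functional as F

def integrate_flow(verts: torch.Tensor, velocities: list, T: int = 5) -> torch.Tensor:
    """Forward Euler integration of a flow ODE.
    verts: (1, N, 3) vertex coordinates in grid_sample's normalized [-1, 1] space.
    velocities: list of volumes of shape (1, 3, D, H, W); one field per sub-interval
    realizes the 'series of SVFs' reading, a single repeated field is a plain SVF."""
    h = 1.0 / T  # step size
    for t in range(T):
        vol = velocities[t % len(velocities)]
        grid = verts.view(1, 1, 1, -1, 3)  # (1, 1, 1, N, 3); ordering must be (x, y, z)
        v = F.grid_sample(vol, grid, align_corners=True)  # (1, 3, 1, 1, N)
        verts = verts + h * v.view(1, 3, -1).permute(0, 2, 1)
    return verts
```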
C5: Line 156, homeomorphism vs. diffeomorphism.
A5: Thanks. We will update the terminology in Line 156 to accurately reflect the use of diffeomorphism and homeomorphism.
C6: Fig. 2(b), the WM and pial surfaces do not have the same normal directions in some regions. The inter-mesh normal consistency loss could cause inaccuracy. More insights to solve this problem?
A6: We agree that the WM and pial surfaces are not 100% parallel, especially considering the nuanced differences between corresponding vertices. This loss term addresses the problem that the surface cannot be deformed into deep sulcal regions due to severe partial volume effects (PVE). We also propose the intensity gradient loss to position the surface and incorporate the mesh quality loss to further improve mesh quality (smoothness and edge length).
C7: Experimental results are unreliable and unconvincing. The baseline results on the ADNI and OASIS datasets in Table 1 are the same as those in Table 2 of [1]. Unfair comparisons.
A7: The authors of [1] kindly provided us with their dataset split, pre-processing and program code, and pre-trained models. We replicated their results, which are very close to what was reported in [1]. And we conducted all the experiments on the same datasets.
C8: Table 1, SegCSR produced < 0.061% SIF; Line 264 “∼0.3% on avg for both surfaces”. Confusing. Which one is correct?
A8: 1) Line 264 describes the adult datasets only; 0.061% is for the infant data. 2) The average should be no larger than 0.01%. We will correct this.
C9: Line 263, "DeepCSR and U-Net produced a lot of SIFs without post-processing". But the MC algorithm produces topological errors, not SIFs.
A9: While the MC algorithm may not directly produce SIFs, the outputs from DeepCSR and U-Net can introduce such artifacts, which require additional post-processing to correct. We will clarify this in the revision.
C10: BCP dataset only has 19 test subjects. Cross-validation is needed for fair evaluation.
A10: The authors of [3] kindly provided us with their dataset split and pre-processing/program code. They already employed stratified sampling by age to construct the dataset and maintain balance. We will use k-fold cross-validation to further evaluate our method.
We appreciate Reviewer LNir's effort in providing more than a dozen comments. Due to the 6,000-character limit, we cannot address all of them in the rebuttal. We hope the reviewers and ACs will consider our additional responses provided separately.
My primary concern remains that the contribution is not novel enough to stand out in the context of existing literature. Despite the distinctions the authors have highlighted, the similarities with prior works suggest that this approach may not offer sufficiently new insights. This could make it difficult for this paper to make a significant impact within the community.
Thank you for taking the time to review our rebuttal. Hopefully, the following response will help address your remaining concerns about the novelty and contributions of our work.
Based on a thorough review of cortical surface reconstruction (CSR) in “Sect. 2: Related Works”, including 1) traditional CSR methods (FreeSurfer, BrainSuite, HCP, dHCP, and iBEAT V2.0) and 2) DL-based CSR methods (SegRecon, DeepCSR, PialNN, TopoFit, vox2cortex, the CorticalFlow series, SurfFlow, CortexODE, and CoCSR), we would like to highlight:
- A novel weak supervision framework. Unlike existing DL-based methods that rely heavily on pre-computed pGT surfaces from traditional pipelines, our approach introduces a weakly supervised framework that reduces this dependency. This is particularly important in scenarios where traditional methods struggle, such as infant data, offering an alternative for CSR that is more flexible and generalizable.
- New loss functions and regularizations. We have introduced the uni-directional Chamfer loss in Eq. 3, the inter-mesh normal consistency loss in Eq. 4, the intensity gradient loss, and regularization techniques (mesh quality loss). These innovations directly address the challenges of utilizing pGT segmentation maps as weak supervision, which have not been explored in the existing literature.
- We chose CoCSR as our backbone because it is the SOTA method and can reconstruct multiple cortical surfaces simultaneously. It serves as a fully supervised upper bound for our weakly supervised method, providing a clear benchmark for future studies in weakly supervised CSR tasks. To the best of our knowledge at submission, we are the first to study this problem and address this bottleneck in CSR.
- Regarding the mesh quality loss, while inspired by [3] for the customized edge length loss, we further enhanced it by combining it with a normal consistency loss to improve the quality of the reconstructed surfaces. Both terms are useful for improving mesh quality. We will highlight the differences between ours and those in [1] and [3] in the revised paper.
- Concerning the concurrent work [2] mentioned by Reviewer LNir, it was posted on arXiv one month after our submission, making it impossible for us to compare with it at that time. Importantly, our method differs from [2] in four key aspects: a) the inter-mesh normal consistency loss, which is stable in training and requires no delicate gradient computation as [2] does; b) simultaneous reconstruction of multiple cortical surfaces; c) a more thorough experimental comparison with both fully supervised and weakly supervised methods on various adult and infant datasets; and d) more validation on downstream tasks (e.g., reproducibility on the same subject, cortical thickness estimation).
- Broad applicability across datasets. Our method has been tested on both adult and infant datasets, demonstrating its adaptability across different imaging conditions. This broad applicability underscores the potential impact of our work in various real-world scenarios, where traditional methods may fall short.
We will stress the differences between our method and existing literature to highlight its unique contributions and potential for impact.
In summary, we believe this study has demonstrated the potential of weakly supervised CSR and hope it will ignite future research in this direction, as well as in other mesh reconstruction tasks (e.g., heart, bone, as mentioned by Reviewer DDcT).
The authors proposed a novel method to jointly reconstruct multiple cortical surfaces using weak supervision from brain MRI ribbon segmentations, which deforms the midthickness surface inward and outward to form the inner (white matter) and outer (pial) cortical surfaces. The proposed method is evaluated on two large-scale adult brain MRI datasets and one infant brain MRI dataset, demonstrating comparable or superior performance in CSR in terms of accuracy and surface regularity.
Strengths
- Propose a new weakly supervised paradigm for reconstructing multiple cortical surfaces, reducing the dependence on pGT cortical surfaces in training, unlike existing DL methods.
- Design two loss functions to optimize the surfaces towards the boundary of the cortical ribbon segmentation maps, along with regularization terms to enforce the regularity of surfaces.
- Conduct extensive experiments on two large-scale adult brain MRI datasets and one infant brain MRI dataset.
Weaknesses
- The manuscript seems to overclaim. The ‘pseudo’ ground-truth surfaces mentioned in the manuscript are actually the ground-truth meshes in other approaches, obtained by Marching Cubes/FreeSurfer. Since the Chamfer distance is used to guide the network training, why do the authors claim the proposed method is weakly supervised?
- It is not clear how the original images are overlaid with the predicted mesh. Is any registration used? Details are missing.
- It seems the main contribution of the proposed SegCSR is the boundary loss function?
Questions
- Why not use the full ADNI dataset for network training, as is done in previous research like DeepCSR and vox2cortex?
- How are the predicted meshes overlaid on the original images? Details should be given.
- What do ‘L-Pial Surface’ and ‘L-WM Surface’ in the tables mean? The pial and WM surfaces of the left hemisphere? Why not also present the results for the right hemisphere?
Limitations
The limitations are discussed in the manuscript.
C1: Overclaim. The pseudo ground truth (pGT) surface mentioned in the manuscript seems the GT mesh in other approaches, obtained by Marching Cubes (MC)/FreeSurfer. Why is the proposed method weakly supervised?
A1: We have summarized the supervision signals used by our method and others in Table 4 of the Supplementary Materials. Mainstream supervised methods rely on conventional pipelines (e.g., FreeSurfer, iBEAT) to generate pGT cortical surfaces (pGT surfs) for both training and testing. In contrast, the main novelty of our work is that SegCSR is weakly supervised using pGT segmentations (pGT segs), without requiring pGT cortical surfaces generated by these conventional pipelines.
As discussed in the Introduction, conventional pipelines are time-consuming for extracting pGT surfaces, especially for large datasets. Moreover, they may fail to produce acceptable pGT surfaces, e.g., for infant MRI. In contrast, pGT segmentations are relatively easy to acquire, e.g., using fast DL-based methods.
SegCSR only utilizes the pGT segmentations to extract surfaces using the MC algorithm. These surfaces provide noisy and weak supervision, particularly for the pial surface, which is significantly affected by the partial volume effect (PVE) in the segmentations and fails to capture deep cortical sulci (Fig. 2-c). To address this, we propose a novel uni-directional loss function and other regularization terms to predict high-quality cortical surfaces (Fig. 3). We will highlight these contributions in the revision.
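For illustration, a minimal sketch of how such noisy boundary targets could be extracted from a ribbon segmentation with scikit-image's Marching Cubes; the label convention (1 = WM, 2 = cortical GM) is an assumption, not necessarily the one used in the paper:

```python
import numpy as np
from skimage import measure

def boundary_surface(seg: np.ndarray, labels):
    """Extract the outer boundary of the union of the given labels via Marching
    Cubes. The result is a noisy supervision target, not a topology-guaranteed
    surface (holes and PVE-bridged sulci may remain)."""
    mask = np.isin(seg, list(labels)).astype(np.float32)
    verts, faces, normals, _ = measure.marching_cubes(mask, level=0.5)
    return verts, faces, normals

# Assumed label convention: 1 = WM, 2 = cortical GM.
# wm_target = boundary_surface(seg, [1])       # white matter boundary
# pial_target = boundary_surface(seg, [1, 2])  # outer (pial) boundary
```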
C2: How are the predicted meshes overlaid on the original images? Is registration used?
A2: The reconstructed surfaces are in the same coordinate system as the original images, resulting in an exact match. No additional registration or post-processing is required.
This is because the ribbon segmentations are generated from the original images – naturally aligned with the images. The initial surface is extracted from these segmentations and remains aligned, while the deformation of the cortical surfaces occurs within the same space as the image and the initial surface. Thus, the resulting surfaces align perfectly with the original images.
For visualization, we simply load the reconstructed surfaces and the original image into FreeSurfer and capture screenshots for the 2D projection. We will include a more detailed description of the visualization process in the revision.
C3: It seems the main contribution of the proposed SegCSR is the boundary loss function?
A3: Our contributions are
- Weak Supervision Framework: We propose a novel weakly supervised learning approach that leverages pGT segmentations instead of relying on fully annotated surfaces. This approach reduces the dependence on labor-intensive data preparation and addresses issues with conventional methods, such as difficulty in generating accurate pGT surfaces for certain datasets like infant MRIs.
- Loss Functions: In addition to the uni-directional boundary loss function, we introduce other novel loss functions (e.g., intensity gradient loss) to handle the specific challenges of CSR, particularly the PVE and the inability to capture deep cortical sulci. These loss functions help ensure accurate and high-quality surface prediction.
- Regularization Terms: We incorporate additional regularization terms that contribute to the accurate prediction and smoothness of cortical surfaces, enhancing the overall quality of the reconstructed surfaces.
- Evaluation: We conduct extensive experiments on 2 large-scale adult brain MRI datasets and 1 infant brain MRI dataset. Our new method achieves comparable or superior performance compared to existing DL-based methods.
C4: Why not use the total ADNI datasets for training like DeepCSR and vox2cortex?
A4: ADNI is a large-scale longitudinal study and data are collected in different batches. We utilize the ADNI-1 baseline/screen official release (Line 227), consisting of 817 1.5T T1-weighted brain MRIs from subjects aged 55 to 90, including normal controls (NC), mild cognitive impairment (MCI), and Alzheimer’s disease (AD). This official release dataset is representative, covering a wide age range and balanced target population, including various subject conditions (NC, MCI, AD), ensuring stable model training, and facilitating fair comparisons across methods.
Previous methods have used varying amounts of data from ADNI: DeepCSR and CF++ used 3876 MRIs, vox2cortex used 1647 MRIs, cortexODE used 524 MRIs. The data selection criteria for these methods are not always clear, but it appears they may have combined multiple scans of the same subjects from ADNI or other resources. We chose to follow CoCSR to use the official release of ADNI-1 (817 MRIs) for fair comparison.
Additionally, we utilized the OASIS-1 dataset, consisting of 413 T1w scans from subjects aged 18 to 96 years, including NC and AD subjects; the BCP dataset, consisting of 121 subjects aged 2 weeks to 12 months; and the Test-retest dataset, consisting of 120 T1w scans from three subjects aged 26 to 31.
In summary, our method has been evaluated on MRI scans of diverse subjects. We will expand our evaluation to include more ADNI batches in future work.
C5: ‘L-Pial Surface’ & ‘L-WM Surface’ means the Pial & WM surface of the left hemisphere? Why not also present the results for the right hemisphere?
A5: Yes, 'L-Pial Surface' refers to the pial surface of the left hemisphere, and 'L-WM Surface' refers to the WM surface of the left hemisphere.
The results for the right hemisphere are reported in the Supplementary Materials. We observed that the surfaces of both hemispheres are relatively symmetric, and the results are similar. Thus, we presented only the left hemisphere results in the main paper to conserve space. We can briefly mention the right hemisphere results in the paper or provide additional details if necessary.
Dear Reviewer zPL4,
Thank you for taking the time to review our manuscript. We hope that our detailed responses have adequately addressed your concerns and clarified the merits of our work.
If you find that we have resolved the issues raised, we kindly request that you reconsider our paper and your final rating. Your reassessment would be greatly appreciated and would help reflect the improvements made based on your valuable feedback.
Should you have any further questions or require additional clarifications, please do not hesitate to comment. With the few hours remaining for the author-reviewer discussion, we will do our best to provide any information needed.
Thank you once again for your consideration.
Authors of Submission-6996
The paper presents a deep learning approach to jointly reconstruct multiple cortical surfaces using weak supervision from brain ribbon segmentations derived from brain MRIs. The method leverages the midthickness surface and deforms it inward and outward to fit the inner and outer cortical surfaces by jointly learning diffeomorphic flows. Regularization terms are included to promote uniformity, smoothness, and topology preservation across the surfaces. Experiments are conducted on large-scale adult and infant brain MRI datasets.
Strengths
- The approach is novel in its use of weak supervision from readily available segmentation datasets, which reduces the burden of preparing pseudo-ground truth surfaces.
- The paper is well-written and structured, with a clear motivation for the method.
- The methodology is explained in detail, and the experiments are comprehensive.
- The approach has the potential to democratize the use of deep learning in cortical surface reconstruction by leveraging existing segmentation datasets.
Weaknesses
- The paper's central contribution of weak supervision is undermined by the fact that the model is trained on pseudo ground truth surfaces for white matter and pial surfaces.
- The experimentation is limited to brain cortical surfaces and MRI images. Broader experiments involving different anatomies (e.g., bone cortical surfaces, heart walls) and imaging modalities would enhance the paper's impact.
- Results lack statistical significance analysis to validate sub-millimeter reconstruction errors.
- There is no evidence showing that improvements in mesh reconstructions correlate with enhanced performance in downstream analysis tasks.
- The robustness of the method regarding input noise/perturbation and images from multiple centers is not evaluated.
- There is no analysis of the computational complexity, including the resources and time savings provided by the proposed weak supervision.
- There is no sensitivity analysis on the choice of weights used to weigh the different components of the overall loss.
- The impact of ribbon segmentations quality (e.g., voxel spacing) as weak supervision is not investigated.
Questions
- Can you provide evidence or analysis showing that improvements in mesh reconstructions lead to enhanced performance in downstream analysis tasks?
- How does the method perform with input noise or perturbations? What is the expected performance under domain shifts?
- What are the computational resources and time requirements saved by using weak supervision compared to traditional methods?
- How does the quality of ribbon segmentations (e.g., voxel spacing) impact the reconstruction accuracy?
Limitations
Yes.
C1: The paper's central contribution of weak supervision is undermined by the fact that the model is trained on pseudo ground truth (pGT) surfaces for WM and pial surfaces.
A1: Previous DL methods typically rely on pGT surfaces from conventional pipelines as optimization targets, which we refer to as supervised methods. In contrast, our method utilizes only cortical ribbon segmentations for supervision, making it a weakly-supervised approach in terms of the accuracy and quality of the supervision signal. Although we generate pGT from these segmentations, the supervision remains coarser and less accurate compared to traditional supervised methods. This approach significantly reduces the computational cost for preparing pGT surfaces, making it more efficient.
C2: Experimentation is limited to CSR from MRIs. Broader experiments in more anatomies (e.g., bone, heart) and imaging modalities would enhance the paper's impact.
A2: We appreciate the reviewer's suggestion. Experimenting with different anatomies and imaging modalities would be valuable.
Our current focus on cortical surfaces of complex structures has driven us to develop advanced network architectures, loss functions, and regularizations. We believe our method could potentially generalize to simpler structures (e.g., bone, heart). But different modalities and anatomical challenges (e.g., non-watertight topology of the heart) may necessitate specialized model designs. We will discuss this in the revision and plan to explore these areas in future studies.
C3: Results lack statistical significance analysis to validate sub-millimeter reconstruction errors.
A3: We have conducted an independent t-test to assess the statistical significance of our results compared to other baseline models. For example, on the ADNI dataset, our method shows statistically significant improvements over DeepCSR, 3D U-Net, CF++, and vox2cortex, demonstrating its effectiveness. We will report this in the revision.
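A minimal sketch of such a significance test with SciPy, using placeholder per-subject ASSD values; the actual numbers and test configuration are those reported in the paper, not the ones shown here:

```python
import numpy as np
from scipy import stats

# Placeholder per-subject ASSD values on a shared test set (not real results).
assd_ours = np.array([0.32, 0.35, 0.31, 0.36, 0.33])
assd_baseline = np.array([0.41, 0.44, 0.39, 0.45, 0.42])

t_stat, p_value = stats.ttest_ind(assd_ours, assd_baseline)  # independent t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Since both methods are evaluated on the same subjects, a paired alternative is
# stats.ttest_rel(assd_ours, assd_baseline).
```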
C4: Provide evidence or analysis showing that improvements in CSR lead to enhanced performance in downstream analysis tasks?
A4: In Sect. 4.4, we conducted a reproducibility analysis, which is vital for cortical morphology studies as it assesses the consistency of measurements over time. Our SegCSR showed superior performance compared to DeepCSR and comparable results to the SOTA methods like cortexODE and CoCSR. This indicates that SegCSR can be reliably used for studying cortical thickness changes in patients.
Furthermore, we performed an experiment on cortical thickness estimation across a group of 200 subjects, comparing the results obtained from SegCSR with those from FreeSurfer. The high correlation (R = 0.94) between the two methods demonstrates our framework's capability to accurately capture cortical thickness, making it an alternative to both traditional and deep learning-based methods. We will include these results in the revision.
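One plausible way such per-subject thickness values could be obtained, given the vertex correspondence the framework provides, is the per-vertex WM-pial distance sketched below; whether the paper uses this exact definition is not specified, and the arrays are placeholders:

```python
import numpy as np
from scipy import stats

def mean_thickness(wm_verts: np.ndarray, pial_verts: np.ndarray) -> float:
    """With one-to-one vertex correspondence between the deformed WM and pial
    surfaces, thickness can be taken as the per-vertex Euclidean distance."""
    return float(np.linalg.norm(pial_verts - wm_verts, axis=1).mean())

# Hypothetical per-subject values over a group; r would correspond to the
# reported correlation with FreeSurfer (R = 0.94 in A4).
# thickness_ours = np.array([mean_thickness(w, p) for w, p in subject_meshes])
# r, p = stats.pearsonr(thickness_ours, thickness_freesurfer)
```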
C5: The robustness of the method regarding input noise/perturbation and images from multiple centers? Expected performance under domain shifts?
A5: We appreciate the reviewer's insightful comment.
- Our method has been evaluated on two adult datasets and one infant dataset. Notably, the infant data has a smaller region of interest and a lower quality compared to adult MRIs, yet our method performs reasonably well, demonstrating adaptability to different image qualities.
- The ADNI-1 dataset used in our study includes images collected from different scanners over time, resulting in a range of image qualities. This diversity helps assess the robustness of our method across various imaging conditions.
- We conducted an experiment on the OASIS dataset by adding Gaussian noise to the images. The results showed that the CSR performance did not degrade significantly (pial surface, ASSD: 0.321 vs. 0.329), indicating robustness to input noise.
- Due to limited rebuttal time, we could not complete a comprehensive evaluation using more diverse images from the ADNI, OASIS, and HCP datasets. We plan to conduct further experiments to assess the method's robustness across a broader range of imaging modalities and conditions.
C6: No analysis of the computational complexity (resources & time) of the proposed SegCSR.
A6: Please refer to my response "A4 to Reviewer Guru" for a detailed comparison of time efficiency. The GPU memory is as follows:
| GPU (GB) | DeepCSR | 3D U-Net | CF++ | cortexODE | vox2cortex | CoCSR | Ours |
|---|---|---|---|---|---|---|---|
| Training | 3.2 | 9.2 | 11.7 | 5.8 | 9.8 | 8.7 | 8.7 |
| Inference | 1.5 | 3.0 | 3.1 | 2.0 | 3.8 | 2.9 | 2.9 |
C7: There is no sensitivity analysis on the choice of weights used to weigh the different components of the overall loss.
A7: Please refer to my response "A5 to Reviewer Guru" for weights of loss functions.
C8: The impact of ribbon segmentations quality (e.g., voxel spacing) as weak supervision is not investigated.
A8: We have conducted an experiment on a subset of the OASIS dataset (100 samples for training and 30 for testing) by resampling the data to a 2mm resolution. The results of the ASSD on both surfaces are summarized below. We will include a comprehensive experiment on the entire dataset in the revision.
| L-Pial | DeepCSR | 3D U-Net | Ours |
|---|---|---|---|
| 1mm | 0.685 | 0.363 | 0.357 |
| 2mm | 1.795 | 1.030 | 0.496 |
| L-WM | DeepCSR | 3D U-Net | Ours |
|---|---|---|---|
| 1mm | 0.646 | 0.256 | 0.249 |
| 2mm | 1.414 | 0.597 | 0.471 |
Thanks for the clarifications and additional results. The authors have addressed my questions. I will raise my score accordingly.
Thank you for taking the time to thoughtfully review our paper and consider our clarifications and additional results. We appreciate your willingness to reassess our work and raise the score.
The paper presents a novel deep learning method for the reconstruction of cortical surfaces from 3D MRI. The proposed method follows an approach learning explicit surface deformations, in which a CNN is used to predict three velocity fields, corresponding to the pial, white matter, and midthickness surfaces. Unlike previous techniques which use cortical surface pseudo ground truth (e.g., generated using FreeSurfer), the proposed method trains the network with faster-to-obtain segmentation pseudo ground truth. In addition to the standard surface prediction losses (based on the Chamfer distance), the method uses 1) an Inter-Mesh Normal Consistency loss that encourages the pial and WM surfaces to be locally parallel, 2) an Intensity Gradient loss that places the surfaces at regions of high intensity gradients, 3) a Cycle Consistency loss enforcing inverse consistency between the midthickness-to-pial deformation and the midthickness-to-WM one, and 4) a Mesh Quality loss that helps obtain regular surface meshes (uniformly sized triangles and smoothly varying normals). The method is evaluated on the ADNI, OASIS, and BCP datasets, where its performance is compared to that of implicit and explicit approaches. Results show that the method obtains better reconstruction accuracy than other techniques trained in a weakly supervised setting (pGT segmentation masks), but lower performance than those trained with pGT cortical surfaces.
Strengths
- The proposed method differs from previous approaches that learn explicit surface deformations by predicting a midthickness surface and incorporating additional loss terms that compensate for the weak supervision of pGT segmentations.
- Experiments, involving three different datasets and comparing against several recent baselines, as well as including various ablation variants, are well designed. Results indicate superior performance in the weakly supervised setting.
Weaknesses
- The main motivation of the proposed method is doubtful. The authors motivate the need for their weakly supervised cortical reconstruction method by the "prolonged processing time for generating pGT surfaces". However, as the pGT cortical surfaces can be generated automatically in an offline step, I believe the argument is weak. Moreover, recent pipelines for brain image processing, such as FastSurfer, can extract surfaces with comparable accuracy in a fraction of the time.
- The accuracy of the proposed method is considerably lower than that of approaches trained on cortical surfaces. Furthermore, while it produces fewer topological artifacts like self-intersecting faces, those can be removed via post-processing in implicit methods like DeepCSR. Combined with my previous comment, the advantages of the method are unclear.
- The ablation study in Table 2 indicates that most of the proposed loss terms have limited impact on the overall performance. For example, adding the Mesh Quality loss seems to actually degrade performance in terms of CD, ASSD, and HD.
Questions
- How does your method compare to other approaches in terms of training and inference time?
- The proposed method has several hyper-parameters (lambda1-5) that need to be tuned. How were the values selected for these hyper-parameters, and how sensitive is the method to the chosen values?
- In Figure 2, why is the pial surface represented with two different colors (orange and purple)?
- In Eq. (4), how do you compute the pial and WM surface normals if the point is on the midthickness surface?
- p6: "where n_p^G and n_p^W are the normal vectors of the deformed vertex p on S_M and S_G respectively": Do you mean on S_G and S_W?
- p6: "segmentaions"
- Section 4.2: Do you mean Table 1?
- p8: "nromal"
- p9: "Also, We can"
See weaknesses for main comments to answer.
Limitations
Limitations are reasonably identified in the Conclusions section of the paper.
C1: The main motivation (prolonged time for generating pGT surfaces) is doubtful because the pGT surfaces can be generated automatically offline. Recent pipelines, e.g., FastSurfer, can extract surfaces in a fraction of the time.
A1: The lengthy time to generate pGT surfaces is not our only motivation. Our key motivations are:
- Conventional pipelines involve multiple processing steps, leading to lengthy processing time.
- Each pipeline requires meticulously tuned parameters, posing challenges for generalization across diverse data domains, age groups, or acquisition protocols.
- DL-based methods have improved efficiency but rely on pGT surfaces from conventional pipelines, increasing computational cost of training data preparation.
We will highlight all these aspects in revision.
Although pGT surfaces can be computed offline, users often need to try different pipelines for different types of data. E.g., FreeSurfer (FS), designed for adult MRIs, does not perform well on infant BCP data. In contrast, segmentation offers a cohort-independent path to CSR and is a well-studied field with established tools; e.g., SynthSeg [1] can robustly segment diverse brain images.
FastSurfer's recon-surf pipeline is largely based on FS and incorporates a spectral spherical embedding for CSR. Although faster, it still takes 1-1.5 h of runtime and has specific image quality requirements (e.g., a voxel size no coarser than 1mm or equivalent).
Our DL-based method aims to reduce reliance on pGT surfaces from conventional pipelines and is as fast as other DL alternatives.
We will emphasize these points more clearly in revision.
[1] Robust machine learning segmentation for large-scale analysis of heterogeneous clinical brain MRI datasets
C2-1: The accuracy is lower than that of methods trained on pGT cortical surfaces.
C2-2: While the SIF is lower, the artifacts can be removed via post-processing as in DeepCSR.
C2-3: The advantages of the method are unclear.
A2-1: Our weakly supervised CSR method is trained with segmentation supervision and evaluated by comparing to surfaces from conventional pipelines. In contrast, supervised methods are trained and tested on surfaces from conventional pipelines. It is expected that our method may not outperform all supervised methods in terms of accuracy. However, our method outperforms DeepCSR and 3D U-Net (weakly sup.) as well as CF++ and vox2cortex (sup.), performs comparably to cortexODE (sup.), and is only inferior to CoCSR (sup.). These baselines are diverse, representative, and strong, making our comparisons reliable and comprehensive.
A2-2: 1) The post-processing in DeepCSR is on the level set, but our method directly deforms the mesh. 2) Mesh-based post-processing is possible. Our method is flexible enough to incorporate such steps if desired. 3) In terms of SIF, our method is either superior or comparable to all DL baselines.
A2-3: The advantages are: 1) Our weakly supervised paradigm for CSR doesn't depend on pGT cortical surfaces for training, unlike existing DL methods. Segmentation is much easier to obtain. 2) New loss functions to optimize and regularize surfaces, facilitating easy training and inference. 3) Our method outperforms weakly supervised methods and some supervised methods, significantly narrowing the performance gap w.r.t. supervised methods.
C3: Tab 2, ablation study, loss terms have limited impact on the overall performance. E.g., adding the mesh quality loss degrades performance (CD, ASSD, HD).
A3: The loss terms are designed to complement each other, balancing accuracy and mesh quality (Lines 284-302). While the mesh quality loss leads to slightly degraded CD, ASSD, and HD, it helps reduce SIFs, improving the quality of the reconstructed surfaces. We will provide more visual results to show their impact.
C4: Compare your method with other methods on training and inference time.
A4: The runtime for one training iteration and for inference across different methods is as follows:
| Time (s) | DeepCSR | 3D U-Net | CF++ | cortexODE | vox2cortex | CoCSR | Ours |
|---|---|---|---|---|---|---|---|
| Training | 0.98 | 1.97 | 4.63 | 1.8 | 2.2 | 2.1 | 2.5 |
| Inference | 125 | 0.96 | 1.84 | 0.49 | 0.37 | 0.22 | 0.24 |
C5: How were the values selected for the hyper-parameters of loss terms? How sensitive are they?
A5: Selection: First, we identified reasonable ranges for the hyper-parameters based on prior work and preliminary experiments; for example, we found that the weights of the two boundary loss terms should be of the same order and generally larger than those of the regularization terms. Second, we used cross-validation on a subset of our dataset, incrementally adjusting the values to find an optimal set.
Sensitivity: Our preliminary results indicate that the method is relatively robust within certain ranges: performance is stable as long as each weight stays within the range identified above, while outside these ranges, particularly for the boundary loss weights, we observed a more noticeable impact on overall performance. We will include more detailed analyses and discussions in the revised paper.
C6: Fig 2, why pial surface in two colors?
A6: The purple surface corresponds to the pGT pial surface from conventional methods, while the orange surface represents the pial surface generated from the GM segmentation map.
C7: Eq. 4, how to compute the pial and WM surface normals?
A7: There is a one-to-one correspondence among the vertices of the deformed surfaces. The normals, n_p^G and n_p^W, are computed for the corresponding deformed vertex p on the pial surface S_G and the WM surface S_W, respectively (Line 191).
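A minimal sketch of how such a loss could be computed in PyTorch, assuming the deformed WM and pial meshes share face connectivity so that vertex normals correspond one-to-one; this is an illustration under those assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def vertex_normals(verts: torch.Tensor, faces: torch.Tensor) -> torch.Tensor:
    """Area-weighted vertex normals for verts (N, 3) and faces (F, 3) long tensor."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    fn = torch.cross(v1 - v0, v2 - v0, dim=1)  # face normals (area-weighted)
    vn = torch.zeros_like(verts)
    for i in range(3):  # accumulate each face normal onto its incident vertices
        vn = vn.index_add(0, faces[:, i], fn)
    return F.normalize(vn, dim=1)

def inter_mesh_normal_consistency(wm_verts, pial_verts, faces):
    """1 - cosine similarity between corresponding vertex normals of the deformed
    WM and pial surfaces; shared connectivity provides the correspondence."""
    n_wm = vertex_normals(wm_verts, faces)
    n_pial = vertex_normals(pial_verts, faces)
    return (1.0 - (n_wm * n_pial).sum(dim=1)).mean()
```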
C8: 1) Line 191, do you mean S_G and S_W? 2) Sect. 4.2, do you mean Table 1? 3) Other typos.
A8: 1) Yes. 2) Yes. 3) Thanks. We will fix all of them.
I thank the authors for carefully answering my comments. After reading the rebuttal, I think the paper brings interesting contributions in terms of methodology; however, I am still not fully convinced about the method's usefulness in a real-life application. From my understanding, the main advantage of the method is that it avoids the need to wait for existing pipelines like FreeSurfer or FastSurfer to generate the pGT surfaces. However, since this is done in a pre-processing step (and volumes can be processed in parallel on a server), it seems like a small price to pay for better accuracy during inference. The authors mention that these pipelines are sensitive to hyper-parameters, hence their method could be more robust in some cases, but do not really demonstrate this in their paper. Based on this, I give a final score of borderline accept.
Thank you for your thoughtful feedback and for recognizing the contributions of our methodology. We appreciate your concerns about the real-life applicability of our method compared to established pipelines.
While existing pipelines like FreeSurfer and FastSurfer can generate pGT surfaces in a pre-processing step, our method offers more than just time savings. It aims to reduce dependency on pre-computed pGT surfaces, which is beneficial in cases where varying data types or acquisition protocols might lead traditional pipelines to fail or produce suboptimal results.
In our preliminary experiments, we observed that FreeSurfer, which is optimized for adult MRIs, does not perform well on infant brain images in BCP data. Our method, leveraging robust segmentation results, which can be obtained by well-established tools (e.g., pre-trained segmentation models or SynthSeg), can handle such diverse datasets effectively and reconstruct cortical surfaces with high accuracy and desired topology. This application setting demonstrates our method’s flexibility and potential for handling data with varying qualities and types.
We acknowledge that additional experiments, particularly cross-dataset validations, are needed to fully substantiate these claims, which is a current limitation of our paper. However, the promising results on both adult and infant datasets provide us with confidence in the practical advantages of our method. We are committed to exploring these aspects further and will provide more comprehensive evidence in future work.
Thank you again for your valuable feedback and for considering our paper.
We thank all reviewers for their efforts in reviewing our paper and providing comments.
1. Motivation (Reviewer Guru)
- Conventional pipelines involve multiple processing steps, leading to lengthy processing time.
- Each pipeline requires meticulously tuned parameters, posing challenges for generalization across diverse data domains, age groups, or acquisition protocols.
- DL-based methods have improved efficiency but rely on pGT surfaces from conventional pipelines, increasing computational cost of training data preparation. We will highlight all these aspects in revision.
2. Contribution (Reviewers Guru, DDcT, zPL4, and LNir)
- Weak Supervision Framework: We propose a novel weakly supervised learning approach that leverages pGT segmentations instead of relying on fully annotated surfaces. This approach reduces the dependence on labor-intensive data preparation and addresses issues with conventional methods, such as difficulty in generating accurate pGT surfaces for certain datasets like infant MRIs.
- Loss Functions: In addition to the uni-directional boundary loss function, we introduce other novel loss functions (e.g., intensity gradient loss) to handle the specific challenges of CSR, particularly the PVE and the inability to capture deep cortical sulci. These loss functions help ensure accurate and high-quality surface prediction.
- Regularization Terms: We incorporate additional regularization terms that contribute to the accurate prediction and smoothness of cortical surfaces, enhancing the overall quality of the reconstructed surfaces.
- Evaluation: We conduct extensive experiments on 2 large-scale adult brain MRI datasets and 1 infant brain MRI dataset. Our new method achieves comparable or superior performance compared to existing DL-based methods.
3. Computation and time efficiency (Reviewers Guru, DDcT, and LNir)
Please refer to tables in my responses "A4 to Reviewer Guru" and "A6 to Reviewer DDcT".
4. Weights of loss functions (Reviewers Guru and DDcT)
Please refer to my response "A5 to Reviewer Guru".
5. Impact of Segmentation results (Reviewers DDcT, and LNir)
Please refer to my responses "A5 & A8 to Reviewer DDcT".
Below are my additional responses to Reviewer LNir in case my comments cannot be displayed.
C11: The ODE solver has T=5 steps. Such a large step size could cause unstable solutions and SIF. Report the Lipschitz constant to examine the numerical stability of ODE solver.
A11: Thanks. CF++ uses the rule of thumb h < 1/Lip(v) and checks that it holds for all considered examples; cortexODE ensures hL < 1, where h = 1/T is the step size and L is the Lipschitz constant. In practice, their results using T=10 are satisfying. Compared to the deformation length from their initial surfaces, our method starts from the midthickness surface, which reduces the need for a large T. We also conducted a preliminary experiment using T=10; the performance is saturated compared to T=5. Thus, we empirically choose T=5 to strike a balance between efficiency and efficacy.
We will include these results and discuss the implications of the Lipschitz constant for the ODE solver's stability.
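For illustration, a crude diagnostic along these lines could estimate an empirical Lipschitz constant of a sampled velocity volume via finite differences and check the h·L < 1 criterion; this sketch is an assumption-laden approximation, not the actual check used by CF++ or cortexODE:

```python
import torch

def empirical_lipschitz(velocity: torch.Tensor, spacing: float = 1.0) -> float:
    """Crude finite-difference estimate of Lip(v) for a velocity volume of shape
    (3, D, H, W): the max absolute one-sided difference along each spatial axis.
    A rough diagnostic only, not a rigorous bound."""
    estimates = []
    for d in (1, 2, 3):  # spatial axes
        n = velocity.shape[d]
        diff = velocity.narrow(d, 1, n - 1) - velocity.narrow(d, 0, n - 1)
        estimates.append((diff.abs() / spacing).max().item())
    return max(estimates)

# With T steps, h = 1 / T; forward Euler is expected to stay stable when h * L < 1:
# L = empirical_lipschitz(pred_velocity[0])
# assert (1.0 / 5) * L < 1.0
```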
C12: Report SegCSR’s runtime 0.37s/hemisphere. Topology correction time should be included. A breakdown of runtime should be reported and compared to SOTA methods.
A12: Summary of the breakdown of runtime in inference.
| Time (s) | DeepCSR | 3D U-Net | CF++ | cortexODE | CoCSR | Ours |
|---|---|---|---|---|---|---|
| Pre | \ | \ | \ | 2.93 | 2.93 | 2.93 |
| Main framework | 125 | 0.81 | 1.84 | 0.49 | 0.22 | 0.24 |
| Post | \ | 0.14 | \ | \ | \ | \ |
For our SegCSR, the reported 0.37s includes MC and network forward propagation. The topology correction takes 2s and segmentation map generation takes 0.8s.
C13: More details on the computational efficiency and runtime comparisons with existing CSR pipelines?
A13: Please refer to my responses “A4 to Reviewer Guru” and “A6 to Reviewer DDcT”.
C14: How does the proposed boundary surface loss function improve upon existing bi-directional Chamfer loss?
A14: The major difference is in the pial surface reconstruction (Eq. 3 & Fig. 2-c). Using the traditional bi-directional Chamfer loss, the model overfits to the noisy pGT segmentation boundary (Fig. 2-c1, orange) and fails to deform into deep sulcal regions. In contrast, using our uni-directional Chamfer loss, the model drags the pial surface towards the deep sulci. With the help of the other loss terms and regularization, the model finds a balance and addresses the PVE of pGT segmentations (Fig. 3-d vs. c). We will rectify the typo in Eq. 3 and explain further.
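A minimal sketch of the uni- vs. bi-directional Chamfer building block in PyTorch; which direction Eq. 3 keeps is specified in the paper, so the usage lines below are illustrative only:

```python
import torch

def one_sided_chamfer(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Mean squared distance from each point in src (N, 3) to its nearest
    neighbor in dst (M, 3). The bi-directional Chamfer loss sums both
    directions; the boundary loss keeps only one of them for the pial surface
    so the mesh is not pinned to the noisy, PVE-affected boundary."""
    d2 = torch.cdist(src, dst) ** 2  # (N, M) pairwise squared distances
    return d2.min(dim=1).values.mean()

# pred: points sampled on the deformed pial surface; target: pGT boundary points.
# loss_bi  = one_sided_chamfer(pred, target) + one_sided_chamfer(target, pred)
# loss_uni = one_sided_chamfer(target, pred)  # illustrative; direction per Eq. 3
```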
C15: Limitations (1) SegCSR depends on pGT segs. More discussion on low-quality segmentations. (2) The inter-mesh consistency might affect the anatomical fidelity of pial surfaces. More exploration on the trade-off. (3) The method could be tested on more diverse cohorts to show its efficacy across various imaging qualities and subject demographics.
A15: We will add more discussion on limitations in Sect. 5.
(1) Please refer to our responses “A5 & A8 to Reviewer DDcT”.
(2) Please refer to our response “A6” above. And we will add results to Supplementary Materials.
(3) Please refer to our response “A4 to Reviewer zPL4”.
C16: Ethics review needed since involving human subjects
A16: We review and conform with the NeurIPS Code of Ethics. We want to highlight:
- We use well-established datasets from other resources with their consent. We did not perform experiments on human subjects directly.
- These datasets should have gone through IRB approvals in the corresponding institutions (Mayo Clinic and UNC). We are not responsible or capable of conducting extra reviews.
- In Sect. 5, we have clarified the Societal Impact and stressed that the deployment of the model in clinical settings should be approached with caution and under human supervision.
This paper received mixed reviews. While some reviewers appreciated parts of the paper, the low empirical performance compared to existing supervised methods was a concern. Moreover, doubt was expressed regarding the practical utility of the method, given in part that the processing needed to apply state-of-the-art methods (with significantly higher performance) can be done offline to enable better but more costly solutions, whereas fast alternatives of similar (lower) quality already exist.
Adding to this, the most important new ideas in the paper are already present in
Ma, Q., Li, L., Robinson, E.C., Kainz, B. and Rueckert, D. Weakly Supervised Learning of Cortical Surface Reconstruction from Segmentations. MICCAI 2024
which significantly lowers the novelty of the paper.
As a result, I cannot recommend acceptance of this paper at NeurIPS 2024.