PaperHub
Overall rating: 6.1/10
Poster · 4 reviewers
Ratings: 3, 3, 3, 4 (min 3, max 4, std 0.4)
ICML 2025

Introducing 3D Representation for Dense Volume-to-Volume Translation via Score Fusion

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Training 3D volume translation models is hard for inverse problems that require high accuracy, such as super-resolution. We effectively introduce 3D representation by ensembling 2D models' results with a 3D model.

Abstract

Keywords
Diffusion models, 3D medical image generation, video generation

Reviews and Discussion

Review
Rating: 3

The authors present Score-Fusion, a volumetric translation model that learns 3D representations by ensembling perpendicularly trained 2D diffusion models in score-function space. It can reduce the computational cost and data demand of 3D training, and the results are comparable across different downstream tasks. However, limitations exist.

Questions for Authors

  1. About generalizability: I have mentioned some of my concerns under “Claims And Evidence”; please respond to them.
  2. The network architecture is not clear to me, although you mention it is UNet-like. I don’t think you need to describe every parameter you set in this network, but the current description is apparently not enough.
  3. “To further enhance the speed of multi-modality fusion, we employ a smaller variant of our model, adjusting the number of channels in each layer.” I would like to learn more about this.
  4. “For instance, DDMM-Synth (Li et al., 2023) suggested using both MRI and low-resolution CT scans to produce high resolution CT images. Training a separate model for each possible combination of input conditions would result in exponential time complexity, making it generally impractical …” I maintain my opinion regarding the discussion here. T2 MRI -> T1 MRI translation and T2 MRI -> CT translation are VERY different, as MRI and CT have different physical properties. T2 -> T1 stays within the same image modality, whereas MRI -> CT crosses two different image modalities. From my understanding, the statement made by Li et al. 2023 was mainly because you have to obtain at least some CT physics information so that you can create an “accurate” high-resolution CT image that won’t impact the diagnosis. For example, it is possible that some tiny lesions could be identified by the CT but not by the MRI. Without providing ANY CT information, the translation results are useless in terms of diagnosis.

Claims and Evidence

I have concerns about the claims, since the authors make very broad claims about “medical images” in general.

  1. A concern is that the experiments were done on the brain only. The brain is well known to be relatively the best structured and cleanest among all organs, compared to the breast, liver, prostate, etc.
  2. The experiments were done on a single medical imaging modality, MRI only, without any evidence from CT, ultrasound, mammography, etc.
  3. The authors ignore the physical limitations of medical imaging. They seem to treat all medical imaging as isotropic or nearly isotropic. However, many MRI sequences do not produce isotropic results; consider prostate MRI. T2-TSE prostate MRI suffers from a 320x320x20 dimensionality, which means higher resolution in the x-y plane (320x320) and much lower resolution (20x320) in both the x-z and y-z planes. In this case, I doubt whether the authors’ methodology still works. Therefore, although the authors have experimented on two different downstream tasks, the organ, image modality, and image physics limitations leave me concerned about the generalizability of the proposed method and the validity of the claims made by the authors.

Methods and Evaluation Criteria

The concerns are stated together under “Claims and Evidence”.

Theoretical Claims

Yes, about diffusion models and model fusion.

Experimental Design and Analysis

Yes.

Supplementary Material

Yes, technical details like ablation studies, etc.

Relation to Broader Literature

A more efficient way of using 2D diffusion models to mimic 3D diffusion models' results, specifically for the brain and for MRI.

Missing Essential References

No

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Dear Reviewer XUPB,

Thank you for your detailed and constructive feedback. Here is our response to the raised questions and concerns:

Weakness (1-2): Generalizability beyond brain MRI:

Thank you for your insightful review. We highly agree that results on datasets beyond brain MRI can significantly strengthen our paper. We evaluated our method on sparse-view reconstruction with CT data:

Here are the quantitative results for sparse-view CT:

| Model | PSNR | SSIM | MMD |
|---|---|---|---|
| TPDM | 33.78 | 0.922 | 58.2 |
| Ours-TPDM | 34.53 | 0.931 | 20.78 |

We perform the sparse-view reconstruction task on the CHAOS dataset [1] with 36 views. The CHAOS dataset contains CT/MRI images of the liver. Each CT volume has 78-289 slices of resolution 512x512. During training, we cropped the third axis to a common shape of 64. During inference, we adopted a simple sliding-window strategy along the third axis. We resized each CT slice to 256x256, following TPDM [2]. These results on CT data demonstrate our method's effectiveness beyond MRI, showing the generalizability of Score-Fusion to another modality (CT) and to a different organ (the liver). They help support the claim of “medical image” in the title.
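For concreteness, such a sliding-window pass over the depth axis could look like the following minimal sketch (window size 64 as in training; the stride, the averaging of overlapping windows, and `model` are our illustrative assumptions, not the exact inference code):

```python
import torch

def sliding_window_depth(volume, model, window=64, stride=32):
    """Run a model trained on depth-64 crops over a full-depth CT volume.

    volume: (H, W, D) tensor with D >= window; overlapping windows
    along the third axis are averaged into the final output.
    """
    d = volume.shape[-1]
    out = torch.zeros_like(volume)
    count = torch.zeros(d)
    starts = list(range(0, d - window + 1, stride))
    if starts[-1] != d - window:        # make sure the trailing slices are covered
        starts.append(d - window)
    for s in starts:
        out[..., s:s + window] += model(volume[..., s:s + window])
        count[s:s + window] += 1
    return out / count                  # count broadcasts over the depth axis
```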

We acknowledge that there are many other modalities, such as ultrasound, and other organs, such as the breast. Results on these datasets would further strengthen our claims. Unfortunately, given the limited time during the rebuttal, we are unable to include even more dataset results, and we leave these applications to future work.

However, we believe it is common practice to use two modalities, especially CT and MRI, for papers targeting volumetric medical image translation, following multiple previous works, including TPDM [2], ScoreSDE [3], DiffusionMBIR [4], etc.

[1] CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge Data, A.E. Kavur et al

[2] TPDM.

[3] Solving Inverse Problems in Medical Imaging with Score-Based Generative Models, Song et al.

[4] Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models, Chung et al.

Weakness 3: Non-isotropic medical imaging:

As mentioned above, our new experiments on CT data are trained on 256x256x64 volumes due to the nature of the CHAOS dataset. Due to the time limit of the rebuttal, we have not yet obtained results on the fastMRI dataset, where the resolution is 320x320x20. We plan to include the fastMRI results in the second round of the rebuttal, and we hope they can still be considered.

As an extension, we also tried Score-Fusion for video super-resolution. We used MGLD-VSR [5] as the 3D model and two perpendicular models on the time-space slices (i.e., (x-t) slices and (y-t) slices). Due to the capacity requirements of the video model, we use only 5 consecutive frames in this problem, resulting in a 1280x960x5 or 2560x1920x5 resolution for the volume. This can potentially help address the concern about extreme resolutions. Below are the quantitative results on the VideoLQ dataset [6]. We use DOVER [7] as a metric, a spatial-temporal metric for video quality assessment.

| Model | DOVER (↑) |
|---|---|
| MGLD | 0.748 |
| Ours-MGLD | 0.755 |
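To make the time-space slicing concrete, extracting the (x-t) and (y-t) slices from a short clip could look like this (a sketch under our assumptions; names and shapes are ours, and a channel dimension would still be added before feeding the 2D models):

```python
import torch

def time_space_slices(clip):
    """clip: (T, H, W) grayscale clip, e.g. T=5 consecutive frames.

    Returns the two perpendicular slice batches: one (T, W) slice per
    row y for the (x-t) model, and one (T, H) slice per column x for
    the (y-t) model.
    """
    xt = clip.permute(1, 0, 2)  # (H, T, W): fix y, vary x and t
    yt = clip.permute(2, 0, 1)  # (W, T, H): fix x, vary y and t
    return xt, yt
```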

Although the video results might be beyond the scope of the rebuttal, we would be glad if these preliminary results could help address the concern and could be included in the current paper.

[5] Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

[6] Investigating tradeoffs in real-world video super-resolution

[7] Exploring Video Quality Assessment on User-Generated Contents from Aesthetic and Technical Perspectives

Question 1:

See Weaknesses 1-2 above.

Question 2-3:

Network architecture: We have already included a detailed model architecture in Supplementary Material section F and Tables 10, 11, 12. We will emphasize this by pointing to the Supplementary Material section in the experiment section of the main paper for clarity.

Question 4:

We acknowledge that many modality translation settings are not directly useful in the medical imaging domain, such as MRI to CT, where MRI lacks certain information due to physical properties, as the reviewer mentioned.

However, from a multi-modal learning perspective, efficiently training a multi-modal model by fusing knowledge from multiple single-modal models can be useful, given the difficulty of training a foundation model. We will revise this part of the discussion in the final version of the paper.

Also, we want to emphasize that our problem setting (i.e., low-resolution FLAIR + T1ce => high-resolution FLAIR) does not have the issue that the reviewer mentioned.

We hope the above rebuttal addresses your concerns. Again, we would like to express our sincere gratitude for taking the time to review our paper and provide insightful questions.

Review
Rating: 3

In this work, the authors focus on using diffusion models for 3D volume-to-volume translation tasks such as super-resolution and modality translation. Since 3D volumes (characteristic of medical data) are too large to run a diffusion model on directly, the authors first train two diffusion models on perpendicular axes. Then, they train a 3D diffusion model that fuses both views to reconstruct the desired volume.
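For illustration, the slice-wise mechanics of applying two perpendicular 2D score models to a 3D volume might look like the following minimal sketch (PyTorch; `score_xy`, `score_yz`, and the fixed averaging weight are placeholders, not the authors' exact architecture, which learns the fusion instead of fixing it):

```python
import torch

def perpendicular_scores(volume, t, score_xy, score_yz, lam=0.5):
    """Fuse two perpendicular 2D score estimates for a 3D volume.

    volume: (D, H, W) noisy volume at diffusion step t (t broadcasts
    over the slice batch); score_xy / score_yz: 2D score models over
    batches of (1, H, W) and (1, D, H) slices respectively.
    """
    # Axial (x-y) slices: depth D becomes the batch dimension.
    s_xy = score_xy(volume.unsqueeze(1), t).squeeze(1)            # (D, H, W)
    # Perpendicular (y-z) slices: width W becomes the batch dimension.
    perp = volume.permute(2, 0, 1).unsqueeze(1)                   # (W, 1, D, H)
    s_yz = score_yz(perp, t).squeeze(1).permute(1, 2, 0)          # (D, H, W)
    # A fixed average here; Score-Fusion learns this combination.
    return lam * s_xy + (1.0 - lam) * s_yz
```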

Questions for Authors

None.

Claims and Evidence

Yes

Methods and Evaluation Criteria

  • Yes, the authors conduct experiments on tasks like super-resolution and modality translation, which are common tasks in medical volume-to-volume translation.
  • They evaluate on metrics like PSNR, SSIM, and FID, which are standard.
  • They evaluate both tasks on two datasets: BraTS and HCP.
  • The authors also conduct a downstream task (medical segmentation) to justify the superior quality of their generated volumes.

Theoretical Claims

N/A

Experimental Design and Analysis

  • Good experiments. See the Methods section above for more details.
  • Is there a reason why the authors did not consider an LDM for the diffusion framework? DDPM-based models have image-size constraints in 2D (the authors had to use a 192x192 size, which is small).
  • The standard deviations and t-tests of the results were not provided; hence, it is unclear whether the quantitative performance is statistically significant.

Supplementary Material

Yes, reviewed all.

Relation to Broader Literature

This work relates to practical medical volume generation. Existing works in the literature focus on 2D slices, which have limited applicability in real settings. Since the authors' work is in 3D, it has potential for real-world usage.

Missing Essential References

N/A

Other Strengths and Weaknesses

My score is reduced mainly due to the comments in the Experiments section. If the authors can address these, then I can consider increasing the score after discussing with the other reviewers as well.

Other Comments or Suggestions

N/A

Author Response

Dear Reviewer vsot,

Thank you for your insightful review of our work. Here is our response:

Experimental Question 1, LDM for diffusion framework:

The reason we are not using an LDM is related to our task setting.

Intuitively, the main aim of super-resolution is to recover/generate high-frequency detail from a lower-resolution image. In LDMs, high-frequency details are mainly generated by the autoencoder, which may result in worse super-resolution performance. Therefore, we chose DDPM, in which high-frequency detail is carefully modeled by the diffusion process, over LDM, although LDM could potentially improve computational efficiency.

Empirically, we find that 3D LDM autoencoders are usually unreliable. At an early stage of this project, we tried a 3D LDM using Medical Diffusion [1]. Although the model can reconstruct images with semantically correct and realistic results, the reconstruction error is already quite large, resulting in a reconstruction PSNR of roughly 25 (we measured 26.23 on a subset of the validation set of the BraTS dataset). We do acknowledge that there are more advanced 3D LDMs recently, e.g., MAISI [3], but they rely on much more data (more than 20 datasets combined for MAISI) to train a reliable autoencoder, which would make for an unfair comparison between models. In addition, if we used perpendicular 2D LDMs, their diffusion processes would run in different perpendicular 2D latent spaces, which makes fusing the 2D scores difficult. Therefore, we do not use 2D LDMs such as Make-A-Volume [2].
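For reference, the reconstruction PSNR quoted here is the standard autoencoder round-trip measure; a minimal sketch (`autoencoder.encode`/`decode` are placeholder names, and the data range is assumed normalized to [0, 1]):

```python
import torch

def reconstruction_psnr(x, autoencoder, data_range=1.0):
    """PSNR between a volume x and its autoencoder reconstruction.

    A value around 25 means the latent round trip alone already
    discards high-frequency detail, before any diffusion step runs.
    """
    with torch.no_grad():
        x_hat = autoencoder.decode(autoencoder.encode(x))
    mse = torch.mean((x - x_hat) ** 2)
    return 10.0 * torch.log10(data_range ** 2 / mse)
```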

Another reason is that, to the best of our knowledge, Score-Fusion is the first work to explore a learning-based method for score-function fusion. We believe that using a more standard, vanilla version of the diffusion model provides more solid and generalizable insights into score-function fusion.

In future work, we may consider extending Score-Fusion to latent diffusion models in a text-to-image setting, where semantic information is better supported by the LDM. We will also include the negative results for the 3D LDM and this discussion in the future version of our paper. Thank you for your insightful review.

[1] Medical Diffusion: Denoising Diffusion Probabilistic Models for 3D Medical Image Synthesis.

[2] Make-A-Volume: Leveraging Latent Diffusion Models for Cross-Modality 3D Brain MRI Synthesis.

[3] MAISI: Medical AI for Synthetic Imaging

Experimental Question 2, Statistical significance:

We have already included the standard deviations in Table 5 of the Supplementary Material, confirming that our improvements are statistically significant. In the future version of our paper, we will present these results more directly in the experiments section for clarity.
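For the t-test the reviewer requested, a paired test over per-volume metrics is the natural choice; a minimal sketch with SciPy (the PSNR values below are dummy placeholders purely to make the sketch runnable, not our results):

```python
from scipy import stats

# Dummy per-volume PSNR values; in practice both lists come from
# evaluating the two models on the same held-out volumes.
psnr_baseline = [31.2, 32.8, 30.5, 33.1, 31.9]
psnr_ours = [32.0, 33.5, 31.1, 33.9, 32.4]

t_stat, p_value = stats.ttest_rel(psnr_ours, psnr_baseline)
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.4f}")
```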

Again, we appreciate your recognition of our work's potential for real-world applications. We would like to express our sincere gratitude for taking the time to review our paper.

Review
Rating: 3

The authors study medical volume-to-volume translation, presenting Score Fusion, a 3D volumetric translation model. The model is based on a fine-tuning process, which starts from an average of 2D models. The method is tested on multiple tasks on two medical datasets, being compared with a number of approaches.

给作者的问题

N/A

Claims and Evidence

The claims are supported by empirical evidence.

Methods and Evaluation Criteria

The proposed solution is reasonable and efficient.

Theoretical Claims

N/A.

Experimental Design and Analysis

It is typical to perform super-resolution at multiple scales. This should be performed by the authors. Aside from this, I did not find any flaws in the experiments.

Supplementary Material

Yes, read it all.

Relation to Broader Literature

The topic is interesting and the method is timely.

Missing Essential References

There are some relevant references on medical image-to-image translation that are not acknowledged, e.g., [A, B].

[A] Haimour, Fatima, Rizik Al-Sayyed, Waleed Mahafza, and Omar S. Al-Kadi. "Bidirectional brain image translation using transfer learning from generic pre-trained models." Computer Vision and Image Understanding 248 (2024): 104100.

[B] Ristea, Nicolae-Cătălin, Andreea-Iuliana Miron, Olivian Savencu, Mariana-Iuliana Georgescu, Nicolae Verga, Fahad Shahbaz Khan, and Radu Tudor Ionescu. "CyTran: A cycle-consistent transformer with multi-level consistency for non-contrast to contrast CT translation." Neurocomputing 538 (2023): 126211.

Other Strengths and Weaknesses

The paper is mostly easy to follow.

Other Comments or Suggestions

Section titles are not consistently capitalised.

Author Response

Dear Reviewer AnYS,

We thank you for your feedback noting that “the topic is interesting and method is timely” and for providing constructive suggestions for our paper.

Experimental Question 1, multi-scale SR:

Thanks for raising this point. We agree that multi-scale super-resolution is common practice. To address your concern and demonstrate generalizability across super-resolution scales, we performed experiments at 2x and 8x SR. The quantitative metrics are below. (The 4x results are the original results presented in the paper.)

Quantitative Results for 2x Super-resolution:

| Model | PSNR | SSIM | MMD |
|---|---|---|---|
| TPDM | 33.22 | 0.929 | 23.48 |
| Ours-TPDM | 34.61 | 0.947 | 10.67 |

Quantitative Results for 8x Super-resolution:

| Model | PSNR | SSIM | MMD |
|---|---|---|---|
| TPDM | 28.45 | 0.903 | 69.17 |
| Ours-TPDM | 29.38 | 0.924 | 46.44 |

We can see that our model consistently outperforms the baseline on all three metrics across multiple super-resolution scales, demonstrating generalizability to different scales.

Question 2, Missing references:

We appreciate your suggestions [A, B] and will incorporate these references in our final version.

Question 3, Typos:

Thank you for pointing this out. We will ensure consistency in section title capitalization in the final version and also check for other typos.

Again, we greatly appreciate your time and effort in reviewing our work and providing positive feedback and constructive advice.

Reviewer Comment

The authors have addressed the raised concerns.

Author Comment

Dear Reviewers,

Thank you again for your time, constructive feedback, and thoughtful evaluations. We truly appreciate your acknowledgment that our rebuttal and updates have addressed the concerns raised. We're also grateful for your positive assessments and support of our work throughout the review process.

Warm regards,

Authors

Review
Rating: 4

The paper proposes to improve 3D representation learning of medical image volumes in diffusion models. Unlike earlier methods that ensemble 2D models by averaging their weights, the proposed method performs “fusion in score function space”, which is shown to improve performance in downstream tasks such as segmentation. Additionally, ensembling in score-function space is shown to make training large 3D diffusion models feasible, by initializing from pretrained 2D models and then fine-tuning the 3D model. The synthetic generation quality is also shown to be better, both via qualitative examples and quantitative image-quality metrics.

Questions for Authors

To reiterate: additional downstream tasks, such as sparse-view reconstruction and other volume-to-volume translation tasks, could have been added to substantially strengthen the claim of better representation learning.

Claims and Evidence

Yes; compared to multiple baselines, the paper shows improved representation learning and, subsequently, improved performance on the downstream tumor segmentation task.

Methods and Evaluation Criteria

Yes

Theoretical Claims

n/a

Experimental Design and Analysis

Yes, the experiment design and analysis look sound.

Supplementary Material

n/a

Relation to Broader Literature

The paper goes a step further in merging diffusion models to obtain better 3D representation learning from pretrained 2D models. Whereas earlier methods simplify learning the 3D data distribution by factorizing it into a product of 2D distributions, the proposed method ensembles the estimates from the 2D models, with a weight vector and residual terms as learnable parameters for fine-tuning.
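A plausible form of this parameterization, as we read the summary above (notation ours; the precise formulation is in the paper):

```latex
s_\theta^{\mathrm{3D}}(x_t, t) \;=\; \sum_i w_i \, s_i^{\mathrm{2D}}(x_t, t) \;+\; r_\theta(x_t, t)
```

where the $s_i^{\mathrm{2D}}$ are the scores of the perpendicularly trained 2D models, the $w_i$ are the learnable fusion weights, and $r_\theta$ is a learnable 3D residual refined during fine-tuning.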

Missing Essential References

n/a

Other Strengths and Weaknesses

The paper is very well written and easy to follow with good synthesis of literature.

The contribution is well motivated and shows good synthesis performance and improvement on downstream tasks. The possibility of training full-capacity 3D diffusion models, without resorting to compromises such as diffusion in latent space rather than full 3D volume space, is very enticing.

In the downstream BraTS tumor segmentation task, it is advisable to report challenge-specific metrics such as the lesion-wise Dice score (rather than the Dice score over the entire image), which penalizes not only poor overlap but also heavily penalizes missing lesions. Additionally, the 95% Hausdorff distance may also be reported.

Additional downstream tasks, such as sparse-view reconstruction and other volume-to-volume translation tasks, could have been added to substantially strengthen the claim of better representation learning.

Other Comments or Suggestions

n/a

Author Response

Dear Reviewer ELBW,

We sincerely thank you for taking the time to review our paper and for your insightful and positive assessment, noting that "the paper is very well written and easy to follow with good synthesis of literature" and "the contribution is well motivated." Here is our response to your questions and reviews.

Weakness 1: challenge-specific metrics for tumor segmentation:

We have added lesion-wise Dice scores and 95% Hausdorff Distance (HD95) metrics below. Following [1], we set thresholds (150, 250, and 100 for TC, WT, and ET) to remove small predicted segmentations across all the evaluations.
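The small-segmentation filtering could be implemented with connected components, e.g. (a sketch assuming the thresholds are voxel counts over binary sub-region masks; not the exact challenge code):

```python
import numpy as np
from scipy import ndimage

def remove_small_lesions(mask, min_voxels):
    """Zero out connected components smaller than min_voxels.

    mask: binary 3D array for one tumor sub-region (TC, WT, or ET);
    min_voxels: 150, 250, or 100 respectively, following [1].
    """
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=list(range(1, n + 1)))
    keep_ids = 1 + np.flatnonzero(np.asarray(sizes) >= min_voxels)
    return mask * np.isin(labels, keep_ids)
```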

Both Condition Results:

| Model | LW Dice TC | LW Dice WT | LW Dice ET | HD95 TC | HD95 WT | HD95 ET |
|---|---|---|---|---|---|---|
| TPDM | 0.779 | 0.689 | 0.742 | 18.107 | 18.937 | 32.011 |
| Ours-TPDM | 0.783 | 0.692 | 0.745 | 17.864 | 18.790 | 32.046 |
| TOSM | 0.779 | 0.677 | 0.745 | 18.287 | 20.223 | 32.287 |
| Ours-TOSM | 0.781 | 0.698 | 0.748 | 17.428 | 17.982 | 30.638 |

Modality Translation Results:

| Model | LW Dice TC | LW Dice WT | LW Dice ET | HD95 TC | HD95 WT | HD95 ET |
|---|---|---|---|---|---|---|
| TPDM | 0.756 | 0.653 | 0.736 | 25.085 | 23.854 | 39.001 |
| Ours-TPDM | 0.760 | 0.685 | 0.751 | 19.193 | 21.788 | 36.925 |
| TOSM | 0.751 | 0.624 | 0.734 | 21.254 | 21.227 | 37.359 |
| Ours-TOSM | 0.758 | 0.671 | 0.744 | 20.837 | 22.406 | 35.755 |

Super Resolution Results:

| Model | LW Dice TC | LW Dice WT | LW Dice ET | HD95 TC | HD95 WT | HD95 ET |
|---|---|---|---|---|---|---|
| TPDM | 0.778 | 0.688 | 0.746 | 18.079 | 18.894 | 31.793 |
| Ours-TPDM | 0.780 | 0.692 | 0.745 | 18.389 | 18.157 | 33.842 |
| TOSM | 0.777 | 0.660 | 0.747 | 18.180 | 21.043 | 32.052 |
| Ours-TOSM | 0.778 | 0.690 | 0.746 | 17.953 | 18.589 | 31.888 |

Our methods show improvements over the baselines in most cases, with a particularly notable gain for WT lesion segmentation under the modality translation task. Moreover, the lower Hausdorff distances suggest that our method helps the segmentation model capture subtle lesion boundaries more accurately, indicating more precise morphological alignment. This further demonstrates that our model's generated results are friendlier to downstream segmentation models.

[1] Ferreira, A., et al. How We Won the BraTS 2023 Adult Glioma Challenge? Just Faking It! Enhanced Synthetic Data Augmentation and Model Ensemble for Brain Tumour Segmentation. arXiv preprint arXiv:2402.17317.

Weakness 2: sparse-view reconstruction in CT:

Thanks for raising this. We highly agree that sparse-view reconstruction in CT can substantially strengthen the claim of better representation learning. We evaluated our method on sparse-view reconstruction with CT data:

Quantitative results for sparse-view CT:

| Model | PSNR | SSIM | MMD |
|---|---|---|---|
| TPDM | 33.78 | 0.922 | 58.2 |
| Ours-TPDM | 34.53 | 0.931 | 20.78 |

We perform the sparse-view reconstruction task on the CHAOS dataset [1] with 36 views. The CHAOS dataset contains CT/MRI images of the liver. Each CT volume has 78-289 slices of resolution 512x512. During training, we cropped the third axis to a common shape of 64. During inference, we adopted a simple sliding-window strategy along the third axis. We resized each CT slice to 256x256, following TPDM [2].

These results on CT data demonstrate our method's effectiveness beyond MRI, show the generalizability of Score-Fusion to other modalities, and help support the claim of “medical image” in the title.

Again, we sincerely appreciate your thorough review and constructive suggestions, which have helped us improve our work.

[1] CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge Data, A.E. Kavur et al

[2] TPDM.

Reviewer Comment

Thank you for addressing the suggestions. My earlier score is already high, so I am keeping it as is.

Author Comment

Dear Reviewers,

We sincerely appreciate the time and effort you’ve dedicated to reviewing our paper. Thank you for recognizing that our responses have addressed your earlier concerns. We're also grateful for your positive evaluations and for maintaining your supportive scores. Thank you again for your thoughtful feedback and contributions to improving our work.

Best regards,

Authors

Final Decision

This paper initially received mixed reviews. After the rebuttal and reviewers' discussion, all reviewers provided positive ratings, as their concerns had been adequately addressed by the rebuttal. The meta-reviewer agreed with the reviewers' recommendation and made the final decision.