PaperHub · ICLR 2024 (withdrawn)

Med-Tuning: Parameter-Efficient Transfer Learning with Fine-Grained Feature Enhancement for Medical Volumetric Segmentation

Overall rating: 5.3/10 from 4 reviewers (scores: 5, 5, 6, 5; min 5, max 6, std 0.4) · Average confidence: 3.8
Links: OpenReview · PDF
Submitted: 2023-09-15 · Updated: 2024-03-26
TL;DR

In this paper, we present a study on parameter-efficient transfer learning for medical volumetric segmentation and propose a novel framework, Med-Tuning, based on intra-stage feature enhancement and inter-stage feature interaction.

Abstract

Keywords
Parameter-Efficient Transfer Learning, Medical Volumetric Segmentation, Brain Tumor, Kidney Tumor, Intra-stage Enhancement, Inter-stage Interaction

Reviews and Discussion

Review (Rating: 5)

The paper proposes a new technique for adapting pre-trained weights from (among others) vision transformer architectures that have been initially trained on natural images to 3D medical image segmentation.

Strengths

The idea in principle is interesting. The method explores a different approach to conventional fine-tuning by inserting specially designed adaptation layers. The approach in particular deals with the challenge of moving from 2D pre-trained weights to 3D problems and also discusses the additional gap that stems from network weights that were trained for global classification being applied to pixel-wise segmentation. The Fourier transform based module seems especially efficient and useful for larger context.

Weaknesses

The experimental validation is not appropriate because it performs insufficient comparison to SOTA. In particular, the “full” fine-tuning scheme (or the model trained from scratch) is not at all 3D aware but has to rely solely on 2D operations. The proposed approach, however, gains access to inter-slice dependencies by inserting additional convolution / transformer modules that act on the depth / third dimension. This leaves a wrong impression about the capabilities of the proposed model: e.g. for KiTS 19 the authors claim SOTA performance, but a simple 3D nnUNet (trained from scratch) reaches a far superior Dice score of 91.2% (composite), 97.4% (Kidney) and 85.1% (Tumour). The results for the Swin-UNet are even worse, which highlights the weakness of 2D models for these tasks. While the ablation on data efficiency is interesting, it appears that most models are trained for a surprisingly short time of only 0.17–1.34 hours. This indicates that the full model could not have converged, making the comparison rather unfair.

Questions

Explain the reasoning for omitting 3D nnUNet results and short training times. Add further ablations that demonstrate the motivation behind the Fourier module and the multi-scale concatenation of adaptation layers.

Comment

SOTA Comparison. In our state-of-the-art (SOTA) comparison, we aim to demonstrate the capabilities of Med-Tuning by comparing it with 2D full fine-tuning and other PETL methods. On one hand, we show that Med-Tuning excels in capturing multi-scale information and inter-slice details, outperforming 2D segmentation models. This is evident in the substantial Dice improvements: +1.01% (Kidney), +8.02% (Tumor), and +4.52% (Composite), as presented in Table 2. On the other hand, in terms of training time and tuned parameters, our Med-Tuning is generic and flexible, achieving a balance between training costs (i.e., time and tuned parameters) and accuracy, while training a 3D network like nnU-Net from scratch entails significant time and computational resources. Furthermore, across the current experimental findings, the Swin-Tiny-based model consistently outperforms the ViT-Base-based one, owing to the inherent advantages of the Swin Transformer. This suggests that the performance ceiling of our method can be raised by employing a more robust backbone network, such as Swin-Base or ViT-Large. In other words, our Med-Tuning is a promising PETL framework that can continue to advance with the progress of vision models, enabling researchers to conveniently transfer general-purpose large-scale models to better solve downstream tasks and ensuring continued benefits for the medical imaging community.

​ In the paper, we opted not to present the results of the 3D network, not as an attempt to evade comparison, but because we deemed it inappropriate to directly compare parameter-efficient transfer learning methods with 3D models. To further evaluate the generalization capability and superiority of our Med-Tuning, we conducted experiments covering additional body parts on another commonly used benchmark, the Medical Segmentation Decathlon (MSD) dataset, which has already been presented in Table 5 of our Appendix. We chose pre-trained Swin UNETR as a supplementary 3D medical baseline to demonstrate the effectiveness of our method relative to 3D models. The experimental results of three tasks (i.e., heart, lung, spleen) are listed in Table 5 of the Appendix. The same conclusion can be drawn from the table: our method consistently boosts model performance with less memory cost and training time on various body parts.

Comment

[Q2]. While the ablation on data efficiency is interesting, it appears that most models are trained for a surprisingly short time of only 0.17–1.34 hours. This indicates that the full model could not have converged, making the comparison rather unfair.

[A2]. We must clarify a significant misunderstanding, since we consider the judgment that "the full model could not have converged, making the comparison rather unfair" itself unfair. It is important to emphasize that our intent was not to shorten the training phase to create under-converged fine-tuned models for an unfair comparison that would artificially favor our method. Such a comparison would be neither objective nor consistent with academic integrity. In Table 1 below, we present the segmentation accuracy of the full model across different training epochs. It is evident from the table that the performance of the model stabilizes after 250 epochs and even declines with much longer training (possibly due to overfitting). The short training durations listed in the ablation study on data efficiency (ranging from 0.17 to 1.34 hours) are purely a result of the inherently parameter-efficient nature of transfer learning and the reduced training data, not a manipulation on our part. Therefore, when comparing different settings in our ablation experiments, we believe it is both objective and fair to standardize the training to a fixed duration of 250 epochs.

Table 1: Ablation study on training epochs with ViT-B/16 backbone. The fine-tuning strategy is full fine-tuning, and the results are tested on the official BraTS2020 testing dataset.
——————————————————————————————————————————————————————————————————————————————————————————————
| Epochs | Time(h)| Dice(ET)| Dice(WT)| Dice(TC)| Hausdorff(ET)| Hausdorff(WT)| Hausdorff(TC)|
——————————————————————————————————————————————————————————————————————————————————————————————
| 50     |  0.34  |  64.88  |  83.04  |  73.38  |    52.96     |     7.42     |    29.75     |
| 100    |  0.68  |  67.11  |  84.97  |  75.24  |    41.36     |     7.28     |    20.38     |
| 150    |  1.02  |  67.40  |  85.63  |  75.11  |    41.19     |     7.47     |    24.18     |
| 250    |  1.71  |  69.12  |  85.90  |  75.29  |    34.43     |     7.32     |    17.10     |  (default)
| 350    |  2.38  |  67.83  |  85.76  |  75.86  |    47.48     |     7.61     |    20.85     |
| 550    |  3.72  |  68.15  |  85.98  |  75.75  |    45.96     |     7.57     |    17.55     |
——————————————————————————————————————————————————————————————————————————————————————————————

[Q3]. Add further ablations that demonstrate the motivation behind the Fourier module and the multi-scale concatenation of adaptation layers.

[A3]. For your convenience in reviewing, we have already elaborated on the reasoning for omitting 3D nnUNet results and the short training durations in our responses above. Additionally, we would like to re-emphasize the motivation behind the Fourier module and the multi-scale concatenation in our Med-Adapter: the FFT operation is employed for parameter-efficient global feature modeling, while the inter-stage concatenation preserves the feature representations of different stages as much as possible in a parameter-free manner. The ablation studies concerning these two designs have been presented in Sec. 4.3 of the main text. We believe the current design and extent of our ablation studies are already quite comprehensive, but we look forward to a timely discussion with you during the rebuttal period, which will enable us to promptly provide more detailed ablation experiments should they be required.
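For intuition, the following minimal sketch illustrates this kind of FFT-based global branch (illustrative only — the class name, shapes, and the learnable spectral filter are assumptions, not our exact Med-Adapter implementation):

```python
import torch
import torch.nn as nn

class FourierGlobalBranch(nn.Module):
    """Minimal sketch of an FFT-based global branch: modulating the
    feature spectrum with a learnable filter mixes information across
    all spatial positions without large kernels or attention maps."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # one complex weight per rFFT bin, stored as (real, imag) pairs
        self.filter = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -- 2D FFT over the spatial dimensions
        spec = torch.fft.rfft2(x, dim=(-2, -1), norm="ortho")
        spec = spec * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(spec, s=x.shape[-2:],
                                dim=(-2, -1), norm="ortho")
```

The parameter count scales with the number of frequency bins rather than with kernel size squared times channel pairs, which is the source of the parameter efficiency discussed above.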

Comment

Thank you sincerely for acknowledging the novelty and contributions of our paper and providing the positive feedback "The idea in principle is interesting. The method explores a different approach to conventional fine-tuning by inserting specially designed adaptation layers. The approach in particular deals with the challenge of moving from 2D pre-trained weights to 3D problems and also discusses the additional gap that stems from network weights that were trained for global classification being applied to pixel-wise segmentation. The Fourier transform based module seems especially efficient and useful for larger context". In the Response section below, we have provided detailed explanations to address your questions. We genuinely hope that our explanations and the supplementary experiments, conducted with our best efforts, will give us the opportunity to raise your evaluation of our work.

Our Responses to Paper Weaknesses:

[Q1]. The experimental validation is not appropriate because it performs insufficient comparison to SOTA. In particular, the “full” fine-tuning scheme (or the model trained from scratch) is not at all 3D aware but has to rely solely on 2D operations. The proposed approach, however, gains access to inter-slice dependencies by inserting additional convolution / transformer modules that act on the depth / third dimension. This leaves a wrong impression about the capabilities of the proposed model: e.g. for KiTS 19 the authors claim SOTA performance, but a simple 3D nnUNet (trained from scratch) reaches a far superior Dice score of 91.2% (composite), 97.4% (Kidney) and 85.1% (Tumour). The results for the Swin-UNet are even worse, which highlights the weakness of 2D models for these tasks.

[A1]. Distinctions from nnUNet. To be clear, this work differs greatly in motivation and methodology from previous SOTA methods (e.g., nnU-Net) that pair strong performance with excellent structural design. nnU-Net focuses on tailored adjustments beyond the network itself (e.g., data pre-processing and post-processing strategies), aiming to design a powerful pipeline for biomedical research. Our motivation, in contrast, is to explore the emerging capabilities of strong general-field models for medical volumetric segmentation. Our goal is not solely to push the SOTA but to provide a promising PETL framework that enables researchers to conveniently transfer general large models to better solve downstream tasks. Moreover, our method can also incorporate the non-network operations of nnU-Net (e.g., strong data preprocessing) to further enhance performance. Besides, although nnU-Net can achieve good performance using a simple UNet, it has several limitations. According to the authors of nnU-Net, due to a conflict with its core design principle of customizing the network topology for each dataset, nnU-Net cannot integrate large-scale pre-trained models and thus cannot benefit from the advanced capabilities of pre-trained backbones, restricting its potential. Furthermore, training a 3D network from scratch using nnU-Net entails significant time and computational resources. In contrast, our Med-Tuning is generic and flexible, achieving a balance between training costs (i.e., time and tuned parameters) and accuracy by leveraging effective pre-trained weights from the general image domain. With continuous advancements in visual foundation models, our method holds even greater potential by harnessing successively improved foundation models to benefit the medical imaging community in a sustainable fashion.
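To make the cost argument concrete, the sketch below shows the generic PETL setup this comparison rests on — freeze the pre-trained backbone and update only the adapters and decoder (a minimal illustration; the substring-based name matching is an assumption, not our actual code):

```python
import torch.nn as nn

def freeze_for_petl(model: nn.Module) -> None:
    """Hedged sketch of a generic PETL setup: freeze everything, then
    unfreeze only adapter and decoder parameters. Matching parameters
    by name substring is an illustrative assumption."""
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) or ("decoder" in name)
    tuned = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"tuned {tuned / 1e6:.2f}M of {total / 1e6:.2f}M parameters")
```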

Review (Rating: 5)

This paper studies how to adapt 2D pre-trained models to the 3D segmentation problem, particularly focusing on the medical domain. The authors claim to tackle the challenges of the modality gap and the task gap when using 2D models pre-trained on natural images, and propose Med-Tuning accordingly. The core techniques include decomposed convolutional layers, FFT/IFFT transformation and inter-layer feature interaction. The framework was evaluated on CT and MRI datasets including BraTS 2019, BraTS 2020 and KiTS 2019, and shown to outperform several baseline methods with fewer tuned parameters. Multiple architectures and pre-trained models were involved.

Strengths

  1. Adapting 2D pre-trained models on 3D medical tasks is an important and interesting task worth studying.
  2. The overall framework is technically sound and shows good performance.
  3. Experiments are exhaustive with multiple SOTA architectures and pre-trained datasets.

Weaknesses

  1. The actual contribution of this paper is not consistent with the claim of 2D->3D adaptation. As described in Section 3.3, this paper used simple reshaping to adapt 2D pre-trained models to 3D data. This is also a weak part of the proposed framework, as the solution did not consider the interaction between adjacent slices. There is already existing work investigating effective 2D->3D adaptation like [1].
  2. Lack of technical novelty. Using the FFT module to integrate global information is not a novel idea. It has been proposed in [2]. As analyzed in ablation studies, the FFT/IFFT module is an essential component of the framework. It’s not surprising that using a powerful module as the adapter results in decent performance.
  3. The review of related work should be improved.

[1] Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation. Machine Learning for Health (ML4H) 2022.

[2] Global Filter Networks for Image Classification. NeurIPS 2021.

Questions

  1. What does concentration mean in Figure 3? It is not mentioned in the figure caption or main text.
  2. How do you choose the value of m_{n+1} and why?
Comment

Thank you sincerely for acknowledging the novelty and contributions of our paper and providing the positive feedback "Adapting 2D pre-trained models on 3D medical tasks is an important and interesting task worth studying. The overall framework is technically sound and shows good performance. Experiments are exhaustive with multiple SOTA architectures and pre-trained datasets". In the Response section below, we have provided detailed explanations to address your questions. We genuinely hope that our explanations and the supplementary experiments, conducted with our best efforts, will give us the opportunity to raise your evaluation of our work.

Our Responses to Paper Weaknesses:

[Q1]. The actual contribution of this paper is not consistent with the claim of 2D->3D adaptation. As described in Section 3.3, this paper used simple reshaping to adapt 2D pre-trained models to 3D data. This is also a weak part of the proposed framework, as the solution did not consider the interaction between adjacent slices. There is already existing work investigating effective 2D->3D adaptation like [1].

[1] Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation. Machine Learning for Health (ML4H) 2022.

[A1]. We want to clarify that the comment "the proposed framework as the solution did not take into account the interaction between adjacent slices" misinterprets the content of our work and is not objective.

Although reference [1] employs a weight inflation strategy to transition pre-trained Transformers from a 2D to a 3D context, preserving the advantages of both transfer learning and depth information, our method can also utilize the volumetric correlations between slices and spatial multi-scale features to achieve accurate 3D image segmentation. In particular, beyond employing basic reshaping operations on the 3D data at the input stage, we have incorporated 3D volumetric and spatial multi-scale designs within our Med-Adapter to facilitate the learning of interactions among neighboring slices and spatial pixels in the 3D data. We employ an Adapter-based parameter-efficient transfer learning strategy, which differs markedly in its implementation details from the approach outlined in paper [1].

Moreover, the outcomes presented in Table 1, Table 2, and Table 9 show a clear performance improvement for our proposed method (2D model + reshape + Med-Adapter, capturing adjacent-slice interactions and spatial multi-scale features) over full fine-tuning (2D model + reshape). Notably, our method not only demonstrates enhanced segmentation performance but also requires a greatly reduced number of trainable parameters. These results fully demonstrate the ability of our method to effectively learn crucial information between adjacent slices of 3D data as well as multi-scale features.
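For clarity, the slice-to-batch reshaping we refer to works along the following lines (a minimal sketch with illustrative function names, not our exact implementation):

```python
import torch

def volume_to_slices(x: torch.Tensor) -> torch.Tensor:
    """(B, C, D, H, W) -> (B*D, C, H, W): depth slices become batch
    items so a 2D pre-trained backbone can process them directly."""
    b, c, d, h, w = x.shape
    return x.permute(0, 2, 1, 3, 4).reshape(b * d, c, h, w)

def slices_to_volume(x: torch.Tensor, b: int, d: int) -> torch.Tensor:
    """Inverse reshape, recovering the depth axis so 3D operations in
    the adapter can model inter-slice correlations."""
    bd, c, h, w = x.shape
    return x.reshape(b, d, c, h, w).permute(0, 2, 1, 3, 4)
```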

Considering the comprehensiveness of our related work section, we will incorporate this paper [1] into our references and improve the related work section.

[Q2]. Lack of technical novelty. Using the FFT module to integrate global information is not a novel idea. It has been proposed in [2]. As analyzed in ablation studies, the FFT/IFFT module is an essential component of the framework. It’s not surprising that using a powerful module as the adapter results in decent performance.

[2] Global Filter Networks for Image Classification. NeurIPS 2021.

[A2]. Our work does indeed share some relevance with GFNet [2], as both studies leverage the intrinsic global properties of the FFT. However, the use of traditionally powerful modules like the FFT does not imply a lack of innovation in our research. The focal points of GFNet [2] and our work in incorporating FFT/IFFT are quite distinct. GFNet [2] primarily emphasizes modifications to the overall model structure with FFT/IFFT, whereas our study focuses on the first exploration of FFT/IFFT in parameter-efficient fine-tuning for the medical volumetric segmentation task. The fundamental difference from GFNet [2] is that we essentially utilize the parameter efficiency of the FFT for global feature modeling, whereas GFNet [2] solely exploits its intrinsic global properties. In our proposed Med-Adapter, the FFT is introduced to replace large-kernel convolutions or attention mechanisms, creating a parameter-efficient global branch. In this context, FFT/IFFT serves as an important, yet partial, component of Med-Adapter. It is worth noting that we do not highlight this aspect as the main innovative element of our paper. Therefore, we believe that evaluating our work as lacking technical novelty solely based on our use of the FFT is not appropriate.

Comment

[Q3]. The review of related work should be improved.

[A3]. Thanks for this helpful suggestion. Per your advice, we have incorporated more comprehensive related works (including the given two references) to make this necessary improvement.

[Q4]. What does concentration mean in Figure 3? It is not mentioned in the figure caption or main text.

[A4]. We believe you are referring to concatenation, which combines the given feature representations along the channel dimension. Please refer to Equation 9 in the main text for more details on the concatenation operation.
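As a tiny illustrative example (the shapes are hypothetical):

```python
import torch

curr = torch.randn(2, 64, 8, 32, 32)   # current-stage features (B, C, D, H, W)
prev = torch.randn(2, 64, 8, 32, 32)   # previous-stage features
fused = torch.cat([curr, prev], dim=1)  # parameter-free, -> (2, 128, 8, 32, 32)
```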

[Q5]. How do you choose the value of m_{n+1} and why?

[A5]. We apologize for any confusion. The value of m_{n+1} is automatically determined by the choice of backbone model, rather than being a manually tuned hyperparameter. To illustrate, if we use the officially released Swin-Tiny as the encoder, then m_{n+1}=2 in this context. If we instead use the officially released ViT-B/16 as the encoder, we divide its 12 layers into 4 stages with 3 layers per stage, at which point m_{n+1}=3. We have removed this illustration from the manuscript, as Figure 3 is sufficient to show that the Inter-FI operation is introduced only at the end of each encoder stage.
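For intuition, the toy sketch below shows how the stage size follows from the backbone in the ViT-B/16 case (illustrative only):

```python
# ViT-B/16 has 12 Transformer blocks; grouping them into 4 equal
# stages fixes the stage size automatically (12 / 4 = 3), so it is
# not a tuned hyperparameter. Inter-FI runs only after each stage.
num_blocks, num_stages = 12, 4
m = num_blocks // num_stages  # -> 3 for ViT-B/16
stages = [list(range(i, i + m)) for i in range(0, num_blocks, m)]
print(stages)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```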

Comment

Thanks to the authors for their response, which partially addressed my concerns. I have also reviewed the other reviewers' comments. Currently, I still believe that the significance of this paper is somewhat below the high standard set by ICLR. The core technical contribution relies on key components, such as channel-wise separable convolution and FFT, which are existing techniques. Even though these components have previously been discussed mainly in general deep learning papers (e.g., backbone design), I do not believe that using them in this parameter-efficient transfer learning pipeline presents a significant technical challenge or deep insight. Therefore, I will maintain my score.

Review (Rating: 6)

In this work, the authors propose a new framework called Med-Adapter for PETL (Parameter Efficient Transfer Learning). They aim to use the easily available pretrained 2D Transformer models for the 3D segmentation task. They do this via fine-tuning of adapter blocks that they introduce into the Transformer architecture. The authors aim to bridge two gaps in fine-tuning: 1) the task gap that existing pretrained models are trained for classification but we would like to fine-tune them for segmentation; and 2) the modality gap that existing pretrained models are trained on 2D images while 3D medical volumes also have a temporal (depth) component. By conducting experiments on 3 datasets and 3 backbone architectures, the authors demonstrate that their fine-tuning method can achieve good segmentation performance in a parameter-efficient manner.

Strengths

  1. The authors solve the task gap (classification/segmentation) by introducing FFT and IFFT, which are lightweight and perform similarly to the attention mechanism.
  2. Since they deal with 3D data, they perform appropriate reshaping in order to use the existing 2D transformer blocks. In the Med-Adapter blocks that they introduce, they perform 3D convolution but use parameter-efficient versions such as depth-wise convolutions and 1 × K × K kernels as approximations.
  3. The authors do extensive experiments to validate their method. They experiment on 3 datasets (BraTS 2019, 2020 and KiTS 2019) and 3 backbones (ViT + UPerNet and Swin-T) and compare against several PETL baselines. Their method consistently achieves better performance than the baselines with far fewer trainable parameters.
  4. I appreciate the authors including experiments on several pretrained weights such as CLIP and the recent SAM.
  5. The authors perform appropriate ablation studies -- on the different branches in IntraFE; different ways to fuse features in InterFI; ratio used in reshaping; decoder architectures; and reduced data setting.

Weaknesses

  1. Could the authors discuss the applicability of their method to non-Transformer architectures such as large pretrained VGGs, ResNet50, InceptionV3 and so on? Currently the proposed method is only validated for Transformer architectures.
  2. While the proposed method achieves better performance compared to baselines, the novelty of the method seems limited. This is because depth-wise convolutions as well as 1 x K x K convolutions to approximate 3D convolutions already exist in the literature as ways to reduce parameter count. Furthermore, using FFT and IFFT for attention and reduced parameters also exists in the literature [1,2,3]. Hence the proposed method seems like an amalgamation of several existing concepts.

References

[1] Chi, Lu, Borui Jiang, and Yadong Mu. "Fast fourier convolution." Advances in Neural Information Processing Systems 33 (2020): 4479-4488.

[2] Yang, Yanchao, and Stefano Soatto. "Fda: Fourier domain adaptation for semantic segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[3] L. Bai, X. Lin, Z. Ye, D. Xue, C. Yao and M. Hui, "MsanlfNet: Semantic Segmentation Network With Multiscale Attention and Nonlocal Filters for High-Resolution Remote Sensing Images," in IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1-5, 2022, Art no. 6512405, doi: 10.1109/LGRS.2022.3185641.

Questions

  1. Please also see the weakness above.
  2. The final row in Table 3 corresponds to which method --- Table 1 or Table 9? Specifically, I cannot see in Table 1 or Table 9 where WT has a Dice value 90.05. I believe the numbers are supposed to be consistent.
  3. As standard deviation values have not been provided, please mention if the results in bold are just numerically better, or, if t-test [4] has been conducted to check if the performance improvement is statistically significant or not.

References

[4] Student, 1908. The probable error of a mean. Biometrika, pp.1–25.

Comment

Thanks very much for acknowledging the novelty and contributions of our paper, as well as the positive feedback "The authors solve the task gap (classification/segmentation) by introducing FFT and IFFT, which are lightweight and perform similarly to the attention mechanism. The authors do extensive experiments to validate their method. I appreciate the authors including experiments on several pretrained weights such as CLIP and the recent SAM. The authors perform appropriate ablation studies". We have addressed all your questions in detail in the Response section below, and we sincerely hope that our explanations and the supplementary experiments, conducted with our best efforts, will help us earn a raised evaluation score for our work.

We really appreciate your attention and positive comments on our experiments with CLIP and SAM, which demonstrate the extensibility of our approach: Med-Adapter can continuously achieve better performance with the development of generalized foundation models.

Our Responses to Paper Weaknesses:

[Q1]. Could the authors discuss the applicability of their method to non-Transformer architectures such as large pretrained VGGs, ResNet50, InceptionV3 and so on? Currently the proposed method is only validated for Transformer architectures.

[A1]. Thank you for posing this question. We have already conducted experiments with CNN architectures, specifically ResNet-34, as shown in the table below. However, the performance of the Med-Adapter was suboptimal in these trials. We believe the disparity in implementation details between our Med-Adapter for CNN and Transformer architectures is a contributing factor: when adapting our Med-Adapter to CNN architectures while preserving our original structural design as much as possible, the original feature maps must be reshaped into a token sequence before being fed into the down-projection layer, which inevitably loses spatial position information and consequently degrades model performance. Therefore, we did not add these results to our revised manuscript.

Table 1: Results on BraTS 2019 using a Res-UNet model with ResNet-34 backbone (pre-trained on ImageNet-1k).
—————————————————————————————————————————————————————————————————————————————————————————————
| Method      | Dice(ET) | Dice(WT) | Dice(TC) | Hausdorff(ET)| Hausdorff(WT)| Hausdorff(TC)|
—————————————————————————————————————————————————————————————————————————————————————————————
| Full        |  76.92   |  77.71   |  89.28   |    4.755     |    6.291     |    8.010     |
—————————————————————————————————————————————————————————————————————————————————————————————
| VPT-Below   |  77.24   |  76.45   |  88.81   |    4.823     |    7.288     |    7.992     |
| Adapter     |    -     |    -     |    -     |      -       |      -       |      -       |
| AdaptFormer |    -     |    -     |    -     |      -       |      -       |      -       |
| Pro-tuning  |  75.08   |  75.45   |  88.60   |    6.790     |    8.167     |    8.083     |
| ST-Adapter  |    -     |    -     |    -     |      -       |      -       |      -       |
—————————————————————————————————————————————————————————————————————————————————————————————
| Ours        |  75.95   |  75.85   |  88.49   |    5.881     |    8.813     |    7.652     |
—————————————————————————————————————————————————————————————————————————————————————————————
Comment

[Q2]. While the proposed method achieves better performance compared to baselines, the novelty of the method seems limited. This is because depth-wise convolutions as well as 1 x K x K convolutions to approximate 3D convolutions already exist in the literature as ways to reduce parameter count. Furthermore, using FFT and IFFT for attention and reduced parameters also exists in the literature [1,2,3]. Hence the proposed method seems like an amalgamation of several existing concepts.

[1] Chi, Lu, Borui Jiang, and Yadong Mu. "Fast fourier convolution." Advances in Neural Information Processing Systems 33 (2020): 4479-4488.

[2] Yang, Yanchao, and Stefano Soatto. "Fda: Fourier domain adaptation for semantic segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[3] L. Bai, X. Lin, Z. Ye, D. Xue, C. Yao and M. Hui, "MsanlfNet: Semantic Segmentation Network With Multiscale Attention and Nonlocal Filters for High-Resolution Remote Sensing Images," in IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1-5, 2022, Art no. 6512405, doi: 10.1109/LGRS.2022.3185641.

[A2]. The depth-wise convolutions and FFT/IFFT employed in our approach are indeed inspired by existing methods. However, references [1-3] focus on modifying the overall model structure, while our work presents the first study of parameter-efficient fine-tuning for medical volumetric segmentation.

​ On the one hand, although channel-wise separable convolutions are no longer a novelty in the evolution of model backbone design, their potential in the realm of parameter-efficient transfer learning has remained underexplored. Channel-wise separable convolutions are highly efficient in achieving a lightweight model structure; thus, in this paper, we are among the first few works to investigate their potential for 3D medical image segmentation through parameter-efficient transfer learning. Building on channel-wise separable convolutions, we propose a new Med-Adapter for PETL, serving as a plug-and-play component that addresses both multi-scale representations and inter-slice correlations. With our well-designed Med-Adapter, we introduce a new framework, namely Med-Tuning, which achieves a balance between segmentation accuracy and parameter efficiency. Therefore, overall, we believe the novelty of our work should not be overshadowed by the incorporation of channel-wise separable convolutions.
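Concretely, the decomposition used in the local branches follows this pattern (a minimal sketch matching the ConvK description in our ablation tables; the helper name is illustrative):

```python
import torch.nn as nn

def decomposed_dw_conv3d(channels: int, k: int) -> nn.Sequential:
    """Illustrative sketch: approximate a KxKxK depth-wise 3D conv
    with an in-plane 1xKxK conv followed by an inter-slice Kx1x1
    conv, reducing per-channel weights from K^3 to K^2 + K."""
    return nn.Sequential(
        nn.Conv3d(channels, channels, kernel_size=(1, k, k),
                  padding=(0, k // 2, k // 2), groups=channels),
        nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                  padding=(k // 2, 0, 0), groups=channels),
    )
```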

​ On the other hand, our work does indeed share some relevance with previous works (e.g., GFNet [2] mentioned by Reviewer SCo3), as both leverage the intrinsic global properties of the FFT. However, the use of traditionally powerful modules like the FFT does not imply a lack of innovation in our research. The focal points of these previous works and of ours in incorporating FFT/IFFT are quite distinct. GFNet [2] primarily emphasizes modifications to the overall model structure with FFT/IFFT, whereas our study focuses on the first exploration of FFT/IFFT in parameter-efficient fine-tuning for the medical volumetric segmentation task. The fundamental difference from GFNet [2] is that we essentially utilize the parameter efficiency of the FFT for global feature modeling, whereas GFNet [2] solely exploits its intrinsic global properties. In our proposed Med-Adapter, the FFT is introduced to replace large-kernel convolutions or attention mechanisms, creating a parameter-efficient global branch. In this context, FFT/IFFT serves as an important, yet partial, component of Med-Adapter. It is worth noting that we do not highlight this aspect as the main innovative element of our paper. Thus, we believe that evaluating our work as lacking technical novelty solely based on our use of the FFT is not appropriate.

[Q3]. The final row in Table 3 corresponds to which method --- Table 1 or Table 9? Specifically, I cannot see in Table 1 or Table 9 where WT has a Dice value 90.05. I believe the numbers are supposed to be consistent.

[A3]. The results presented in Table 3 were obtained through five-fold cross-validation on the BraTS 2019 training set, as emphasized at the beginning of Sec. 4.2 and 4.3. Conversely, the models in Table 1 were trained on the official BraTS 2019 training set and subsequently submitted to the official website to acquire results on the BraTS 2019 validation set. These two experiments are intentionally distinct; consistency between them was not a requirement. The official evaluation process was employed to ensure the validity of the SOTA comparison results in Table 1 and Table 9. We sincerely regret any confusion and acknowledge the need for clarity.

Comment

[Q4]. As standard deviation values have not been provided, please mention if the results in bold are just numerically better, or, if t-test [4] has been conducted to check if the performance improvement is statistically significant or not.

[4] Student, 1908. The probable error of a mean. Biometrika, pp.1–25.

[A4]. To clarify, in the quantitative comparison with state-of-the-art models, as illustrated in Tables 1, 2, and 9 of our manuscript, we report the mean of results obtained from five replicated experiments. For the ablation studies, we use the average results derived from 5-fold cross-validation. Here, per your valuable suggestion, we provide the standard deviation results for selected experiments in Table 2 below (which corresponds to Table 1 in the main text) for reference. It is evident from the data that the bold results not only exhibit numerical superiority but also demonstrate statistical advantages.

Table 2: Performance comparison on BraTS2019 with ViT-B/16 backbone. Numbers in () indicate standard deviations and numbers outside () indicate Dice or Hausdorff scores.
——————————————————————————————————————————————————————————————————————————————————————————————————————
| Method      |   Dice(ET)  |   Dice(WT)  |   Dice(TC)  | Hausdorff(ET)| Hausdorff(WT)| Hausdorff(TC)|
——————————————————————————————————————————————————————————————————————————————————————————————————————
| Scratch     | 64.96(0.24) | 83.03(0.14) | 71.34(0.32) | 7.635(0.16)  | 10.602(0.50) | 10.942(0.19) | 	 	 
| Full        | 68.49(0.17) | 85.56(0.04) | 75.12(0.46) | 6.672(0.49)  | 7.878(0.16)  | 10.525(0.09) |
| Head        | 65.71(0.03) | 84.19(0.06) | 74.77(0.24) | 6.128(0.30)  | 7.505(0.04)  | 7.864(0.32)  |
——————————————————————————————————————————————————————————————————————————————————————————————————————
| VPT-Shallow | 66.02(0.09) | 84.72(0.11) | 75.84(0.23) | 6.114(0.09)  | 7.506(0.02)  | 8.471(0.01)  |
| VPT-Deep    | 67.01(0.18) | 85.14(0.03) | 76.80(0.31) | 6.064(0.17)  | 7.717(0.13)  | 7.648(0.09)  |
| Adapter     | 68.30(0.10) | 85.37(0.12) | 77.05(0.34) | 5.501(0.03)  | 7.636(0.05)  | 7.986(0.31)  |
| AdaptFormer | 65.88(0.13) | 84.34(0.12) | 74.77(0.50) | 6.652(0.57)  | 8.204(0.03)  | 8.430(0.23)  |
| Pro-tuning  | 67.18(0.13) | 85.32(0.03) | 76.51(0.20) | 5.805(0.01)  | 7.073(0.27)  | 7.564(0.26)  |
| ST-Adapter  | 69.18(0.15) | 86.27(0.03) | 79.18(0.06) | 6.077(0.04)  | 6.939(0.02)  | 6.778(0.28)  |
——————————————————————————————————————————————————————————————————————————————————————————————————————
| Ours        | 70.53(0.10) | 86.58(0.03) | 79.35(0.05) | 5.862(0.08)  | 6.224(0.03)  | 6.947(0.11)  |
——————————————————————————————————————————————————————————————————————————————————————————————————————
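As an illustration of how such a significance check can be run on per-run scores (a minimal sketch; the score arrays below are placeholders consistent with the reported means and standard deviations, not our raw experimental logs):

```python
import numpy as np
from scipy.stats import ttest_rel

# illustrative Dice(ET) values for five repeated runs (placeholders)
ours = np.array([70.44, 70.51, 70.57, 70.49, 70.64])
st_adapter = np.array([69.01, 69.15, 69.30, 69.12, 69.32])

t_stat, p_value = ttest_rel(ours, st_adapter)  # paired Student t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 => significant
```
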
Review (Rating: 5)

With the recent success of foundation models, there is an increasing interest in fine-tuning these models for various problems. So, how to do the fine-tuning is a valid question, and updating as few parameters as possible during fine-tuning is vital for limited labeled data regimes and computation costs. With this motivation, the paper proposes a plug-and-play block for parameter-efficient fine-tuning of neural networks. The proposed block can be added to any architecture, and only the block's parameters (along with the decoder added for segmentation) are updated during fine-tuning. The block consists of inner blocks: intra-stage feature enhancement (Intra-FE) and inter-stage feature interaction (Inter-FI). In Intra-FE, local features are extracted using channel-wise separable convolutions while the FFT is used for global features. The results of each branch in Intra-FE are summed and passed through a 1x1x1 convolution. The output of Intra-FE is given to the Inter-FI block, where it is concatenated with the features from the previous stage.

The experiments are conducted on 3 different datasets for segmentation, and the results show slight improvement compared to the state-of-the-art in terms of Dice and Hausdorff metrics, with a better accuracy-efficiency tradeoff.

Strengths

  • The paper presents extensive experiments on three different medical image segmentation datasets. Additionally, it presents a quite extensive ablation study to show the contribution of each component of the proposed block.

  • The accuracy/efficiency trade-off is improved significantly compared to the state-of-the-art methods.

Weaknesses

  • The experiments are performed with k-fold cross validation; however, only the average Dice score/Hausdorff distance results are presented in the tables. Especially, given that the quantitative results are very close to some of the existing methods, standard deviation results (or statistical significance test) would be helpful to interpret the results better.

  • The contribution of the branches in the Intra-FE block is not very significant, according to Table 3. There is a very small difference between the first row and the third row. Standard deviation results are also required for this experiment. The most significant contribution appears to come from the CM, according to the table. I would expect to see Conv3-Conv5 with CM to understand the contribution of the FFT block better. If the results are not different from Conv3-Conv5-FFT-CM, then we can conclude that the FFT branch doesn't contribute much. It is crucial to see this since FFT is proposed as one of the main contributions of the block.

  • The contribution of the Inter-FI block is not significant, according to Table 4. Standard deviation results should also be added for this experiment.

  • I also have concerns about the novelty of the proposed block. Channel-wise separable convolutions have already been widely used for parameter-efficient training, e.g. [1]. Also, given that the contributions of the FFT branch and Inter-FI block are very small, it seems that the main improvement in accuracy comes from the channel-wise separable convolutions, which are already an existing technique.

[1] Chollet, Xception: Deep Learning with Depthwise Separable Convolutions - https://arxiv.org/abs/1610.02357

  • In the current experiments, the proposed block is added after each block. How would the results change if it is used e.g. after only a few earlier blocks?

Questions

Please address my comment in the weaknesses section.

Comment

We sincerely appreciate your recognition of the novelty and contributions of our paper, as well as the positive comments "The paper presents extensive experiments on three different medical image segmentation datasets. Additionally, it presents a quite extensive ablation study to show the contribution of each component of the proposed block" and "The accuracy/efficiency trade-off is improved significantly compared to the state-of-the-art methods.". We have addressed all of your questions in detail in the following Response section and will incorporate all feedback in the final version. We genuinely hope that our detailed explanations and the additionally supplemented experiments, conducted with our best efforts, will persuade you to raise the evaluation score of our work.

Our Responses to Paper Weaknesses:

[Q1]. The experiments are performed with k-fold cross validation; however, only the average Dice score/Hausdorff distance results are presented in the tables. Especially, given that the quantitative results are very close to some of the existing methods, standard deviation results (or statistical significance test) would be helpful to interpret the results better.

[A1]. Thanks for this helpful suggestion. To clarify, in the quantitative comparison with state-of-the-art models, as illustrated in Tables 1, 2, and 9 of our manuscript, we report the mean of results obtained from five replicated experiments. For the ablation studies, we use the average results derived from 5-fold cross-validation. Per your advice, we have added the standard deviation results to the corresponding Table 1, listed below for your convenience.

Table 1: Performance comparison on BraTS2019 with ViT-B/16 backbone. Numbers in () indicate standard deviations and numbers outside () indicate Dice or Hausdorff scores.
——————————————————————————————————————————————————————————————————————————————————————————————————————
| Method      |   Dice(ET)  |   Dice(WT)  |   Dice(TC)  | Hausdorff(ET)| Hausdorff(WT)| Hausdorff(TC)|
——————————————————————————————————————————————————————————————————————————————————————————————————————
| Scratch     | 64.96(0.24) | 83.03(0.14) | 71.34(0.32) | 7.635(0.16)  | 10.602(0.50) | 10.942(0.19) | 	 	 
| Full        | 68.49(0.17) | 85.56(0.04) | 75.12(0.46) | 6.672(0.49)  | 7.878(0.16)  | 10.525(0.09) |
| Head        | 65.71(0.03) | 84.19(0.06) | 74.77(0.24) | 6.128(0.30)  | 7.505(0.04)  | 7.864(0.32)  |
——————————————————————————————————————————————————————————————————————————————————————————————————————
| VPT-Shallow | 66.02(0.09) | 84.72(0.11) | 75.84(0.23) | 6.114(0.09)  | 7.506(0.02)  | 8.471(0.01)  |
| VPT-Deep    | 67.01(0.18) | 85.14(0.03) | 76.80(0.31) | 6.064(0.17)  | 7.717(0.13)  | 7.648(0.09)  |
| Adapter     | 68.30(0.10) | 85.37(0.12) | 77.05(0.34) | 5.501(0.03)  | 7.636(0.05)  | 7.986(0.31)  |
| AdaptFormer | 65.88(0.13) | 84.34(0.12) | 74.77(0.50) | 6.652(0.57)  | 8.204(0.03)  | 8.430(0.23)  |
| Pro-tuning  | 67.18(0.13) | 85.32(0.03) | 76.51(0.20) | 5.805(0.01)  | 7.073(0.27)  | 7.564(0.26)  |
| ST-Adapter  | 69.18(0.15) | 86.27(0.03) | 79.18(0.06) | 6.077(0.04)  | 6.939(0.02)  | 6.778(0.28)  |
——————————————————————————————————————————————————————————————————————————————————————————————————————
| Ours        | 70.53(0.10) | 86.58(0.03) | 79.35(0.05) | 5.862(0.08)  | 6.224(0.03)  | 6.947(0.11)  |
——————————————————————————————————————————————————————————————————————————————————————————————————————
Comment

[Q2]. The contribution of the branches in the Intra-FE block is not very significant, according to Table 3. There is a very small difference between the first row and the third row. Standard deviation results are also required for this experiment. The most significant contribution appears to come from the CM, according to the table. I would expect to see Conv3-Conv5 with CM to understand the contribution of the FFT block better. If the results are not different from Conv3-Conv5-FFT-CM, then we can conclude that the FFT branch doesn't contribute much. It is crucial to see this since FFT is proposed as one of the main contributions of the block.

[A2]. Per your valuable advice, we have supplemented the corresponding experiments and results in Table 2 below; the comparison of the second, third and fourth rows clearly shows the contributions of the introduced branches (including the parameter-efficient FFT branch) in our Intra-FE block.

Table 2: Ablation study on our Intra-FE with Swin-T backbone. ConvK denotes two cascaded 3D depth-wise convolutions with kernel sizes of 1×K×K and K×1×1 respectively; CM indicates the channel mixing operation by a 1×1×1 convolution. Numbers in () indicate standard deviations and numbers outside () indicate Dice scores.
—————————————————————————————————————————————————————————————————————————
| Conv3 | Conv5 | FFT | CM |   Dice(ET)   |   Dice(WT)   |   Dice(TC)   |
—————————————————————————————————————————————————————————————————————————
|   √   |   -   |  -  |  - |  75.42(0.25) |  89.77(0.14) |  80.22(0.30) |
|   √   |   √   |  -  |  - |  75.19(0.23) |  89.44(0.14) |  80.89(0.21) |
|   √   |   √   |  √  |  - |  75.30(0.20) |  89.93(0.14) |  81.93(0.24) |
|   √   |   √   |  -  |  √ |  75.23(0.18) |  89.64(0.12) |  81.25(0.19) | (newly added)
|   √   |   √   |  √  |  √ |  77.10(0.25) |  90.05(0.11) |  81.02(0.14) | (default)
—————————————————————————————————————————————————————————————————————————

[Q3]. The contribution of the Inter-FI block is not significant, according to Table 4. Standard deviation results should also be added for this experiment.

[A3]. We have also added the standard deviation results for our Inter-FI block in Table 3 below, which better demonstrate the effectiveness of our Inter-stage Feature Interaction (Inter-FI) design.

Table 3: Ablation study on inter-stage feature interaction with Swin-T backbone. Swin-UNet with Swin-T pretrained on supervised ImageNet-1k is taken as a baseline. Numbers in () indicate standard deviations and numbers outside () indicate Dice scores.
———————————————————————————————————————————————————————————
| Method     |   Dice(ET)   |   Dice(WT)   |   Dice(TC)   |
———————————————————————————————————————————————————————————
| Intra-only |  77.10(0.25) |  90.05(0.11) |  81.02(0.14) |
| Add        |  75.79(0.25) |  88.99(0.10) |  79.00(0.19) |
| Max        |  75.22(0.16) |  89.72(0.16) |  81.41(0.29) |
| Concat     |  77.22(0.28) |  90.09(0.11) |  81.59(0.17) | (default)
———————————————————————————————————————————————————————————
Comment

[Q4]. I also have concerns about the novelty of the proposed block. Channel-wise separable convolutions have already been widely used for parameter-efficient training, e.g. [1]. Also, given that the contributions of the FFT branch and Inter-FI block are very small, it seems that the main improvement in accuracy comes from the channel-wise separable convolutions, which are already an existing technique. ([1] Chollet, Xception: Deep Learning with Depthwise Separable Convolutions - https://arxiv.org/abs/1610.02357)

[A4]. Although channel-wise separable convolutions are no longer a novelty in the evolution of model backbone design, their potential in the realm of parameter-efficient transfer learning has remained underexplored. Channel-wise separable convolutions are highly efficient in achieving a lightweight model structure; thus, in this paper, we are among the first few works to investigate their potential for 3D medical image segmentation through parameter-efficient transfer learning. Building on channel-wise separable convolutions, we propose a new Med-Adapter for PETL, serving as a plug-and-play component that addresses both multi-scale representations and inter-slice correlations. With our well-designed Med-Adapter, we introduce a new framework, namely Med-Tuning, which achieves a balance between segmentation accuracy and parameter efficiency. Therefore, overall, we believe the novelty of our work should not be overshadowed by the incorporation of channel-wise separable convolutions.

[Q5]. In the current experiments, the proposed block is added after each block. How would the results change if it is used, e.g., after only a few earlier blocks?

[A5]. Thank you for your great interest in our work. To recap, the overall architecture of our method, Med-Tuning, consists of a commonly utilized decoder and a 2D Transformer backbone G pre-trained on large-scale natural images. As shown in Fig. 3, G has N stages and the n-th stage (n=1,2,...,N) has a specific number of Transformer blocks; our proposed Med-Adapters are integrated right after each Transformer block. Following your suggestion, we have conducted additional experiments to report the segmentation performance when the module is inserted after only the blocks of a few earlier stages. As the results in Table 4 below show, inserting our module only in the initial stages consistently degrades model performance, and none of these settings matches our current best configuration.

Table 4: Ablation study on the position of inserted blocks with Swin-T backbone. Given that the Swin-T encoder has four consecutive stages, `√` indicates that the Med-Adapter is inserted in the corresponding stage.
———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
| Stage n=0 | Stage n=1 | Stage n=2 | Stage n=3 | Dice(ET) | Dice(WT) | Dice(TC) | Hausdorff(ET)| Hausdorff(WT)| Hausdorff(TC)|
———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
|     -     |     -     |     -     |     -     |   78.07  |   88.68  |   77.26  |     5.02     |     6.70     |     7.10     |
|     √     |     -     |     -     |     -     |     -    |     -    |     -    |       -      |      -       |      -       |
|     √     |     √     |     -     |     -     |   74.83  |   87.09  |   72.94  |     7.26     |    13.12     |    10.17     |
|     √     |     √     |     √     |     -     |   75.60  |   86.79  |   73.41  |     8.44     |    12.32     |    11.24     |
|     √     |     √     |     √     |     √     |   78.51  |   89.68  |   80.44  |     4.00     |     5.52     |     5.76     |
———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Comment

Thanks to the authors for addressing most of my comments. I have also read the other reviewers' comments and the authors' responses, and my concern regarding the level of contribution remains. Although I am inclined to increase my initial score, I still think the contribution is not significant enough to present this work at ICLR.

Comment

We thank the reviewers for the valuable feedback, and for the positive comments on the research perspective or novelty (Reviewer y6pK, Reviewer SCo3, Reviewer bQHn), SOTA performance (Reviewer g2xi, Reviewer y6pK, Reviewer SCo3), paper organization (Reviewer g2xi, Reviewer y6pK, Reviewer SCo3, Reviewer bQHn) and extensive evaluation (Reviewer g2xi, Reviewer y6pK, Reviewer SCo3). We address all the reviewers' comments below and have incorporated all feedback in the revised manuscript in blue. We sincerely hope that our detailed rebuttal will dispel any lingering uncertainties reviewers may have regarding our manuscript, thereby contributing positively to the final evaluation.

Please allow us to emphasize once again the main contributions of this work:

(1) We present a study on PETL for medical volumetric segmentation and propose a new framework, Med-Tuning, achieving a trade-off between segmentation accuracy and parameter efficiency.

(2) A new Medical Adapter (Med-Adapter) is proposed for PETL as a plug-and-play component that simultaneously considers multi-scale representations and inter-slice correlations.

(3) Our framework is generic and flexible, and can be easily integrated with common Transformer-based architectures to greatly reduce training costs while boosting model performance.

(4) Extensive experiments on three benchmark datasets with different modalities (e.g., CT and MRI) validate the effectiveness of our Med-Tuning over full fine-tuning and previous PETL methods for medical volumetric segmentation.