PaperHub
Score: 7.8 / 10
Decision: Poster · 4 reviewers
Ratings: 3, 6, 5, 5 (min 3, max 6, std. dev. 1.1)
Confidence: 3.3 · Novelty: 3.0 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Incomplete Modalities · Self-supervised Learning · Hierarchical Compensation

Reviews and Discussion

Official Review
Rating: 3

The paper presents UniMRSeg, a unified multi-modal image segmentation approach that can be trained to be robust to missing modalities at inference. The main contribution is Hierarchical Self-Supervised Compensation (HSSC), a three-step training scheme that aims to make representations generated from incomplete modalities comparable to those generated from complete ones. The process works as follows:

(1) The pre-training stage is based on masked modality reconstruction and aggressive data augmentation (shuffling, masking) to develop cross-modal correspondences.

(2) A further stage drops modality-specific supervision and contrastively aligns full and incomplete modalities in the feature space, while simultaneously enforcing downstream supervision on the segmentation task.

(3) The last stage refines the model: it freezes the encoder parameters and introduces two new mechanisms, a so-called reverse attention adapter that inverts the spatial attention to leverage under-attended spatial knowledge, and prediction-level consistency constraints that distill knowledge from the full-modality reference.

The key strength of the proposed framework is that a single parameter-shared model can process an arbitrary subset of available modalities, a property supported by empirical findings on brain tumor, salient object, and semantic segmentation benchmarks.

Strengths and Weaknesses

The paper provides a concise but serious study of the problem of missing data in multi-modal learning: a clinically and scientifically significant issue. Its most significant contribution is a single framework that is claimed to perform inference without full modal information. Trained through a three-stage pipeline, UniMRSeg utilises masked reconstruction, contrastive learning, and distillation modules that operate across representation layers featuring different SNR regimes. State-of-the-art performance with low variance on four representative tasks, including brain tumour detection, salient object extraction, RGB-D segmentation and RGB-T semantic segmentation, empirically confirms the generalisability of the proposed method across cohorts.

Regardless of these strengths, a number of caveats should be mentioned. Methodologically, the suggested framework is predominantly built from existing self-supervised frameworks; the main novelty lies in their sequential orchestration rather than in the components themselves. The training pipeline is also quite complex: three consecutive optimisation processes must be run, with repeated freezing and unfreezing of various components of the network. This complexity can threaten reproducibility, increase hyperparameter sensitivity, and make practical deployment challenging. The ablation insights provided by the authors, interesting as they are, do not explain the need for such an ordered pipeline; a direct comparison with an end-to-end architecture would have made clear the marginal gains brought by the intermediate contrastive losses.

Questions

The introduced three-stage pipeline is the main contribution of the paper but, at the same time, a challenging element because of its complexity. A stronger case should be made for why this sequential arrangement is preferable to a more direct, one-shot, end-to-end training regime that consolidates all the suggested losses (reconstruction, contrastive, and consistency) as multi-task objectives. An ablation comparing the multi-stage procedure with such a single-stage alternative would go a long way toward justifying the added complexity.

The reverse attention adapter also needs clarification. The authors describe the mechanism as "reverse attention", but its mathematical and intuitive definitions remain ambiguous. What exactly is reversed? What does it mean to "explicitly compensate the weak perceptual semantics"? Comparing this design with a more standard attention mechanism that learns to emphasize prominent shared features, together with an ablation analysis, would help justify the modification.

Additional ablations would help isolate the contribution of individual stages better. For example, how does a baseline model trained purely with the Stage 3 machinery (adapters + consistency loss) fine-tune without Stage 1 and 2 pre-training? Such a comparison would reveal the contribution of pre-training relative to the final refinement stage. It would also be informative to benchmark a baseline model pre-trained only with Stage 1 reconstruction and then fine-tuned on segmentation in the standard way, without Stage 2 or 3.

Concerning the baseline comparisons with previous works, as shown in Tables 1 and 2, more information is needed. Were the baseline models trained end-to-end as in their papers, or under a multi-stage or pre-training regime to be fairly comparable? If the baselines were only trained end-to-end, the reported improvements of UniMRSeg could, in part, be attributed to extensive pre-training rather than solely to architectural or methodological advances.

Limitations

The quite complex training pipeline is acknowledged by the authors as a limitation in the appendix. While this observation is relevant, the issue should be highlighted more prominently in the body of the paper, since it bears on the viability and usability of the approach. The proposed future work on simplifying the process is properly credited, but it also underscores the complexity of the current design.

Formatting Issues

None.

Author Response

Dear Reviewer mFFe,

We sincerely thank you for your valuable comments and questions, which are very helpful for improving our work. We address your questions as follows.


[W1, the suggested framework is predominantly built from existing self-supervised frameworks; the main novelty lies in their sequential orchestration rather than in the components themselves.]

Due to the character limit in this rebuttal, we kindly refer you to our response to Reviewer sDPL's [Q1 W1], where we address these concerns in detail.


[Q1, the introduced three-stage pipeline ... a more direct, one-shot, end-to-end training regime that consolidates all the suggested losses (reconstruction, contrastive, and consistency) as multi-task objectives ...]
[W2, the training pipeline is also quite complex ...]
[W4, a direct comparison with an end-to-end architecture ...]

Thank you for your suggestion. Prior to submission, we actually conducted the single-model unified training experiment. However, we chose not to include it in the paper, as our intention was to help readers focus more clearly on the self-supervised benefits brought by each stage individually. This is because the distinction between multi-stage and single-stage training essentially reflects a difference between self-supervised representation learning and multi-task learning.

(1) Self-supervised Representation Learning (Three Stages)

Our work is built upon a self-supervised pretraining framework, with training initialized from scratch starting at Stage 1. For each task (e.g., BraTS 2020), self-supervised training is conducted solely on the task’s own training set, rather than relying on externally labeled datasets such as ImageNet-pretrained weights for initialization. Our three-stage design includes:

  • Stage 1: Pretrain both the encoder and decoder through arbitrary modality reconstruction.
  • Stage 2: Pretrain the encoder with contrastive learning. The segmentation task here is only used to guide the contrastive objective, not as a final goal.
  • Stage 3: Perform the downstream segmentation task.

This design explicitly aims to reduce the representation gap between complete and incomplete modalities in the encoder-decoder space during the final segmentation stage. If all these tasks are trained jointly in a single-stage model, it would fall into the scope of multi-task learning, rather than self-supervised pretraining.
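The Stage 2 contrastive objective (the NT-Xent loss mentioned below) can be sketched minimally as follows; the pooled feature shapes and the temperature value are illustrative assumptions, not our exact implementation:

```python
import torch
import torch.nn.functional as F

def nt_xent(feat_inc, feat_com, temperature=0.1):
    """NT-Xent loss treating the (incomplete, complete) encoder features of
    the same sample as a positive pair and all other samples in the batch
    as negatives. feat_inc, feat_com: (B, D) pooled encoder features."""
    z1 = F.normalize(feat_inc, dim=1)
    z2 = F.normalize(feat_com, dim=1)
    z = torch.cat([z1, z2], dim=0)          # (2B, D)
    sim = z @ z.t() / temperature           # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))       # exclude self-similarity
    B = z1.size(0)
    # positives: the i-th incomplete view pairs with the i-th complete view
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(B)])
    return F.cross_entropy(sim, targets)
```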

(2) Multi-task Learning (Single Stage)

In multi-task learning, the goal is to leverage multiple complementary clues for collaborative learning. These clues can come from the data level [1,2,3], or from structural supervision types [4,5,6]. Most approaches often incorporate deliberate designs for task-sharing and task-specific components to enable effective joint prediction across tasks. In this work, if the three stages are forcibly merged into a single training stage, the following issues will arise:

  • The model is trained with inputs that involve Random Modality Dropout, Random Modality Shuffle, and Random Spatial Masking. The encoder is supervised using the NT-Xent contrastive loss, an adapter is incorporated, and the model is simultaneously tasked with both segmentation and reconstruction. Such a fully entangled training setup lacks a clear task hierarchy, and there is no explicit coordination among the input, encoder, decoder, and output.
  • When all tasks are treated equally without ordering, joint optimization becomes highly difficult. In the unified single-stage training attempt, the total loss involves six parts. The results on BraTS 2020 are as follows:

Setting | Whole Dice (%) ↑ | Core Dice (%) ↑ | Enhancing Dice (%) ↑ | Training Stability | Convergence
Three Stages (Ours) | 80.64 | 73.33 | 63.10 | ✓ Stable | ✓ Easy
Single Stage (Unified Loss) | 20.32 | 13.67 | 10.03 | ✗ Unstable | ✗ Poor

We observed the following phenomena during the experiments:

  • The single stage model failed to converge, with loss plateauing early.
  • The optimization process was highly unstable, with different losses fluctuating in turn and failing to decrease consistently together.

These phenomena and performance clearly indicate that unified single-stage training is unable to achieve effective coordination among the various designs and supervision signals. The lack of a clear training order, combined with competing optimization objectives, leads to mutual interference.

In summary, we would like to clarify two potential misconceptions:

  1. The three-stage training is not overly complex. Both theoretically and practically, it is a simpler and more stable design, especially when compared to directly mixing all objectives and modules in a single unified training phase, which leads to optimization difficulties.
  2. The three stages are not repeating the same task. Instead, they form a self-supervised learning (SSL) pipeline that does not rely on extra data, and progressively reduce the gap between incomplete and complete modalities at the input, feature, and output levels to improve segmentation performance.

References:
[1] Segment Anything, ICCV 2023.
[2] SegGPT: Towards Segmenting Everything in Context, ICCV 2023.
[3] Spider: A Unified Framework for Context-dependent Concept Segmentation, ICML 2024.
[4] Deep Multitask Learning with Progressive Parameter Sharing, ICCV 2023.
[5] MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders, ECCV 2024.
[6] Multi-Task Dense Prediction via Mixture of Low-Rank Experts, CVPR 2024.


[Q2, ... What exactly is reversed? What does it mean to "explicitly compensate the weak perceptual semantics"? Comparing this design with a more standard attention mechanism ... an ablation analysis would help justify the modification.]

(1) Definition of “Reverse” in Reverse Attention

"Reverse" refers to inverting the standard self-attention response. Traditional self-attention emphasizes high-activation (salient) regions. However, under missing modality conditions, some important semantic cues may receive low activation and be underrepresented. We compute reverse attention as 1 - self-attention to explicitly highlight these under-attended regions, enabling targeted semantic compensation.

(2) Purpose of Compensating Weak Perceptual Semantics

Ideally, if the features from incomplete and complete modalities are well aligned, their self-attention maps should exhibit similar activation distributions. In practice, due to missing inputs, incomplete modality features often lack activation in key semantic regions. These low-response areas typically contain complementary or suppressed information that standard attention overlooks. To address this, we introduce an adapter in Stage 3, trained to recover semantic cues from these weakly activated regions. Guided by reverse attention, the adapter focuses on residual perceptual gaps and compensates them explicitly. This enhances semantic completeness and improves prediction consistency under missing-modality scenarios.

(3) Ablation Study

We have conducted a detailed ablation analysis of the Reverse Attention Adapter (RAA) as shown in Table 4. The "mutual attention" setting corresponds to a standard attention mechanism implemented with a 3D Swin Transformer. The results show that removing reverse attention ("w/o Reverse Attention") leads to a significant performance drop for UniMRSeg, while the performance gap between "w/o Reverse Attention" and "w/o Reverse Attention + Mutual Attention" is minimal. This strongly supports the effectiveness of the proposed reverse attention design.

In addition, we refer you to our response to Reviewer sDPL’s [Q1 W1] "(3) Lightweight Reverse Attention Adapter (RAA)", which further clarifies the motivation and advantages of the RAA.


[Q3, Additional ablations would help isolate the contribution of individual stages better ...]
[W3, the ablation insights provided by the authors ... do not explain the need for such an ordered pipeline.]

We fully agree on the importance of isolating the contribution of individual stages. Accordingly, we have already conducted detailed ablation experiments covering each stage independently, as well as all their combinations, which are presented in Table 5 of the submitted version. We kindly invite you to check it.


[Q4, ... Were the baseline models trained end-to-end as in their papers, or under a multi-stage or pre-training regime to be fairly comparable? ... improvements of UniMRSeg could, in part, be attributed to extensive pre-training rather than solely to architectural or methodological advances.]

First, we would like to clarify that UniMRSeg does not use any external data in any stage of training. The different UniMRSeg models used for various segmentation tasks (Brain Tumor Segmentation / RGB-D SOD / RGB-T SOD / RGB-D Semantic Segmentation) are trained on the corresponding task’s training set in all three stages.
Therefore, the performance gains of UniMRSeg are not attributable to large-scale pretraining, but rather to the proposed architectural and learning design.

Second, to ensure fair comparison across methods in Tables 1, 2, 6, and 7, all baseline models are evaluated under the same training dataset splits. Specifically:

  • If a method reports results using the same dataset splits as ours, we directly cite the numbers from the original paper.
  • If the results are missing (e.g., for missing modality settings), or if different splits are used, we retrain the models using their released code, following the original training protocols as closely as possible.

This ensures that all results are evaluated under consistent and fair conditions.

Comment

Dear Reviewer mFFe,

As we are entering the final days of the author-reviewer discussion period, we are not sure whether our rebuttals have addressed your concerns. We hope not to miss the opportunity to engage in a constructive exchange with you.

Thank you again for your time and consideration.

Best regards,
The Authors

Comment

Dear Reviewer mFFe,

Thanks for your constructive comments and valuable suggestions to improve this paper.

As we are entering the final days of the author-reviewer discussion period, we are not sure whether our rebuttals have addressed your concerns. We would greatly appreciate it if you could spare some time to reply.

Thank you again for your time and consideration.

Best regards,

The Authors

Official Review
Rating: 6

This paper proposes UniMRSeg, a unified segmentation framework designed to handle arbitrary missing modality combinations in multi-modal image segmentation tasks. UniMRSeg introduces a Hierarchical Self-Supervised Compensation (HSSC) strategy at three levels: (1) Input-level modality reconstruction using hybrid shuffled-masking; (2) Feature-level contrastive learning to obtain modality-invariant representations; (3) Output-level consistency enforced by lightweight reverse attention adapters. The model is validated across four benchmark tasks and achieves state-of-the-art performance with minimal variance across 21 missing modality combinations using a single unified model per task.

优缺点分析

Strengths:

1. The motivation is good. The modality-relax setting addresses a highly practical and important real-world challenge, particularly in clinical scenarios with incomplete or corrupted data. UniMRSeg provides an effective and scalable solution to this issue.

2. The HSSC is novel. The proposed Hierarchical Self-Supervised Compensation strategy is a rare and elegant integration of multiple self-supervised mechanisms into a unified framework. Its design is well-aligned with the goal of modality-relax segmentation and contributes broader insights to the field of representation learning.

3. The performance is good. UniMRSeg achieves state-of-the-art performance across four segmentation tasks. Notably, the authors go beyond the conventional focus on average accuracy across modality combinations by additionally reporting standard deviation, which highlights the model's robustness and stability. This is a highly commendable and often overlooked aspect.

4. The ablation study is thorough. Tables 3–5 convincingly demonstrate the contribution of each component and the necessity of the three-stage training. Figure 5 is particularly effective in intuitively showing how each stage contributes to improved representation and segmentation quality.

Weaknesses:

1. The authors mainly focus on the analysis of modality combinations. It might be insightful to examine semantic importance: whether some modalities (e.g., FLAIR vs. T1ce) are inherently more informative or complementary than others.

2. Although UniMRSeg is evaluated across distinct domains (e.g., medical and natural scenes), the paper does not explore whether a model trained on one task (e.g., BraTS) could transfer to another task (e.g., RGB-D segmentation). A discussion of cross-task generalizability would further strengthen the work.

3. The current model does not explicitly infer which modalities are missing at test time. It would be helpful to understand whether this lack of modality-awareness could affect reliability in certain deployment settings.

Questions

See the Weaknesses part.

Limitations

Yes.

Final Justification

Thanks for the authors' positive response. It has addressed all of my concerns.

Formatting Issues

N/A.

Author Response

Dear Reviewer ife8,

Thank you very much for your valuable comments and suggestions, which are greatly helpful for improving both our current and future work. We address your questions as follows.


[W1, authors mainly focus on the analysis of modality combinations. It might be insightful to show the semantic importance. Whether some modalities (e.g., Flair vs T1ce) are inherently more informative or complementary to others.]

Thank you for this insightful comment. We agree that understanding the semantic importance of each modality is an important aspect of multi-modal learning. Our current work primarily focuses on evaluating robustness and generalization across arbitrary missing modality combinations, which explains our emphasis on combination-level performance (mean and standard deviation). However, we acknowledge that some modalities may be inherently more informative or complementary, especially in specific clinical tasks. In fact, as shown in Table 1 of the paper, we observe that combinations including FLAIR generally lead to higher Dice scores, indirectly suggesting its stronger contribution. This observation aligns with clinical knowledge where FLAIR is often more sensitive to edema and tumor boundaries. We will discuss the semantic importance in the final version.


[W2, Although UniMRSeg is evaluated across distinct domains (e.g., medical and natural scenes), the paper does not explore whether a model trained on one task (e.g., BraTS) could transfer to another task (e.g., RGB-D segmentation). A discussion of cross-task generalizability would further strengthen the work.]

We agree that cross-task generalization is an important topic, especially for unified segmentation models. In our current work, we design UniMRSeg to support task-specific but modality-relax generalization, i.e., within each segmentation domain (e.g., BraTS, SUNRGBD, STERE), the model handles all possible missing modality combinations using a single shared model. This is already a significant step toward practical unification, compared to previous works that require one model per combination. However, transferring a model trained on one task (e.g., brain tumor segmentation) to a fundamentally different task (e.g., RGB-D salient object segmentation) is non-trivial, due to large domain gaps in image appearance, semantics, and supervision forms. Such cross-task generalization is out of the current scope, but we agree it is a promising future direction. That said, our architecture is compatible with such a generalization setting. In future work, we plan to explore whether the shared encoder in UniMRSeg can be reused across domains via task-specific adapters or prompting mechanisms.


[W3, the current model does not explicitly infer which modalities are missing at test time. It would be helpful to understand whether this lack of modality-awareness could affect reliability in certain deployment settings.]

We would like to clarify that our UniMRSeg does not rely on explicit modality-type awareness during inference. To this end, we explicitly avoid designing an input pipeline that assumes modality-to-channel correspondence or requires modality labels. Instead, in Stage 1, we introduce a Random Modality Shuffle mechanism, where the order of input modalities is randomly permuted at each training iteration. This forces the model to learn modality-invariant representations rather than relying on fixed or predefined modality identity. In the main paper (Table 9), we also demonstrate that our model maintains stable segmentation performance across five different random shuffle orders during inference without any prior knowledge of modality types. This shows that the learned representation is robust to modality ambiguity, and the model behaves as if it had already inferred or become invariant to missing or shuffled modality types. Hence, while we do not adopt an explicit modality classification or selection module (which may introduce additional complexity or assumptions), our design intentionally achieves the goal of modality-agnostic reasoning in a simpler and more scalable way.

Comment

Thanks for the authors' positive response. It has addressed all of my concerns.

I also carefully reviewed the comments from the other reviewers, and I agree with Reviewer #sDPL’s opinion “I consider this explanation quite adequate, and such trade-offs are within an acceptable range for the sake of better performance ” on the three-stage design.

Taking into account both the paper and the rebuttal, I now better appreciate that UniMRSeg's three-stage design provides a more principled, stable, and interpretable optimization framework compared to a unified but chaotic single-stage setting. I am currently re-evaluating the overall contribution of this work and may consider further upgrading my rating.

Comment

Thank you very much for your recognition and feedback. Following your suggestions, we will enrich the content of our paper in the final version.

Official Review
Rating: 5

This paper introduces UniMRSeg, a unified framework for multi-modal image segmentation that handles missing modalities through a three-stage hierarchical self-supervised compensation strategy. The approach combines multi-granular modality reconstruction, modality-invariant contrastive learning, and incomplete modality adaptive fine-tuning with a reverse attention adapter. The authors evaluate their method on brain tumor segmentation, RGB-D object segmentation, and claim improved performance across all (missing) modality combinations.

优缺点分析

  • Strengths
    1. The paper clearly motivates the problem of missing-modality segmentation and the limitations of current complex inference setups. The unified framework leverages a single model shared across modality combinations, which is conceptually appealing and addresses a practical deployment concern.
    2. The reverse attention adapter design provides an interesting mechanism for identifying and compensating weak perceptual regions. Ablation also validates this design: removing the reverse attention mechanism causes a measurable drop (Table 4).
  • Weakness
    1. The training pipeline's complexity raises questions about its necessity. While the ablation studies (Tables 3 and 5) demonstrate that each stage contributes to overall performance, it remains unclear why these components cannot be integrated into a single, unified loss function for joint optimization.
    2. It is still not clear why fine-tuning of the encoder degrades performance (line 310-311). Could you please elaborate on that?
    3. NestedFormer does not handle missing modality; it is unclear how the benchmark is done (Table 1).

Questions

See the Weaknesses section.

Limitations

Yes.

Final Justification

  • The rebuttal clarified key concerns, particularly around the necessity of multi-staged training and benchmark settings
  • I agree with other reviewers that the performance improvement outweighs the complexity of the method

Formatting Issues

N/A

Author Response

Dear Reviewer PWbw,

Many thanks for your professional, detailed, and valuable review. We respond to your concerns one by one.


[W1, the training pipeline's complexity raises questions about its necessity. While the ablation studies (Tables 3 and 5) demonstrate that each stage contributes to overall performance, it remains unclear why these components cannot be integrated into a single, unified loss function for joint optimization.]

Thank you for your suggestion. Prior to submission, we actually conducted the single-model unified training experiment. However, we chose not to include it in the paper, as our intention was to help readers focus more clearly on the self-supervised benefits brought by each stage individually. This is because the distinction between multi-stage and single-stage training essentially reflects a difference between self-supervised representation learning and multi-task learning.

(1) Self-supervised Representation Learning (Three Stages)

Our work is built upon a self-supervised pretraining framework, with training initialized from scratch starting at Stage 1. For each task (e.g., BraTS 2020), self-supervised training is conducted solely on the task’s own training set, rather than relying on externally labeled datasets such as ImageNet-pretrained weights for initialization. Our three-stage design includes:

  • Stage 1: Pretrain both the encoder and decoder through arbitrary modality reconstruction.
  • Stage 2: Pretrain the encoder with contrastive learning. The segmentation task here is only used to guide the contrastive objective, not as a final goal.
  • Stage 3: Perform the downstream segmentation task.

This design explicitly aims to reduce the representation gap between complete and incomplete modalities in the encoder-decoder space during the final segmentation stage. If all these tasks are trained jointly in a single-stage model, it would fall into the scope of multi-task learning, rather than self-supervised pretraining.

(2) Multi-task Learning (Single Stage)

In multi-task learning, the goal is to leverage multiple complementary clues for collaborative learning. These clues can come from the data level [1,2,3], or from structural supervision types [4,5,6]. Most approaches often incorporate deliberate designs for task-sharing and task-specific components to enable effective joint prediction across tasks. In this work, if the three stages are forcibly merged into a single training stage, the following issues will arise:

  • The model is trained with inputs that involve Random Modality Dropout, Random Modality Shuffle, and Random Spatial Masking. The encoder is supervised using the NT-Xent contrastive loss, an adapter is incorporated, and the model is simultaneously tasked with both segmentation and reconstruction. Such a fully entangled training setup lacks a clear task hierarchy, and there is no explicit coordination among the input, encoder, decoder, and output.
  • When all tasks are treated equally without ordering, joint optimization becomes highly difficult. In the unified single-stage training attempt, the total loss involves six parts. The results on BraTS 2020 are as follows:

Setting | Whole Dice (%) ↑ | Core Dice (%) ↑ | Enhancing Dice (%) ↑ | Training Stability | Convergence
Three Stages (Ours) | 80.64 | 73.33 | 63.10 | ✓ Stable | ✓ Easy
Single Stage (Unified Loss) | 20.32 | 13.67 | 10.03 | ✗ Unstable | ✗ Poor

We observed the following phenomena during the experiments:

  • The single stage model failed to converge, with loss plateauing early.
  • The optimization process was highly unstable, with different losses fluctuating in turn and failing to decrease consistently together.

These phenomena and performance clearly indicate that unified single-stage training is unable to achieve effective coordination among the various designs and supervision signals. The lack of a clear training order, combined with competing optimization objectives, leads to mutual interference.

In summary, we would like to clarify two potential misconceptions:

  1. The three-stage training is not overly complex. Both theoretically and practically, it is a simpler and more stable design, especially when compared to directly mixing all objectives and modules in a single unified training phase, which leads to optimization difficulties.
  2. The three stages are not repeating the same task. Instead, they form a self-supervised learning (SSL) pipeline that does not rely on extra data, and progressively reduce the gap between incomplete and complete modalities at the input, feature, and output levels to improve segmentation performance.

References:
[1] Segment Anything, ICCV 2023.
[2] SegGPT: Towards Segmenting Everything in Context, ICCV 2023.
[3] Spider: A Unified Framework for Context-dependent Concept Segmentation, ICML 2024.
[4] Deep Multitask Learning with Progressive Parameter Sharing, ICCV 2023.
[5] MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders, ECCV 2024.
[6] Multi-Task Dense Prediction via Mixture of Low-Rank Experts, CVPR 2024.


[W2, It is still not clear why fine-tuning of the encoder degrades performance (line 310-311). Could you please elaborate on that?]

We respond from two perspectives: the inheritance between Stage 2 and Stage 3, and the design rationale of the lightweight reverse attention adapter.

(1) Inheritance between Stage 2 and Stage 3

Unlike general SSL methods (e.g., SimCLR, MoCo, MAE) that focus on learning generic representations without targeting specific downstream tasks, our contrastive learning in Stage 2 is task-aware. As stated in Lines 197–199, we co-train the segmentation head to guide the encoder toward learning modality-invariant features that are beneficial for segmentation, rather than general-purpose representations. This task-guided design ensures that the learned contrastive space aligns with the downstream segmentation objective. As shown in Table 3, this approach significantly boosts segmentation performance. Therefore, fine-tuning the encoder in Stage 3 would undermine the task-guided contrastive representations we carefully established.

(2) Lightweight Reverse Attention Adapter (RAA)

The RAA is designed as a residual correction to bridge the encoder representation gap between incomplete and complete modality inputs during Stage 3. Formally:

$f_{\text{inc}} + \mathcal{A}(f_{\text{inc}}) \approx f_{\text{com}},$

where:

  • $f_{\text{inc}}$: encoder features from incomplete modality input,
  • $f_{\text{com}}$: encoder features from the complete modality input,
  • $\mathcal{A}(\cdot)$: the learnable adapter.

During Stage 3, both $f_{\text{inc}}$ and $f_{\text{com}}$ are frozen, and only $\mathcal{A}$ is trained. This well-constrained setup ensures that the adapter focuses purely on compensating the missing information, enabling stable and efficient optimization. Once the encoder is unfrozen, all three components become variables, making their roles unclear and the optimization unstable. This is why the "finetune" setting in Table 4 leads to significant performance degradation.
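Under these definitions, the Stage 3 adapter optimization can be sketched roughly as follows (the adapter architecture, channel widths, and the MSE objective are illustrative assumptions; the encoder outputs are treated as frozen tensors):

```python
import torch
import torch.nn as nn

# Hypothetical lightweight adapter A(.): a residual bottleneck on frozen features.
adapter = nn.Sequential(
    nn.Conv3d(256, 64, kernel_size=1),   # assumed channel widths, for illustration
    nn.GELU(),
    nn.Conv3d(64, 256, kernel_size=1),
)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

def adapter_step(f_inc, f_com):
    """One optimization step: only the adapter is trainable; f_inc and f_com
    come from the frozen encoder (incomplete vs. complete modality inputs)."""
    f_inc, f_com = f_inc.detach(), f_com.detach()              # encoder stays frozen
    loss = torch.mean((f_inc + adapter(f_inc) - f_com) ** 2)   # f_inc + A(f_inc) ≈ f_com
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```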

In conclusion, freezing the encoder in Stage 3 is a deliberate and necessary design choice to preserve the task-guided contrastive representations learned earlier. We will make this motivation and its empirical evidence more explicit in the revised manuscript. Thank you again for the insightful question.


[W3, NestedFormer does not handle missing modality; it is unclear how the benchmark is done (Table 1).]

Thank you for pointing this out. NestedFormer is not designed for missing modality settings. We include it solely as a comparative baseline to highlight that segmentation methods designed for complete modalities perform significantly worse under missing modality conditions, due to the lack of explicit modeling for missing inputs. In contrast, all other methods in Table 1 are specifically designed to handle missing modality. We will clarify this purpose in the revised paper.


Comment

Thanks for the detailed rebuttal; my concerns regarding the complexity vs. performance trade-off are addressed. The authors show advantages of three-staged training vs. one unified loss. I have raised my score accordingly.

Comment

Thank you very much for your feedback and for taking the time to re-evaluate our work. We sincerely appreciate your thoughtful comments and your recognition of our efforts in the rebuttal. Your updated rating is truly encouraging for us.

Comment

Dear Reviewer PWbw,

Thanks for your constructive comments and valuable suggestions to improve this paper.

As we are entering the final days of the author-reviewer discussion period, we are not sure whether our rebuttals have addressed your concerns. We would greatly appreciate it if you could spare some time to reply.

Thank you again for your time and consideration.

Best regards,

The Authors

Official Review
Rating: 5

The paper proposes UniMRSeg, a unified modality-relax segmentation framework that addresses the issue of incomplete modalities in multi-modal segmentation through a hierarchical self-supervised compensation (HSSC) strategy. The framework bridges the gap between complete and incomplete modal representations across input, feature, and output levels. Specifically, it is implemented through: 1) multi-granularity modal reconstruction, 2) modal-invariant contrastive learning, 3) lightweight reverse attention adapters, and 4) hybrid consistency constraint fine-tuning. This approach eliminates the need to train dedicated models for different modal combinations, reducing deployment costs.

Strengths and Weaknesses

The proposed model demonstrates superior performance. The article is composed in a lucid and meticulous manner. The insights provided for improving segmentation performance in the multi-modal context are distinctive, and the paper's foundational premise is quite novel. The article presents a clear and rigorous approach to addressing the problem, which is highly commendable.

Though the article possesses certain merits, there remain some shortcomings. The motivation is clear, but the novelty of the proposed method may be lacking. The experiments are still insufficient, with too few comparative methods, which may not fully showcase the model's superior performance. Some of the figures may also suffer from unclear presentation.

Questions

  1. The article may lack certain novelty, such as in the design of the Perturbation and Mask components, as these techniques have likely been extensively explored in domains like computer vision, natural language processing, or time series analysis. The article does not appear to clearly highlight a more distinctive motivation or design, mainly extending existing technologies. Additionally, the incorporation of lightweight reverse attention adapters in the model may also lack sufficient innovation, so the core innovative aspects of the paper need to be better emphasized.
  2. The article requires improvement in the visual presentation aspects, such as in the rendering of Figure 1 and Figure 2. The figures currently contain limited effective information or may not fully reflect the model's own structure or characteristics, necessitating further refinement.
  3. The article should compare its proposed model against a more competitive set of baseline models, and expand the experimental scale to further elucidate the performance advantages of the model presented in the paper.

Limitations

The author mentions the limitations of the study in the article, mainly that the three-stage training strategy increases the complexity of the training process and introduces additional training overhead, which may pose challenges for practitioners with limited time or resources. I consider this explanation quite adequate, and such trade-offs are within an acceptable range for the sake of better performance. For other limitations, please refer to the Questions section.

Final Justification

The authors further elaborated on the innovation of the proposed method in its specific context, enabling me to gain a more thorough understanding of the article's efforts and its contribution to community development. In addition, the authors provided a detailed explanation of why the set of baseline comparison methods is limited and proactively offered improvement plans; on this basis, I am raising my rating.

Formatting Issues

The article has no formatting issues.

Author Response

Dear Reviewer sDPL,

Thank you for your time and careful review of our work. Below we address your questions and weaknesses mentioned:


[Q1 W1, the article may lack certain novelty, such as in the design of the Perturbation and Mask components, as these techniques have likely been extensively explored in domains like computer vision, natural language processing, or time series analysis. The article does not appear to clearly highlight a more distinctive motivation or design, mainly extending existing technologies. Additionally, the incorporation of lightweight reverse attention adapters in the model may also lack sufficient innovation, so the core innovative aspects of the paper need to be better emphasized.]

We understand the concern regarding the novelty of individual components. We would like to emphasize that the novelty of UniMRSeg does not lie in standalone algorithmic innovations, but rather in the hierarchical, task-driven integration of multiple self-supervised mechanisms, jointly optimized to address the unique challenges of modality-relax segmentation. Specifically:

(1) Hierarchical Self-Supervised Compensation (HSSC) Framework

(as already acknowledged and appreciated in your previous review)

UniMRSeg unifies pixel-level modality reconstruction, feature-level contrastive learning, and prediction-level label distillation into a single annotation-free compensation framework. Unlike previous works that treat self-supervised tasks in isolation, HSSC employs a three-stage curriculum, jointly optimized with a single shared model per task, enabling robust generalization across arbitrary missing-modality combinations.

(2) Hybrid Data Perturbation

While data perturbation techniques have been explored in various tasks, we design a hybrid perturbation mechanism specifically tailored to the modality-relax setting:

  • Random modality dropout simulates missing inputs;
  • Modality shuffle removes the reliance on fixed modality-channel mapping, encouraging modality-agnostic encoding;
  • Spatial masking mimics local degradation.

As discussed in Lines 585–606 and validated in Table 9, conventional models depend on strict input order and modality metadata, which limits their robustness in real-world scenarios. Our strategy removes these assumptions, improving deployment flexibility and generality.
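For concreteness, a minimal sketch of such a hybrid perturbation on a batch of stacked modalities is given below; the probabilities, patch size, and tensor layout are illustrative assumptions, not our actual settings:

```python
import torch

def hybrid_perturb(x, p_drop=0.3, mask_ratio=0.4, patch=8):
    """Sketch of input perturbation on x: (B, M, H, W) with M modality channels.
    Combines random modality dropout, modality shuffle, and spatial masking;
    assumes H and W are divisible by `patch`."""
    B, M, H, W = x.shape
    # (1) random modality dropout: zero each modality with prob p_drop,
    #     keeping at least one modality per sample
    keep = (torch.rand(B, M) > p_drop).float()
    keep[keep.sum(dim=1) == 0, 0] = 1.0
    x = x * keep[:, :, None, None]
    # (2) random modality shuffle: permute the modality-channel order
    x = x[:, torch.randperm(M)]
    # (3) random spatial masking: zero out a fraction of patch-aligned regions
    mask = (torch.rand(B, 1, H // patch, W // patch) > mask_ratio).float()
    mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask
```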

(3) Lightweight Reverse Attention Adapter (RAA)

The RAA is designed as a residual correction to bridge the encoder representation gap between incomplete and complete modality inputs during Stage 3. Formally:

$f_{\text{inc}} + \mathcal{A}(f_{\text{inc}}) \approx f_{\text{com}},$

where:

  • $f_{\text{inc}}$: encoder features from the incomplete modality input,
  • $f_{\text{com}}$: encoder features from the complete modality input,
  • $\mathcal{A}(\cdot)$: the learnable adapter.

During Stage 3, both $f_{\text{inc}}$ and $f_{\text{com}}$ are frozen, and only $\mathcal{A}$ is trained. This well-constrained setup ensures that the adapter focuses purely on compensating the missing information, leading to stable and efficient optimization.

To further enhance the adapter’s ability to capture under-represented semantic cues, we apply reverse attention to amplify the residual signal. In this way, RAA can focus on the non-salient or poorly attended regions that conventional self-attention may overlook.

Overall, our original intention was not to pursue isolated innovations on specific components or modules by introducing increasingly "fancy" designs in the hope of achieving strong results for unified modality-relax segmentation. While such improvements may be valuable directions for future work, they are not the focus of this paper. Instead, this work aims to demonstrate that by strictly adhering to the motivation of Hierarchical Self-Supervised Compensation (HSSC) and designing around it in a principled manner, UniMRSeg is able to achieve strong and consistent performance across tasks. We hope that the motivation and explanation above can be well understood and appreciated.


[Q2 W3, the article requires improvement in the visual presentation aspects, such as in the rendering of Figure 1 and Figure 2. The figures currently contain limited effective information or may not fully reflect the model's own structure or characteristics, necessitating further refinement. ]

Thank you for your suggestion. In the revised version, we will improve Figure 1 and Figure 2 as follows:

(1) Figure 1

We will split the current figure into two subfigures for clarity:

  • Figure 1a: A clear, labeled block diagram of the full architecture, including the encoder, decoder, perturbation module, contrastive learning module, and reverse attention adapter.
  • Figure 1b: The benchmark box plot will be retained, with expanded labels and annotations to improve readability.

(2) Figure 2

We will restructure Figure 2 to better highlight the three-stage training pipeline and the interconnections between stages, including:

  • Inputs and outputs at each stage;
  • The flow of gradients and supervision signals (e.g., reconstruction, contrastive, and segmentation losses);
  • The dependency and progression between stages, as discussed in Response to [Q1 W1], emphasizing how Stage 1 prepares robust features, Stage 2 aligns them with segmentation objectives, and Stage 3 adapts to incomplete modalities using frozen representations.

[Q3 W2, the article should compare its proposed model against a more competitive set of baseline models, and expand the experimental scale to further elucidate the performance advantages of the model presented in the paper.]

Thank you for your suggestion. We fully agree that strong baselines and broad evaluations are essential to demonstrate the effectiveness and generality of our proposed method.

(1) On Baseline Competitiveness

We would like to emphasize that Tables 1, 2, 6, and 7 already include state-of-the-art and highly competitive methods published in top-tier conferences/journals, such as:

  • PASSION (MM 2024)
  • GateNet (IJCV 2024)
  • EonTRINet (PAMI 2024)
  • MaskMentor (MM 2024)
  • M3FeCon (MICCAI 2024)

These methods represent the most advanced techniques in multi-modal/ missing modality segmentation. Most importantly, they are among the few methods with publicly released code and pre-trained models, enabling fair and reproducible comparisons under consistent settings.

While other recent approaches do exist, many do not provide released implementations, making standardized evaluation infeasible. We are cautious about comparing only reported numbers without aligned settings.

In order to fully address your suggestion and further strengthen the persuasiveness of our results, we report a performance comparison with the recently published model LS3M [1] (CVPR 2025, code not yet released). Although its implementation is unavailable, the reported results on BraTS 2018 are relatively aligned with our evaluation setup, allowing for an approximate comparison.

Method | Whole Dice (%) ↑ | Core Dice (%) ↑ | Enhancing Dice (%) ↑
LS3M | 88.01 | 79.28 | 63.31
UniMRSeg | 88.90 | 80.67 | 67.02

[1] Incomplete Multi-modal Brain Tumor Segmentation via Learnable Sorting State Space Model, CVPR 2025.

(2) On Evaluation Scope

UniMRSeg has been evaluated across diverse domains, including:

  • Medical segmentation: BraTS 2018, BraTS 2020
  • RGB-D semantic segmentation: SUN-RGBD
  • RGB-D salient object detection: STERE
  • RGB-T salient object detection: VT1000

Each dataset is tested under all possible missing modality combinations, with a single unified model per task, to assess both task-level and modality-level generalization. The scope of evaluation is broader than in most previous works.

(3) Planned Expansions in Final Version

To further strengthen the evaluation, we will:

  • Include comparisons with generalist segmentation models (e.g., SAM2, SegGPT) under in-context setups using prompt masks.
  • Extend the evaluation to additional RGB-D SOD and SS datasets, such as SIP, NJUD, NLPR, and NYUD.

In summary, our current evaluation already reflects a wide and competitive scope. We are committed to further expanding the benchmarks and baselines to reinforce the value of our work. Thank you again for the constructive suggestion.


[Limitations: The author mentions the limitations of the study in the article, mainly that the three-stage training strategy increases the complexity of the training process and introduces additional training overhead, which may pose challenges for practitioners with limited time or resources. I consider this explanation quite adequate, and such trade-offs are within an acceptable range for the sake of better performance.]

Thank you very much for acknowledging our explanation regarding the three-stage training strategy. We also kindly refer you to our response to Reviewer PWbw’s [W1], where we further elaborate on the necessity of the three-stage pipeline and demonstrate its advantages over a unified single-stage alternative. We appreciate your understanding of the trade-off between performance and complexity.

Comment

Thanks for the author's positive response. It has addressed all my concerns. The effort put into such a detailed explanation is highly appreciated, and I have improved my overall rating.

Comment

Thank you very much for your feedback and for taking the time to re-evaluate our work. We sincerely appreciate your thoughtful comments and your recognition of our efforts in the rebuttal. Your updated rating is truly encouraging for us.

Final Decision

The rebuttal addressed most concerns raised by the reviewers, e.g., regarding innovation and complexity. Three reviewers are quite certain that they land on the positive side. The remaining reviewer originally gave a negative review but did not participate in any discussion and did not appear to read the rebuttal; the weight of this review is therefore largely reduced. The AC recommends acceptance of the paper. The authors are required to include the discussions from the rebuttal in the final version.