PaperHub
Overall score: 6.8/10
Poster · 4 reviewers
Ratings: 3, 5, 5, 4 (min 3, max 5, std 0.8)
Average confidence: 3.8
Novelty 2.5 · Quality 3.0 · Clarity 3.0 · Significance 2.5
NeurIPS 2025

LoMix: Learnable Weighted Multi-Scale Logits Mixing for Medical Image Segmentation

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

Learnable Weighted Multi‑Scale Logits Mixing

Abstract

U‑shaped networks output logits at multiple spatial scales, each capturing a different blend of coarse context and fine detail. Yet, training still treats these logits in isolation—either supervising only the final, highest‑resolution logits or applying deep supervision with identical loss weights at every scale—without exploring *mixed‑scale* combinations. Consequently, the decoder output misses the complementary cues that arise only when coarse and fine predictions are fused. To address this issue, we introduce LoMix ($\underline{Lo}$gits $\underline{Mix}$ing), a Neural Architecture Search (NAS)‑inspired, differentiable plug-and-play module that **generates** new mixed‑scale outputs and **learns** how exactly each of them should guide the training process. More precisely, LoMix mixes the multi-scale decoder logits with four lightweight fusion operators: addition, multiplication, concatenation, and attention-based weighted fusion, yielding a rich set of synthetic “mutant’’ maps. Every original or mutant map is given a softplus loss weight that is co‑optimized with network parameters, mimicking a one‑step architecture search that automatically discovers the most useful scales, mixtures, and operators. Plugging LoMix into recent U-shaped architectures (i.e., PVT‑V2‑B2 backbone with EMCAD decoder) on the Synapse 8‑organ dataset improves DICE by +4.2% over single‑output supervision, +2.2% over deep supervision, and +1.5% over equally weighted additive fusion, all with **zero** inference overhead. When training data are scarce (e.g., one or two labeled scans, 5% of the trainset), the advantage grows to +9.23%, underscoring LoMix’s data efficiency. Across four benchmarks and diverse U-shaped networks, LoMix improves DICE by up to +13.5% over single-output supervision, confirming that learnable weighted mixed‑scale fusion generalizes broadly while remaining data efficient, fully interpretable, and overhead-free at inference. 
Our implementation is available at https://github.com/SLDGroup/LoMix.
Keywords
LoMix Supervision · Medical Image Segmentation · U-shaped Network · Combinatorial Multi-scale Fusion · Logits Mixing

Reviews & Discussion

Review (Rating: 3)

This paper introduces LoMix (Logits Mixing), a novel training module designed to improve the performance of U-shaped networks in medical image segmentation. It assigns a learnable weight to every loss term.

Numerous experiments were conducted on datasets such as Synapse Multi-organs, ACDC cardiac organs, and BUSI. They also tested their method using limited data (5%, 10%, etc.), and the experiments demonstrated promising results.

Moreover, a comprehensive ablation study was performed on the fusion operation, the loss term, and the backbone, showing that every module contributes.

Strengths and Weaknesses

Strengths:

  1. The paper is well written, and they conducted many experiments.
  2. I like the ablation part. They did thorough ablation studies: different fusion operations, fixed loss term vs learnable loss term, different backbones.

Weaknesses:

  1. Confusing comparison: the author compares their method with UNet, SwinUNet, TransUNet, PVT-CASCADE, and others (10+ methods) for no specific reason. Since the proposed method is built upon PVT-EMCAD, the only truly relevant baseline is PVT-EMCAD; the proposed method's worst-case performance should be equivalent to PVT-EMCAD's (i.e., when all irrelevant loss terms become 0). The inclusion of other methods is unnecessary, as these exact comparisons were already published in the original PVT-EMCAD paper; the corresponding tables in the two papers are identical.

  2. The novelty of this paper is limited. There are two modules introduced in this paper: multi-scale feature (logit) fusion and learnable loss weights. Both concepts are widely used in computer vision. The first module looks to me like a late-layer fusion module (add, concatenation, multiplication, attention-weighted fusion), and the second is learnable loss weights. All these modules are off-the-shelf. It is hard to call this the "first approach to enable fully learnable multi-scale fusion in segmentation" (line 121).

  3. Missing comparison: Since one of the major contributions is introducing learnable weights, they should compare their NAS-based loss weights with other dynamic loss methods. Is there any specific reason or technical barrier that prevents such a comparison?

Questions

  1. Clarify the effectiveness of the different modules: how do learnable weights compare with other dynamic loss-weighting methods? How are your late-fusion methods different from others? My opinion would improve if the authors could experimentally differentiate their work from methods that use similar individual components.

  2. Carefully choose your comparison methods. It is very confusing to include so many irrelevant methods. In fact, the ablation study should be the main result, and the main table should be written in this way: UNet / UNet + other fusion methods / UNet + ours; PVT-EMCAD / PVT-EMCAD + other fusion methods / PVT-EMCAD + ours. I would improve my rating if the authors could improve the main results table.

  3. Argue more about the novelty of the module. In my opinion, applying long-standing, off-the-shelf methods to a classical problem can hardly be considered novel research for NeurIPS; it is definitely not the "first approach to enable fully learnable multi-scale fusion in segmentation". Please see [1] and [2]; they are very classical papers in computer vision, written nearly 8 years ago. I would suggest citing these papers in the related work and arguing more about the differences between the proposed method and these classical methods. My rating could be improved if the authors could convince me of the novelty of this paper.

[1] Liu, Shu, et al. "Path aggregation network for instance segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[2] Zhao, Qijie, et al. "M2det: A single-shot object detector based on multi-level feature pyramid network." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.

Limitations

yes

Final Justification

I commend the author for their detailed experiment and ablation study. The author attempts to introduce a multi-scale loss term with learnable weights, referred to as LoMix, into the training process.

Although the author claims to solve a "single task (segmentation) with dozens of highly correlated, multi-scale logit losses synthesized using our multi-operator combinatorial fusion," they do not provide a specific design for this process.

The entire process consists of setting the weights as nn.Parameters. The only constraint is to ensure the weights are positive (enforced via a softplus function), which is arguably a default requirement when dealing with loss values.

For this reason, I would give a rating of 3.

Formatting Issues

no

Author Response

We sincerely thank the reviewer for the constructive feedback. Our responses are given below.

Q1.1. How do learnable weights compare with other dynamic loss weights methods?

Response: To our knowledge, there is no prior method that dynamically balances dozens of multi‑scale, single‑task supervision losses the way LoMiX does. Prior work falls into two categories:

  • Deep supervision in segmentation (fixed or hand‑scheduled weights): Classical deeply‑supervised/multi‑scale segmentation papers use side heads but keep their weights constant or manually decayed, e.g., U‑Net++ [7] and DSN [8]. None of these adapt weights data‑dependently across combinatorial multi-scale heads.
  • Dynamic loss weighting designed for multiple (heterogeneous) tasks: GradNorm [9] and Uncertainty weighting [10] address multi‑task learning where the number of losses is small (typically 2–5). Their normalization schedules or gradient‑matching heuristics may not scale well to dozens of highly correlated single‑task losses (one per fusion path).

LoMiX fills this gap by reformulating the single‑task, multi‑scale supervision as a multi-operator, combinatorial, fully learnable problem at logit/loss level. Each fusion path gets its own Softplus weight, so the optimizer can automatically suppress the unhelpful paths and emphasize the useful ones. In our ablations, fixed/uniform weights baselines underperform our NAS‑inspired weighting significantly (Fig. 4 and Supplementary Table S7). If a truly comparable single‑task, multi‑scale dynamic weighting method exists, we will gladly cite and benchmark it.
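The per-path weighting described above can be summarized in a short PyTorch sketch; the class name and shapes are our own illustration, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathLossWeights(nn.Module):
    """Illustrative sketch (not the authors' code): one raw learnable
    scalar per supervision path; softplus keeps each weight positive."""

    def __init__(self, num_paths: int):
        super().__init__()
        # Raw, unconstrained weights, co-optimized with the network.
        self.raw = nn.Parameter(torch.zeros(num_paths))

    def forward(self, path_losses):
        # path_losses: iterable of scalar losses, one per original or
        # fused logit map; softplus(raw) > 0 for every path.
        w = F.softplus(self.raw)
        return sum(wi * li for wi, li in zip(w, path_losses))
```

Because each weight is an independent positive scalar with no normalization constraint, the optimizer is free to drive unhelpful paths toward zero while enlarging useful ones.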


Q1.2. How is LoMiX different from other late-fusion methods?

Response: LoMiX differs from late fusion in the following ways:

  • Where/when it operates: We fuse only during training, thus having zero extra parameters or latency overhead at inference. Late‑fusion papers [1,2] keep their fusion blocks at inference, thus adding overhead.
  • What is fused: We fuse multi-scale decoder logits, not high‑dimensional features, using multiple operators (Add, Mult, Concat, AWF) over all non‑empty subsets, yielding a combinatorial, diverse set of supervision signals rather than a single fixed fusion path.
  • Why we fuse: Our goal is a stronger, more diverse supervision, i.e., an implicit ensemble that improves gradients (especially in data‑scarce regimes), not a deeper architecture.
  • How do we optimize: Each original/fused logit gets a NAS‑inspired, Softplus loss weight learned end‑to‑end; prior works [1,2] do not auto‑weight dozens of fusion paths.
  • Empirical evidence: Operator/subset ablations (Fig. 3 and Supplementary Table S6) and weight‑dynamics plots (Figs. S1-S8) show that different fusion types/resolutions do matter, and the learned weights self-rebalance over training.

To sum up, LoMiX is not a late-fusion block with off-the-shelf operations; instead, it is a new combinatorial, automatically weighted supervision methodology applied at training time. This clearly differentiates it from methods that use similar components (and names) but with completely different purposes.
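For concreteness, the combinatorial enumeration described above can be sketched as follows (our own illustrative code, showing only the Add operator; the paper additionally uses Mult, Concat, and AWF):

```python
from itertools import combinations

import torch
import torch.nn.functional as F

def additive_fused_logits(logits):
    """Illustrative sketch: fuse every subset of the L decoder logit
    maps with at least two members (2^L - L - 1 subsets) by element-wise
    addition after upsampling to the finest spatial resolution. The
    other operators (Mult, Concat, AWF) follow the same enumeration."""
    L = len(logits)
    target = logits[-1].shape[-2:]  # finest (highest-resolution) map
    up = [F.interpolate(x, size=target, mode="bilinear",
                        align_corners=False) for x in logits]
    fused = []
    for k in range(2, L + 1):
        for subset in combinations(range(L), k):
            fused.append(sum(up[i] for i in subset))  # "Add" operator
    return fused
```

For a four-stage decoder (L = 4) this yields 11 extra supervision maps per operator; each map then receives its own learnable loss weight and is discarded at inference.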


Q2 & W1. Comparison methods and restructuring main table.

Response: The confusion stems from considering LoMiX a PVT‑EMCAD add‑on. In fact, LoMiX is a new architecture‑agnostic supervision methodology applied at training time, so we must show that it works across multiple networks, not just one.

Fig. 5 and Table S8 already show that LoMiX scales well across different CNN/Transformer encoders with the EMCAD decoder. To restructure Table 1 as a per-network progression, we ran new experiments with different supervision schemes across representative CNN, MLP, and Transformer-hybrid networks. The results show that LoMiX achieves the best DICE for every network (see Table R6). We will add the fully restructured Table 1 in the revised paper, directly showing what LoMiX adds beyond existing supervision schemes.

Table R6: Synapse 8-organ segmentation with Last Layer (LL), Deep Supervision (DS) [7], MUTATION [18], and LoMiX. DICE scores (%) are reported for the Aorta, Gallbladder (GB), Left kidney (LK), Right kidney (RK), Liver, Pancreas (PC), Spleen (SP), and Stomach (SM). Due to space limits, we report results for only three networks here.

| Methods | Avg. DICE | Avg. HD95 | mIoU | Aorta | GB | LK | RK | Liver | PC | SP | SM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet + LL | 70.1 | 44.7 | 59.4 | 84.0 | 56.7 | 72.4 | 62.6 | 87.0 | 48.7 | 81.5 | 67.9 |
| + DS | 77.8 | 26.9 | 68.3 | 85.4 | 68.0 | 81.4 | 76.2 | 91.4 | 56.9 | 87.6 | 75.9 |
| + MUTATION | 81.5 | 26.4 | 71.8 | 89.4 | 70.5 | 85.4 | 80.4 | 94.1 | 66.3 | 88.3 | 77.5 |
| + LoMiX (Ours) | 83.6 | 24.3 | 74.6 | 90.4 | 75.3 | 86.3 | 82.5 | 94.3 | 67.9 | 91.8 | 80.2 |
| TransUNet + LL | 77.6 | 26.9 | 67.3 | 86.6 | 60.4 | 80.5 | 78.5 | 94.3 | 58.5 | 87.1 | 75.0 |
| + DS | 82.7 | 17.3 | 73.5 | 86.6 | 68.5 | 87.7 | 84.6 | 94.4 | 65.3 | 90.8 | 83.5 |
| + MUTATION | 83.0 | 17.0 | 73.9 | 89.3 | 63.7 | 86.9 | 83.0 | 95.5 | 69.6 | 93.1 | 82.7 |
| + LoMiX (Ours) | 83.6 | 16.6 | 74.6 | 88.9 | 70.3 | 89.4 | 85.2 | 94.8 | 67.8 | 89.4 | 83.0 |
| PVT-EMCAD-B2 + LL | 80.9 | 22.9 | 71.2 | 87.1 | 68.0 | 84.9 | 81.1 | 94.6 | 63.1 | 89.8 | 78.9 |
| + DS | 82.9 | 19.7 | 73.8 | 87.4 | 67.8 | 87.7 | 83.7 | 95.2 | 65.6 | 91.5 | 84.2 |
| + MUTATION | 83.6 | 15.7 | 74.7 | 88.1 | 68.9 | 88.1 | 84.1 | 95.3 | 68.5 | 92.2 | 83.9 |
| + LoMiX (Ours) | 85.1 | 14.9 | 76.4 | 88.8 | 73.5 | 89.1 | 84.7 | 95.8 | 69.7 | 92.5 | 86.5 |

Q3 & W2. Novelty and claim of this paper...

Response: We respectfully disagree with this assessment, as we propose:

  • A new supervision methodology, not a new architecture. Prior works [1, 2] aggregate high-dimensional features inside the network and keep those aggregation steps at inference. In contrast, LoMiX operates at the logit/loss level, during training only. We synthesize intermediate resolution prediction maps (class logits) from existing multi-stage/resolution decoder outputs only during training, thus having zero extra parameters and latency overhead at inference. Our focus is to optimize supervision for any U-shaped network.
  • Operator-diverse combinatorial logit fusion: Prior works [1,2] fuse a fixed set of multi-scale wide feature tensors. In contrast, LoMiX systematically considers all non‑empty subsets of multi-scale decoder logits and applies a set of operators (Add, Mult, Concat, AWF) to generate a rich set of intermediate resolution predictions. This exhaustive, resolution- and operator-diverse logit space has never been explored for segmentation supervision. This makes our contribution unique and worthwhile.
  • Fully automatic, NAS-inspired loss weighting: Rather than hand‑tuning or using uniform loss weights, we introduce a NAS‑inspired Softplus weighting mechanism that learns how much each original/fused logit should contribute to the loss. This way, dozens of fusion paths are jointly optimized, thus making the combinatorial space tractable.
  • Data-scarce medical segmentation: Our largest gains appear where supervision quality matters most, i.e., in limited-data settings which is common in practical medical imaging. Prior work [1,2] targets large-scale natural-image detection with abundant labels. We show that rethinking supervision (instead of architecture) yields consistent DICE improvements across diverse medical datasets/networks with limited annotated data.
  • Thorough analyses: We provide operator/subset ablations (Fig. 3 and Supplementary Table S6), weight-dynamics curves (Supplementary Figs. S1-S8), per-organ results (Tables 1-2, and Supplementary Tables S1, S3, S5-S8), and cross-architecture studies (Fig. 5 and Supplementary Table S8). This demonstrates generality and interpretability rather than a simple on/off tweak.

To mitigate reviewer concerns, we will discuss [1,2] in the related work, and rephrase the claim in line 121: "In contrast, to the best of our knowledge, LoMix is the first approach to enable fully learnable multi-scale logit fusion for improved supervision in segmentation."


W3. Compare NAS-based loss weights with other dynamic loss methods.

Response: The "dynamic loss" schemes (GradNorm [9], Uncertainty weighting [10], etc.) target multi‑task settings with a few heterogeneous losses. In contrast, our LoMiX tackles a single task (segmentation) with dozens of highly correlated, multi‑scale logit losses synthesized using our multi-operator combinatorial fusion. Consequently, we designed a new NAS‑inspired Softplus weighting mechanism that (i) assigns each path an independent positive scalar, (ii) requires no global normalization or manual weights, and (iii) scales well to many multi-scale loss signals, which is exactly what we need in our setup. To keep the evaluation focused, we compared against the closest relevant baselines for our setting: multi‑scale/deep supervision with fixed or uniform weights (e.g., deep supervision [7], MUTATION [18]). Those baselines directly isolate the effect of making supervision learnable, which is our core contribution.

In principle, there is no fundamental barrier to plugging in multi‑task dynamic methods; however, they are simply not the right comparisons for a single‑task, multi-scale loss supervision problem. If the reviewer still deems such a comparison essential, we will include it in the revised paper.


References

[1] Liu, S., et al. Path aggregation network for instance segmentation. CVPR, 2018.

[2] Zhao, Q., et al. M2det: A single-shot object detector based on multi-level feature pyramid network. AAAI, 2019.

[7] Zhou, Z., et al. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. TMI, 2019.

[8] Lee, C.Y., et al. Deeply-supervised nets. AISTATS, 2015.

[9] Chen, Z., et al., Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. ICML, 2018.

[10] Kendall, A., et al. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. CVPR, 2018.

Comment

Dear Reviewer h4U4,

Thank you for your thoughtful review and constructive feedback. We have addressed each of your concerns in our rebuttal:

  • Clarified novelty (see Q1, Q3)

    • Differentiated LoMiX from other dynamic weighting (see Q1.1) and late-fusion methods (see Q1.2).
    • Stated explicitly that LoMiX is the first fully learnable, multi-operator, combinatorial logit-fusion supervision framework (see Q3).
    • Rephrased the main claim (see Q3)
  • New cross-architecture experiments and restructured Table 1 (see Q2)

    • Conducted new experiments with LoMiX on UNet, TransUNet, PVT-EMCAD, and other networks.
    • Table 1 now groups results by network, clearly showing LoMiX’s gains over Last-Layer, Deep Supervision, and MUTATION for every network (see Table R6 above).
  • Additional analyses and citations

    • Conducted new preliminary experiments on 3D segmentation to show the feasibility of LoMiX (see Table R4 above).
    • Conducted new experiments to show the cross-dataset generalizability (see Table R5 above).
    • Conducted a statistical test to confirm the significance of our results (see Table R2 above).
    • Cited and contrasted late-fusion detectors (PANet [1] and M2det [2]) in Related Work.

We hope these responses and revisions fully resolve your concerns. We are looking forward to your post-rebuttal feedback and would be happy to address any additional concern.

Sincerely,

The Authors

Comment

Thanks for the detailed explanation. The author provides a clear explanation of the late fusion part (which is not a new module).

However, I still find it hard to identify the fundamental novelty of the learnable loss-term weights. Basically, what the author introduces in this paper is a method to compute $\mathrm{TotalLoss} = w_1 \cdot \mathrm{loss}_1 + w_2 \cdot \mathrm{loss}_2 + w_3 \cdot \mathrm{loss}_3 + \dots$, where the $w_i$ are learnable parameters (see Eq. 6 and Eq. 9), and the losses come from off-the-shelf operators. The whole process is trivial.

On the other hand, even though the author claims to be handling a 'single task (segmentation) with dozens of highly correlated, multi-scale logit losses synthesized using our multi-operator combinatorial fusion,' the weighting process does not provide a specific design for that. The author just treats the problem as a multi-loss problem, regardless of whether the losses are 'highly correlated' or 'multi-scale.'

(i) It 'assigns each path an independent positive scalar': since the paper deals with multiple loss terms, an independent positive scalar is the default option. (ii) It 'requires no global normalization or manual weights': this is basically what nn.Parameters do. (iii) It 'scales well to many multi-scale loss signals': I cannot see a specific multi-scale loss-weight design in Eq. 6, 7, 8, or 9.

Overall, based on the parts of "A new supervision methodology", "Operator-diverse combinatorial logit fusion", and "Thorough analyses", I would raise my rating to 3. Still, I think the whole framework is trivial for NeurIPS.

Comment

Dear Reviewer h4U4,

Thank you for your feedback. We are happy to have clarified the late-fusion issue. Regarding novelty, the learnable weighting mechanism we propose is a new end-to-end, NAS$^a$-inspired search mechanism over the loss weights, with the following benefits:

  • Dynamic, data-driven approach: During training, the optimizer searches the entire weight space: early epochs assign larger (NAS-inspired Softplus) weights to high-level (coarse) original or fused decoder logits; later epochs shift weight toward the low-level (fine-scale) logits. This dynamic focus cannot be matched by the fixed, convex (sum-to-one), or schedule-based weights used by standard approaches, as shown by our results (see Rebuttal Table R6 above, Figures 4-5, and Supplementary Tables S7-S8).

  • Enhanced diversity: Our fusion step produces $2^L - L - 1$ multi-scale fused logits for each operator. Learning their weights dynamically selects the most complementary signals and avoids redundancy, thanks to the NAS-inspired approach. Consequently, our LoMiX consistently outperforms classical deep supervision [7] and uniformly weighted MUTATION [18].

  • Principled, safe design: Each weight is passed through a NAS-inspired Softplus gate (Equation 6), thus guaranteeing a positive value. This simple constraint keeps optimization stable while letting useful original or fused logits dominate.

  • Zero inference cost, broad utility: LoMiX adds no layers or latency during inference, but improves every network we tested (e.g., UNet, Attention UNet, TransUNet, PVT-CASCADE, UNeXt, PVT-EMCAD). It is therefore a practical, widely applicable supervision method rather than a trivial tweak; virtually any SOTA network can adopt and benefit from our idea.

In summary, we turn the selection from a combinatorial set of multi-scale logits into a NAS-inspired learnable weighting problem that can be solved very efficiently during training, something prior methods do not offer. For all these reasons, we believe our idea is a solid, non-trivial contribution worthy of NeurIPS.

$^a$By design, Neural Architecture Search (NAS) automatically finds the best architecture for a specific task [11-13].

[11] Liu, H., et al. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

[12] Elsken, T., et al. Neural architecture search: A survey. JMLR, 2019.

[13] Li, G., et al. Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities. T-PAMI, 2024.

Thank you again for your constructive review. We would be happy to provide any additional information.

Sincerely,

The Authors

Comment

Dear Reviewer h4U4,

Thank you once again for your time in reviewing our paper. Based on the clarifications provided, we believe your concerns about the learnable weighting have been fully addressed. We would be happy to provide any additional information.

Sincerely,

The Authors

Review (Rating: 5)

This paper proposes the LoMix (Logits Mixing) method, which significantly improves the performance of medical image segmentation through a learnable multi-scale feature fusion strategy. Its core innovation lies in the introduction of four lightweight fusion operations (addition, multiplication, concatenation, and attention-weighted fusion) combined with a NAS-style weight learning mechanism to dynamically optimize the contributions of features at different scales. Experiments show that LoMix achieves significant performance gains across multiple datasets and backbone networks, and performs particularly well in data-scarce scenarios. The method adds no inference overhead, is efficient and versatile, and provides a new technical direction for medical image segmentation.

Strengths and Weaknesses

Strengths:

  1. A learnable multi-scale Logits mixing strategy is proposed for the first time, which dynamically combines the decoder outputs of different scales through four lightweight fusion operations to generate rich "mutant" feature maps.
  2. A NAS-style weight learning mechanism is introduced to automatically optimize the loss weights for different fusion methods and scales, without the need for manual parameter tuning or additional validation set search.
  3. The proposed method achieves a plug-and-play design with zero inference overhead, introducing computation only in the training phase without affecting the model inference speed or memory footprint, which is suitable for practical deployment.
  4. The proposed method provides a certain degree of generality and interpretability. It is compatible with CNN and Transformer backbone networks (e.g., PVT, ResNet), and the learned weights can intuitively explain the contributions of different scales.
  5. The effectiveness of the proposed method is fully validated in multiple sets of experiments conducted on multiple datasets. The experimental results show that the proposed method outperforms SOTA methods on several medical image segmentation tasks.

Weaknesses:

  1. The paper does not discuss the convergence of multi-scale fusion optimization. For example, are the weights of all fusion operations eventually stable? Are there redundancies?
  2. Different image modalities (e.g., ultrasound vs. CT) may require different fusion strategies, but the paper does not discuss parameter migration or modality adaptation approaches.
  3. Medical image analysis often requires 3D segmentation, but the paper has not yet verified the scalability and capability of LoMix on 3D architectures.
  4. The experiments were based only on publicly available benchmarks, and the generalization ability of LoMix on the data from different acquisition devices or hospitals has not been tested, which may conceal the risk of overfitting.

Questions

See Weaknesses 1 and 2

Limitations

See Weaknesses 3 and 4

Final Justification

The authors have addressed most of my concerns. Regarding the generalization ability of LoMix, although the results in Table R5 show overfitting, the performance of the method proposed in the paper is still superior to other methods. After comprehensively considering the authors' responses and the quality of the paper itself, I am inclined to maintain my initial score.

Formatting Issues

None

Author Response

We sincerely thank the reviewer for the constructive feedback. Below, we address each concern and outline how we plan to incorporate these suggestions in our revised manuscript.

Q1: The paper does not discuss the convergence of multi-scale fusion optimization. For example, are the weights of all fusion operations eventually stable? Are there redundancies?

Response: Our multi‑scale fusion does converge and self‑suppresses redundancies. As the Softplus weights are co‑optimized with the segmentation model, absolute values drop with the overall loss, so after a short warm‑up their relative ordering stabilizes (see Supplementary Figs. S1–S8). Two consistent patterns emerge:

  • By scale: Low‑resolution, high‑level logits (early decoder stages) are driven to relatively much smaller weights, while high‑resolution, fine‑grain logits (later stages) retain larger weights, indicating that less informative logits are automatically down‑weighted.

  • By operator: AWF and Add fusions receive comparatively higher weights, Mult gets moderate weights, and Concat gets the lowest weights in later epochs (see Supplementary Fig. S1–S8). Still, even these low‑weighted Concat fused logits matter; our ablations (Fig. 3 and Supplementary Table S6) show that they supply complementary segmentation cues that improve DICE despite their small scalar values.


Q2. Different image modalities (e.g., ultrasound vs. CT) may require different fusion strategies, but the paper does not discuss parameter migration or modality adaptation approaches.

Response: We agree that ultrasound, CT, MRI, and endoscopy have very different characteristics. In fact, LoMiX is designed precisely to learn such differences rather than hard‑code a single fusion recipe. Our Softplus loss weights are co‑optimized with the segmentation network, so the model automatically explores the full fusion space and down‑weights the less useful fused logits while relatively upweighting the informative ones for each dataset/modality. In practice, all weights numerically shrink as the overall loss decreases, but their relative pattern stabilizes: low‑resolution/early‑stage logits and certain operators (e.g., Concat) receive relatively much smaller final weights, while high‑resolution logits and AWF/Add remain influential; this is evidence of data‑driven adaptation to various datasets/modalities (see Supplementary Figs. S1-S8).

Because this weighting is learned jointly, no separate “parameter migration” or modality-specific tuning is needed: training (or fine‑tuning) on a new modality simply yields a different weight configuration. We already demonstrate this across heterogeneous modalities, such as CT organs (see Table 1), cardiac MRI (Table 2), ultrasound (Table S2), skin lesions (Table S2), and colon polyps (Table S4), without changing the architecture or hand‑tuning weights. We will make this modality‑adaptation mechanism more explicit in the camera-ready version. Equally important, LoMiX works at training‑time, so clinical deployment is unaffected (i.e., zero extra parameters or latency overhead during inference).


W1. Medical image analysis often requires 3D segmentation, but the paper has not yet verified the scalability and capability of LoMix on 3D architectures.

Response: Generalizing LoMiX to 3D is straightforward: simply replace the 2D convs and bilinear upsampling with 3D convs and trilinear upsampling; everything else remains unchanged. This is because we fuse predictions (class logits), not wide feature tensors: our goal is to synthesize intermediate "pseudo‑ensembled" predictions during training for richer supervision, with zero inference overhead.
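The 2D-to-3D change described above can be illustrated with a short sketch (our own hypothetical code, not from the paper): only the interpolation mode changes, from bilinear on 4D tensors to trilinear on 5D volumes, while fusion and weighting remain untouched:

```python
import torch
import torch.nn.functional as F

def upsample_logits_3d(logits, target_size):
    """Illustrative sketch of the 3D variant of logit upsampling.
    logits: list of 5D tensors (N, C, D, H, W) from decoder stages;
    target_size: (D, H, W) of the finest map. Bilinear (2D) simply
    becomes trilinear (3D); the fusion operators stay the same."""
    return [F.interpolate(x, size=target_size, mode="trilinear",
                          align_corners=False) for x in logits]
```

After this step, the same subset enumeration and learnable Softplus weighting apply unchanged to the upsampled 3D logit volumes.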

Adding all non‑empty subsets of the $L$ decoder predictions introduces only $2^L - L - 1$ extra logit maps (e.g., 11 for $L = 4$, 26 for $L = 5$); since U‑shaped networks rarely exceed five stages, these maps are finite, lightweight, and used only during training, so the incremental training compute/memory cost remains feasible and inference is still zero‑overhead. When GPU memory is low, either in 2D or 3D, standard optimizations such as gradient checkpointing or caching logits on the CPU further reduce compute requirements without altering the algorithm, thus keeping LoMiX practical on modest hardware (< 5 GB of extra GPU memory required for backpropagation to process a 96×96×96 volume with a four-stage network and 9 output classes). To show the feasibility of LoMiX with 3D segmentation networks, preliminary results of 3D MedNeXt with LoMiX are reported in Table R4 below.

Table R4: MedNeXt's performance on 3D Synapse 8-organ segmentation with Last Layer (LL), Deep Supervision (DS), MUTATION, and LoMiX. DICE scores (%) are reported for the Aorta, Gallbladder (GB), Left kidney (LK), Right kidney (RK), Liver, Pancreas (PC), Spleen (SP), and Stomach (SM). Our results (in bold) demonstrate that LoMiX achieves the best average DICE and HD95 scores among all supervision schemes.

| Methods | Avg. DICE | Avg. HD95 | Aorta | GB | LK | RK | Liver | PC | SP | SM |
|---|---|---|---|---|---|---|---|---|---|---|
| MedNeXt-M_K3 + LL | 86.22 | 6.62 | 91.96 | 75.59 | 87.11 | 85.14 | 96.81 | 78.31 | 91.26 | 83.61 |
| MedNeXt-M_K3 + DS | 86.63 | 8.31 | 91.61 | 78.91 | 90.40 | 85.94 | 94.72 | 74.95 | 92.10 | 84.43 |
| MedNeXt-M_K3 + MUTATION | 86.84 | 6.04 | 91.72 | 79.94 | 90.97 | 86.62 | 96.58 | 76.65 | 90.55 | 81.69 |
| MedNeXt-M_K3 + LoMiX (Ours) | 87.19 | 4.84 | 91.81 | 79.87 | 90.54 | 86.65 | 96.68 | 76.95 | 90.63 | 84.37 |

W2: The experiments were based only on publicly available benchmarks, and the generalization ability of LoMix on the data from different acquisition devices or hospitals has not been tested, which may conceal the risk of overfitting.

Response: We agree that clinical deployment demands robustness across scanners and sites. Our current evaluation already addresses overfitting risk by spanning multiple, heterogeneous public datasets and modalities (i.e., abdominal CT, cardiac MRI, breast ultrasound, dermoscopy, colonoscopy), each collected with different protocols and devices. Yet LoMiX improves on every dataset without manual tuning.

Because LoMiX is training‑only, parameter‑efficient, and does not increase model capacity (e.g., parameters and FLOPs), it actually behaves like a regularizer. In fact, LoMiX implicitly ensembles diverse multi‑scale logits only during training, thus reducing the chance of overfitting to dataset‑specific biases.

To make this explicit, in the revised paper, we will include a cross‑dataset experiment (e.g., train on one dataset/hospital, and test on another) as shown in Table R5 below.

Table R5: Cross‑dataset/hospital generalization of LoMiX. Using the PVT‑EMCAD‑B2 network, all models are trained for 200 epochs on the Kvasir polyp‑segmentation training set (900 images); the epoch with the best DICE (%) on the Kvasir validation split (100 images) is saved. We then evaluate generalizability on three external test sets (CVC‑ClinicDB, CVC‑ColonDB, ETIS‑LaribPolypDB) without further tuning. LoMiX produces the highest DICE on every dataset, confirming its superior ability to generalize across hospitals and acquisition devices.

| Methods | CVC‑ClinicDB | CVC‑ColonDB | ETIS‑LaribPolypDB |
|---|---|---|---|
| Last Layer | 80.59 | 75.31 | 71.64 |
| Deep Supervision | 81.87 | 76.16 | 75.84 |
| MUTATION | 81.88 | 76.39 | 75.97 |
| **LoMiX (Ours)** | **83.08** | **77.70** | **77.01** |

Comment

Dear Reviewer xQqf,

We sincerely thank you for the thoughtful follow-up feedback. Our responses are given below.

We believe that the performance drop you observe is mainly a consequence of domain shift rather than classical over-fitting. CVC-ClinicDB, CVC-ColonDB, and ETIS-LaribPolypDB are collected with different endoscopes, illumination settings, and bowel-prep protocols, so every method (including ours) loses DICE when tested on out-of-distribution datasets. With LoMiX, the checkpoint selected by validation DICE on Kvasir (93.51%) attains 93.45% on the held-out Kvasir test split (Supplementary Table S4), indicating that the model generalizes well within the source domain.

Importantly, LoMiX narrows the cross-center gap better than competing supervisions: it raises DICE by +2.4% to +5.4% on all three external datasets (over the baseline last-layer supervision in Table R5), thus demonstrating that the guidance from the extra multi-scale fused logits improves generalization.

To clarify this in the revised paper, we will conduct an in-depth analysis by (i) adding train/validation/test curves for Kvasir to visualize the tight in-domain correlation and (ii) including a discussion of the specific factors driving the cross-hospital gap and why LoMiX helps.

Thank you again for your constructive review. We would be happy to provide any additional information.

Sincerely,

The Authors

Comment

Thank you for your thoughtful responses. I believe that the performance decline (compared to the test results on the Kvasir dataset itself) shown in Table R5 indicates a certain degree of overfitting, and I recommend that the authors conduct a more in-depth analysis and discussion of this issue. Nevertheless, the method described in the paper still achieves better results than other methods. Based on the authors' detailed responses and the quality of the original manuscript, I am inclined to maintain my initial score.

Review
5

The authors introduce LoMix, a method for enhancing 2D semantic medical image segmentation through mixed-scale fusion of decoder features. LoMix combines multi-scale decoder logits using operations such as addition, multiplication, concatenation, and attention-based fusion within a Combinatorial Mutation Module. A Neural Architecture Search (NAS)-inspired strategy co-optimizes a softplus-weighted loss alongside model parameters to adaptively select effective fusion strategies and scales. The approach is integrated into UNet-like architectures (e.g., PVT-V2-B2 with EMCAD decoder) and evaluated on Synapse and ACDC datasets, including low-data settings and extensive ablations on fusion operations, NAS weighting, and backbone choices.

Strengths and Weaknesses

Strengths:

  1. Data-driven and adaptive fusion strategy: Unlike traditional deep supervision or fixed mutation-based fusion methods, LoMix introduces a dynamic mechanism that learns how to fuse multi-scale decoder logits during training. By leveraging a combinatorial set of operations (addition, multiplication, concatenation, attention) and jointly optimizing them through a NAS-inspired weighting scheme, LoMix adapts the fusion process to the specific characteristics of the data. This makes the method flexible to the semantic granularity and scale variance present in medical images, enabling a more informed, data-driven integration of decoder features instead of relying on manually designed fusion heuristics.
  2. Robustness in limited data regimes: Medical image segmentation often suffers from scarce annotated data due to the high cost of expert labeling. LoMix demonstrates strong performance even under constrained training data conditions. This robustness stems from its ability to dynamically prioritize the most informative scales and fusion paths, effectively regularizing the learning process and mitigating overfitting. This characteristic makes LoMix particularly practical for real-world clinical settings where annotation budgets are limited.

Weaknesses

  1. While the NAS-inspired loss weighting partially addresses the supervision imbalance across the combinatorially generated fused logits, all fusion paths are still computed and backpropagated regardless of their utility. This results in significant computational overhead, especially as the number of decoder stages increases. Moreover, some fusion operators, particularly multiplicative fusion, remain sensitive to low-confidence predictions early in training, potentially introducing unstable gradients before the loss weights converge to suppress them.
  2. The authors state (line 256) that LoMix helps in segmenting small and challenging abdominal organs, yet no analysis is provided to visualize which fusion types or decoder scales contribute most to those improvements. This would strengthen the claim and help explain LoMix’s behavior. Additionally, in Table 2, the performance gain between PVT-EMCAD-B with and without LoMix appears modest, and no statistical significance analysis is presented to support the observed differences. Also, unlike Table 1 (Synapse), Table 2 omits HD95 and mIoU metrics, which would be relevant for assessing organ boundary quality and overall segmentation agreement, especially since table space permits their inclusion.
  3. Although the authors acknowledge that LoMix is designed for 2D images and they recognize the weakness, all datasets used in experiments are intrinsically 3D. This mismatch between method and data dimensionality may limit its applicability in clinical practice unless a 3D-compatible version is developed.

Questions

  1. While the NAS-inspired loss weighting helps mitigate supervision imbalance across fusion paths, all paths are still computed and backpropagated during training. Can the authors justify this design in light of the significant computational overhead?
  2. The authors claim LoMix improves small and challenging organ segmentation. Can the authors provide any analysis or visualizations showing which decoder scales or fusion types (e.g., concatenation, multiplication) are most responsible for these improvements?
  3. In Table 2, the performance differences between PVT-EMCAD-B with and without LoMix appear small. Have the authors conducted any statistical tests to confirm that these improvements are significant?
  4. Why are HD95 and mIoU not reported for the multi-organ segmentation results in Table 2, despite being included for Synapse in Table 1? Including them would allow more complete comparisons.

Limitations

The authors adequately addressed the limitations and potential negative societal impact of their work

Final Justification

The authors have extensively addressed all of my comments through detailed explanations and additional experiments. I have also reviewed the concerns raised by other reviewers. While I partially agree with the two remaining weaknesses highlighted by reviewers xQqf (overfitting) and vjqb (memory requirements), I have no further concerns from my side. I therefore maintain my score as 5 (Accept).

Formatting Concerns

No Paper Formatting Concerns

Author Response

Official Response to Reviewer 5cjt

We sincerely thank the reviewer for the constructive feedback. Below, we address each concern and outline how we plan to incorporate these suggestions in our revised manuscript.


W1.1 & Q1. While the NAS-inspired loss weighting partially addresses the supervision imbalance across the combinatorially generated fused logits, all fusion paths are still computed and backpropagated regardless of their utility. This results in significant computational overhead, especially as the number of decoder stages increases. Can the authors justify this design in light of the significant computational overhead?

Response: Thank you for the thoughtful critique. Why keep all the fusion paths?

We deliberately retain and back-propagate all fusion paths for three intertwined reasons:

  • Diversity + automatic weighting: We deliberately enumerate all non‑empty subsets to maximize logit diversity as different operator/subset pairs emphasize different spatial scales and error modes. Our NAS‑inspired Softplus weights then assign distinct, learned contributions to each fusion path; the unhelpful fusion paths are automatically down‑weighted, while the useful ones retain influence. Pruning too early would remove good candidates before the optimizer can assess them.

  • Computational overhead is modest and training‑only: All fusions operate on C‑channel class logits (C = #classes, typically small), not on wide feature tensors. Even with 4–5 decoder stages, the extra tensors are at most a few dozen small maps and a few 1×1 (or 1×1×1) convs. In practice, this adds only to the training FLOPs/memory, while inference remains unchanged (i.e., zero overhead). When GPU memory is low for training, standard optimizations such as gradient checkpointing or caching logits on CPU further reduce compute/memory requirements without altering the algorithm, thus keeping LoMiX practical on modest hardware.

  • Optional pruning is easy: If desired, one can prune paths whose weights stay near zero after a short warm‑up or sample a top‑K set; this is orthogonal to LoMiX and straightforward to add.

To sum up, computing and backpropagating all paths is a conscious design choice to ensure maximal supervision diversity while letting the NAS‑inspired weighting learn which fused logits matter most. The added cost is small, confined to training, and the result is consistent performance gains without any inference overhead.
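A minimal sketch of the NAS-inspired Softplus weighting described above (the path names, toy loss values, and parameter names are hypothetical illustrations; the real module weights the segmentation losses of every original and fused logit map and co-optimizes the scalars with the network):

```python
import math

def softplus(theta):
    # numerically stable softplus: log(1 + exp(theta)); always > 0,
    # so every fusion path keeps a strictly positive loss weight
    return math.log1p(math.exp(-abs(theta))) + max(theta, 0.0)

# toy per-path loss values (hypothetical numbers for illustration)
path_losses = {"P4": 0.42, "add(P3,P4)": 0.38, "mul(P1..P4)": 0.55}

# one learnable scalar per original/fused map; a negative init keeps the
# initial weights small (softplus(-3) ~ 0.049) so no path dominates early
theta = {name: -3.0 for name in path_losses}

def total_loss(theta, path_losses):
    # weighted sum over all supervised maps; gradients flow into both the
    # network (through path_losses) and the weights (through theta)
    return sum(softplus(theta[k]) * path_losses[k] for k in path_losses)

print(round(total_loss(theta, path_losses), 4))  # ~ 0.0656
```

Using softplus rather than a normalized (e.g., softmax) parameterization matches the paper's design: the weights are unconstrained positive scalars, so down-weighting one path never forces the others to grow.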

W1.2. Moreover, some fusion operators, particularly multiplicative fusion, remain sensitive to low-confidence predictions early in training, potentially introducing unstable gradients before the loss weights converge to suppress them.

Response: We multiply logits, not probabilities, and initialize all Softplus weights to small values, so no path dominates early. If an operator does inject noisy gradients, its weight shrinks. Empirically, our training is stable (see weight dynamics in Supplementary Figs. S1-S8), and removing multiplicative paths can hurt accuracy (see operator ablations in Fig. 3 and Supplementary Table S6), thus indicating they provide complementary information if properly weighted.


W2.1 & Q2. The authors state (line 256) that LoMix helps in segmenting small and challenging abdominal organs, yet no analysis is provided to visualize which fusion types or decoder scales contribute most to those improvements. This would strengthen the claim and help explain LoMix’s behavior. Can the authors provide any analysis or visualizations showing which decoder scales or fusion types (e.g., concatenation, multiplication) are most responsible for these improvements?

Response: We already report organ‑wise DICE scores for every operator combination in Table S6 and the learned weight dynamics in Supplementary Figs. S1-S8. Together, these results show that paths involving the higher-resolution decoder stages and AWF/Mult operators consistently correlate with the largest gains on small/low‑volume abdominal organs, the scenario highlighted in lines 256-257 (Section 4.2).

To make this linkage clearer, we will revise the camera-ready version as follows:

  • Add clear cross‑references in the main text from the small‑organ claim to Table S6 and Supplementary Figs. S1-S8.

  • Provide qualitative overlays (baseline vs. LoMiX) on representative small organs to visually substantiate the quantitative trend.

  • Add visualization of synthesized/fused predictions.

These revisions will directly connect specific fusion paths and scales to the observed boosts on challenging organs, thus strengthening both the interpretability and credibility of our claim.


W2.2. & Q3. Additionally, in Table 2, the performance gain between PVT-EMCAD-B2 with and without LoMix appears modest, and no statistical significance analysis is presented to support the observed differences. Have the authors conducted any statistical tests to confirm that these improvements are significant?

Response: Although the average DICE score gain on PVT‑EMCAD‑B2 looks small, it is consistent and statistically significant (see Table R2 below). Using a two‑sided Wilcoxon signed‑rank test over 12 test scans (Holm‑corrected), LoMiX is significantly better than every training baseline: MUTATION (p = 0.0122), Deep Supervision (p = 0.0049), and Last Layer (p = 0.0015). We will report these p‑values in main Table 1 to show significance. Again, this improvement comes at zero inference overhead.

Table R2: DICE (%) on 12 test scans of BTCV 8-organ segmentation with PVT‑EMCAD‑B2. p-values (two‑sided Wilcoxon signed‑rank [3], Holm‑corrected [4]) compare each baseline to LoMiX; all are < 0.05, indicating LoMiX’s improvement is statistically significant despite the modest mean gap.

| Method | DICE (%) | p‑value (vs. LoMiX, Holm‑corrected) | Significant (α = 0.05) |
|---|---|---|---|
| **LoMiX (Ours)** | **85.07** | – | – |
| MUTATION | 83.63 | 0.0122 | Yes |
| Deep Supervision | 82.90 | 0.0049 | Yes |
| Last Layer | 80.94 | 0.0015 | Yes |
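For reference, the Holm step-down adjustment used for the p-values above can be sketched in a few lines (the input p-values below are arbitrary illustrative numbers, not the raw values behind Table R2):

```python
def holm_correct(pvals):
    """Holm (1979) step-down adjustment of raw p-values.

    Multiply the k-th smallest p-value (0-indexed) by (m - k), cap at 1,
    and enforce monotonicity of the adjusted values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, adj)  # keep adjusted p monotone
        adjusted[idx] = running_max          # report in the input order
    return adjusted

print(holm_correct([0.01, 0.04, 0.03]))  # ~ [0.03, 0.06, 0.06]
```

A baseline is then declared significant when its adjusted p-value falls below α = 0.05, exactly the decision rule reported in the last column of Table R2.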

W2.3. & Q4. Also, unlike Table 1 (Synapse), Table 2 omits HD95 and mIoU metrics, which would be relevant for assessing organ boundary quality and overall segmentation agreement, especially since table space permits their inclusion. Why are HD95 and mIoU not reported for the multi-organ segmentation results in Table 2, despite being included for Synapse in Table 1? Including them would allow more complete comparisons.

Response: We agree that HD95 and mIoU are informative for boundary quality and overall agreement. They were omitted from the initial Table 2 to stay consistent with the EMCAD paper’s [5] reporting protocol, but we will definitely include both metrics in the revised paper.


W3. Although the authors acknowledge that LoMix is designed for 2D images and they recognize the weakness, all datasets used in experiments are intrinsically 3D. This mismatch between method and data dimensionality may limit its applicability in clinical practice unless a 3D-compatible version is developed.

Response: LoMiX is dimension‑agnostic: it operates on C‑channel class logit maps at the loss level. Extending LoMix to 3D is straightforward: simply replace Conv2d with Conv3d and bilinear with trilinear upsampling; the fusion/weighting logic remains identical and the cost still scales with the (small) number of decoder stages (rarely exceeding five stages). When GPU memory is low, either in 2D or 3D, standard optimizations such as gradient checkpointing or caching logits on CPU further reduce compute/memory requirements without altering the algorithm, thus keeping LoMiX practical on modest hardware (< 5 GB extra GPU memory required for backpropagation to process a 96×96×96 volume with a four-stage network and 9 output classes).

Of note, we also reported purely 2D tasks (ISIC2018 skin lesions; Kvasir/CVC-ColonDB/ETIS-LaribPolypDB polyps in Supplementary Tables S2 and S4). For the intrinsically 3D datasets, we followed the commonly used memory‑efficient slice‑based protocol as in [5,6], while computing metrics on full volumes. To show the feasibility of LoMiX with 3D segmentation networks, the new preliminary results of 3D MedNeXt with LoMiX are reported in Table R3 below.

Table R3: MedNeXt's performance on 3D Synapse 8-organ segmentation with Last Layer (LL), Deep Supervision (DS), MUTATION, and LoMiX. DICE scores (%) are reported for Gallbladder (GB), Left kidney (KL), Right kidney (KR), Pancreas (PC), Spleen (SP), and Stomach (SM). Our results (in bold) demonstrate that LoMiX achieves the best average DICE and HD95 scores among all supervisions.

| Methods | Avg. DICE | Avg. HD95 | Aorta | GB | KL | KR | Liver | PC | SP | SM |
|---|---|---|---|---|---|---|---|---|---|---|
| MedNeXt-M_K3 + LL | 86.22 | 6.62 | 91.96 | 75.59 | 87.11 | 85.14 | 96.81 | 78.31 | 91.26 | 83.61 |
| MedNeXt-M_K3 + DS | 86.63 | 8.31 | 91.61 | 78.91 | 90.40 | 85.94 | 94.72 | 74.95 | 92.10 | 84.43 |
| MedNeXt-M_K3 + MUTATION | 86.84 | 6.04 | 91.72 | 79.94 | 90.97 | 86.62 | 96.58 | 76.65 | 90.55 | 81.69 |
| **MedNeXt-M_K3 + LoMiX (Ours)** | **87.19** | **4.84** | 91.81 | 79.87 | 90.54 | 86.65 | 96.68 | 76.95 | 90.63 | 84.37 |

References

[3] Wilcoxon, F. Individual comparisons by ranking methods. Biometrics bulletin, 1945.

[4] Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, 1979.

[5] Rahman, M.M., et al. Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation. CVPR, 2024.

[6] Chen, J., et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.

Comment

Thank you to the authors for the detailed and thoughtful rebuttal. I found the response insightful and appreciate the inclusion of additional experiments. I have a few follow-up questions, motivated by a genuine interest in further understanding the underlying behaviors.

Thank you for confirming LoMiX’s compatibility with 3D networks and for providing preliminary results using MedNeXt on the Synapse dataset. As a follow-up:

  • Given that LoMiX’s fusion and weighting mechanisms are dimension-agnostic, I’m curious whether you have observed any distinctive patterns in the dynamics of operator weights (e.g., Add, Mult, Concat, AWF) when training in 3D settings with MedNeXt compared to 2D cases. Specifically, have you noticed any trends that might suggest LoMiX is sensitive to the underlying nature of volumetric data, such as spatial continuity, anisotropy, or sparsity, potentially influencing how different fusion operators are prioritized during training?

For instance:

  • Does multiplicative fusion become less dominant in 3D due to increased spatial sparsity and volume anisotropy?
  • Does AWF retain its relative importance, or behave differently due to increased spatial context and receptive field depth?
  • Have you observed any temporal delay or flattening in the decay dynamics of redundant fusion paths, perhaps due to longer convergence times in 3D?
Comment

We sincerely thank the reviewer for the thoughtful follow-up questions. Our responses are given below.

Given that LoMiX’s fusion and weighting mechanisms are dimension-agnostic, I’m curious whether you have observed any distinctive patterns in the dynamics of operator weights (e.g., Add, Mult, Concat, AWF) when training in 3D settings with MedNeXt compared to 2D cases…

The operator importance ranking is different in 3D (with Mul and Add generally leading, AWF close behind, and Concat contributing the least). Each operator’s role appears to translate meaningfully into 3D (e.g., the multiplicative path focuses on regions of strong agreement among decoder stages, AWF incorporates weighted contributions among decoder stages).

We also observed differences in training dynamics: 3D requires longer training for the fusion weights to decay, so all fusion modes stay relevant longer into training. Overall, these findings give us confidence that LoMiX is robust in volumetric settings and that it automatically learns to prioritize fusion operators in a way that aligns with the nature of 3D data without the need for any hand-tuning to address anisotropy or continuity. We hope the detailed analysis below clarifies how LoMiX seamlessly adapts to 3D scenarios.

Does multiplicative fusion become less dominant in 3D due to increased spatial sparsity and volume anisotropy?

No. In fact, Mul fusion becomes more dominant in our experiments on the Synapse dataset with the 3D MedNeXt network. At the best validation point, the cumulative softplus weights per operator in 3D are: Mul = 0.0019, Add = 0.0013, AWF = 0.0012, Concat = 0.0010, whereas in 2D we have: AWF = 0.0034, Add = 0.0020, Mul = 0.0017, Concat = 0.0006.

The reason multiplicative fusion becomes dominant in 3D is likely due to volumetric spatial continuity, which allows it to excel at highlighting consistent signals across slices. In 3D, organs and structures span multiple slices, so when multiple decoder logits strongly agree on a region, the multiplicative fusion amplifies that agreement. This effectively acts like an “AND” gate, filtering out background noise by only passing through logits that are strong in multiple decoder stages – a valuable property given the abundance of background voxels in large 3D scans.

We observed that the data anisotropy (the coarser 2.0 mm resolution in z-axis) did not hamper this effect: LoMiX’s gating adjusted to the slice spacing, and we did not observe any noticeable reduction in Mul’s contribution due to the anisotropy. The 3D convolutional layers likely learned to compensate for the uneven resolution, so features still align well across axes – thus the multiplicative fusion could reliably detect consistent 3D logits.


Does AWF retain its relative importance, or behave differently due to increased spatial context and receptive field depth?

Yes, but only relatively. In our 2D experiments on the Synapse dataset with the PVT-EMCAD-B2 network, AWF received the highest weight (0.0034), while in our 3D experiments on the Synapse dataset with the MedNeXt network it yielded the lead to Mul (0.0019), yet still retained a weight comparable to Add (0.0013 vs. 0.0012). AWF therefore remains important for fine-grained channel re-weighting, but 3D data encourage the model to lean more on gating (Mul) for coarse, volume-consistent cues. In the end, AWF continues to serve as a complementary fusion path that enhances the fused logit maps in 3D volumes; moving to deeper volumetric data changes only its relative weight, not its role.


Have you observed any temporal delay or flattening in the decay dynamics of redundant fusion paths, perhaps due to longer convergence times in 3D?

Yes. In 2D experiments on the Synapse dataset with the PVT-EMCAD-B2 network, the low-value operators (especially Concat) decayed to ≈ 0.0006 by epoch 300; in 3D experiments on the Synapse dataset with the MedNeXt network, the same operator still carried 0.0010 at the validation peak (by epoch 2416). The more complex 3D feature space evidently requires extra iterations before LoMiX is confident enough to zero out certain paths. When training was extended to more epochs, the 3D weights continued to shrink, thus mirroring the eventual 2D sparsity. This extended period during which all fusion types remained engaged suggests that LoMiX in 3D explores a richer combination of fusion strategies and only prunes the ones it does not need once the model has seen sufficient volumetric data to be sure of its decisions. From a performance perspective, this means that at the optimal stopping point our 3D model was still leveraging all four fusion operators to some degree, which likely helped it maximize accuracy.

Of note, LoMiX was originally designed for 2D segmentation. Building on this, our future work will focus on designing a dedicated 3D LoMiX variant that explicitly accounts for volumetric anisotropy and 3D network architectures.

Comment

The authors have comprehensively addressed all of my questions. I have no further concerns and would like to sincerely thank them once again for their time and detailed work.

Comment

Dear Reviewer 5cjt,

Thank you once again for your time in reviewing our paper. We are pleased to have addressed all of your concerns and questions. We would be happy to provide any additional information.

Sincerely,

The Authors

Review
4

Neural networks of UNet like shape are widely used in medical image segmentation. These networks include an encoder that processes the input image to a sequence of increasingly abstract representations with lower dimension, and a decoder that combines these representations. Classically when training these neural networks, a signal is generated only by comparing the final layer with the target labels but work on deep supervision has shown that improvements can be made by introducing such a signal for different layers in the network. This work introduces a new kind of supervision method that generates a set of “fusion maps” by up sampling different stages in the decoder and then fuse them together with various operators. The fusion maps are then weighted together and added to the loss. Importantly, the weights are learnable which allows for more flexible supervision compared to previous work.

Strengths and Weaknesses

Quality: This work is technically sound.

Clarity: This paper is for the most part clearly written and well-structured. However, I think it would be nice to list all of the experiments as tables (some results seem to only be presented in text which makes it hard to find when jumping around in the paper).

Significance: Comparing and evaluating medical image segmentation methods is hard due to the lack of a large consensus test suite. This has led to a great many papers on architectural improvements of the UNet with claims of improving Dice, but few actually convince the community of improvements good enough to be worth actual implementation. I think this paper unfortunately is likely to fall into this category.

Originality: From what I can tell the leap from [18] doesn’t seem to technically be very high. However, the work is thorough and the results show improvement over the previous results in [18].

Questions

Question 1: On line 145: it looks like the number of channels in each decoder layer is constant. Is this a requirement? In that case, how would a UNet architecture that increases the number of channels when getting closer to the bottleneck be compatible with the introduced extension?

Question 2: Would it be computationally feasible to generalize the method to a 3D setting? Is there some kind of exponential increase in the number of required fusion maps that gets out of hand?

Question 3: Are the W_S in Equation (3) normalized to 1? I find it surprising that Figure S8 in the supplementary material shows that addition-fused logit maps get weights so much higher than the concat ones, yet the concat ones are still shown to be among the most important in the ablation study. Is there an explanation for why this is the case?

Limitations

Yes.

Final Justification

My main concern is that the cost of implementing this methodology is so big that it outweighs the benefits. When dealing with image segmentation, particularly in 3D, the GPU memory tradeoff is always present - using a couple of gigabytes for supervision to gain 2% Dice is potentially a couple of gigabytes that could be used for some other technique to gain 2% Dice. However, the authors provided me with results that showed 3D feasibility and patiently explained the memory requirements. Even though these requirements sound steep, I think it is compensated by the attractiveness of the method being automatic and in particular that it does not affect inference. Furthermore, I think that further work could investigate trimming down the memory requirements. Based on the above, I will change my score to borderline accept.

Formatting Concerns

Nothing noticed

Author Response

We sincerely thank the reviewer for the constructive feedback. Our responses are given below.

Q1: On line 145: it looks like the number of channels in each decoder layer is constant. Is this a requirement? In that case, how would a UNet architecture that increases the number of channels when getting closer to the bottleneck be compatible with the introduced extension?

Response: In our problem definition (Lines 143–145), C is the number of segmentation classes (C = #classes), not the UNet’s internal feature-channel width. LoMix does not require constant channels in the decoder, i.e., a UNet that widens toward the bottleneck is fully compatible. The only requirement is that each decoder stage output is projected onto a C-channel class logit map at a common resolution so these maps can be mixed. We will rephrase the description to make this clear.
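To illustrate the compatibility argument (stage widths and the class count below are hypothetical), a per-stage 1×1 projection head maps any feature width onto the same C logit channels:

```python
def seg_head_params(stage_channels, num_classes):
    # A 1x1 conv projecting C_in feature channels to C (= num_classes) logit
    # channels needs C_in * C weights plus C biases, regardless of how the
    # decoder widens toward the bottleneck.
    return {c_in: c_in * num_classes + num_classes for c_in in stage_channels}

# hypothetical widening decoder (64 -> 512 channels), 9 segmentation classes
print(seg_head_params([64, 128, 256, 512], 9))
# -> {64: 585, 128: 1161, 256: 2313, 512: 4617}
```

After this projection every stage output shares the same C channels, so the fusion operators apply uniformly; the decoder's internal feature widths are never touched.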


Q2: Would it be computationally feasible to generalize the method to a 3D setting? Is there some kind of exponential increase in the number of required fusion maps that gets out of hand?

Response: Yes, generalizing LoMiX to 3D is straightforward: simply replace the 2D convs and bilinear upsampling with 3D convs and trilinear upsampling; everything else remains unchanged. This is because we fuse predictions (class logits), not wide feature tensors: our goal is to synthesize intermediate “pseudo‑ensembled” predictions during training for richer supervision, with zero inference overhead.

Adding all non‑empty subsets of the $L$ decoder predictions introduces only $2^L - L - 1$ extra logit maps (e.g., 11 for $L=4$, 26 for $L=5$); since U‑shaped networks rarely exceed five stages, these maps are finite, lightweight, and used only during training, so the incremental training compute/memory cost remains feasible and inference is still zero‑overhead. When GPU memory is low, either in 2D or 3D, standard optimizations such as gradient checkpointing or caching logits on CPU further reduce compute requirements without altering the algorithm, thus keeping LoMiX practical on modest hardware (< 5 GB extra GPU memory required for backpropagation to process a 96×96×96 volume with a four-stage network and 9 output classes). To show the feasibility of LoMiX with 3D networks, the new preliminary results of 3D MedNeXt with LoMiX are reported in Table R1 below.
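The growth of the number of fused maps can be checked directly (a minimal counting sketch; `L` denotes the number of decoder stages):

```python
from itertools import combinations

def count_fused_maps(L):
    # all non-empty subsets of the L decoder logits, excluding the L
    # singletons (which are the original, unfused predictions)
    subsets = [s for r in range(2, L + 1)
               for s in combinations(range(L), r)]
    assert len(subsets) == 2**L - L - 1  # closed-form count
    return len(subsets)

print([count_fused_maps(L) for L in (3, 4, 5)])  # -> [4, 11, 26]
```

So even in the worst realistic case (L = 5) only 26 extra maps are supervised, each a C-channel logit tensor rather than a wide feature map, and the growth is bounded by the small number of decoder stages rather than by image size.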

Table R1: MedNeXt's performance on 3D Synapse 8-organ segmentation with Last Layer (LL), Deep Supervision (DS), MUTATION, and LoMiX. DICE scores (%) reported for Gallbladder (GB), Left kidney (KL), Right kidney (KR), Pancreas (PC), Spleen (SP), and Stomach (SM). Our results (in bold) demonstrate that LoMiX achieves the best average DICE and HD95 scores among all supervisions.

| Methods | Avg. DICE | Avg. HD95 | Aorta | GB | KL | KR | Liver | PC | SP | SM |
|---|---|---|---|---|---|---|---|---|---|---|
| MedNeXt-M_K3 + LL | 86.22 | 6.62 | 91.96 | 75.59 | 87.11 | 85.14 | 96.81 | 78.31 | 91.26 | 83.61 |
| MedNeXt-M_K3 + DS | 86.63 | 8.31 | 91.61 | 78.91 | 90.40 | 85.94 | 94.72 | 74.95 | 92.10 | 84.43 |
| MedNeXt-M_K3 + MUTATION | 86.84 | 6.04 | 91.72 | 79.94 | 90.97 | 86.62 | 96.58 | 76.65 | 90.55 | 81.69 |
| **MedNeXt-M_K3 + LoMiX (Ours)** | **87.19** | **4.84** | 91.81 | 79.87 | 90.54 | 86.65 | 96.68 | 76.95 | 90.63 | 84.37 |

Q3: Are the W_S in Eq. (3) normalized to 1? I find it surprising that Figure S8 in the supplementary material shows that addition-fused logit maps get weights so much higher than the concat ones, yet the concat ones are still shown to be among the most important in the ablation study. Is there an explanation for why this is the case?

Response: No. W_S in Eq. (3) is the 1×1 convolution weight matrix used inside the concat fusion; it is a standard linear projection and is not normalized. The bars in Fig. S8, however, depict the Softplus loss weights w from Eq. (6) at the best epoch: independent scalar factors that rescale each loss term and are likewise unconstrained, so their magnitudes should not be interpreted as direct "importance" scores.

The apparent mismatch arises because concat’s contribution is better reflected by its complementary signal and training dynamics, not by its final scalar weight. Channel-wise concatenation injects cross‑stage information that additive/multiplicative operators may not capture, which is why ablating concat degrades DICE. Moreover, Fig. S8 shows only the best epoch; earlier curves (Supplementary Figs. S1-S2, S6) reveal that concat-fused logits receive higher weights during earlier training epochs, helping shape the representation before the optimizer redistributes emphasis. In short, a smaller final w does not imply low utility; concat remains important because of the distinct supervision it provides. We will clarify this aspect in our revised paper.


W1: It would be nice to list all of the experiments as tables.

Response: Thank you for the suggestion. All quantitative results that appear as plots in the main paper are already reported as tables in the Supplementary pdf file (i.e., Tables S3, S6-S8 correspond to Figs. 2–5). For the camera‑ready version, we will (i) add explicit cross‑references (in the main text) to these tables, and (ii) move key results into compact tables in the main paper as space permits. This will make it easier to locate the experimental results without having to search through the entire text.


W2. ...very many different papers have been written on architectural improvements of the UNet with claims of improving Dice, but few actually convince the community of improvements good enough to be worth actual implementation. I think this paper unfortunately is likely to fall into this category.

Response: LoMix is not another UNet variant but a new training‑time supervision strategy. To clarify: we do not alter the encoder/decoder blocks and we do not add new stages or resolutions, so inference remains exactly the same (no additional parameters, zero latency overhead). Our paper proposes a fundamentally new supervision methodology, rather than a new architecture for efficient and accurate image segmentation.

In more detail: during training, LoMix mixes the existing multi‑stage logit outputs to synthesize additional intermediate predictions, thus enriching supervision without introducing new resolution features. A NAS‑inspired Softplus weighting scheme then automatically learns how much each original or fused logit should influence the overall loss. This is why our contribution is about how to supervise, not about a new architecture.
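The mechanism can be sketched in a few lines of plain Python. This is a simplified illustration, not our implementation: flat lists of floats stand in for the real C×H×W logit tensors (which are upsampled to a common resolution before fusion), only the additive operator is shown, and the function names (`additive_fusions`, `softplus`) are ours for this sketch:

```python
import itertools
import math

def softplus(x):
    # Smooth positive reparameterization used for the learnable loss weights
    return math.log1p(math.exp(x))

def additive_fusions(stage_logits):
    """Fuse every multi-stage subset (size >= 2) of the decoder logits additively.

    `stage_logits` is a list of L equally shaped logit maps; here each map is
    a flat list of floats standing in for a real C x H x W tensor.
    """
    L = len(stage_logits)
    fused = {}
    for r in range(2, L + 1):
        for subset in itertools.combinations(range(L), r):
            fused[subset] = [sum(v) for v in zip(*(stage_logits[i] for i in subset))]
    return fused

# Toy example: L = 4 decoder stages, each a 3-element logit map
stages = [[0.1 * (i + 1)] * 3 for i in range(4)]
fused = additive_fusions(stages)
assert len(fused) == 2**4 - 4 - 1  # 11 "mutant" maps per fusion operator

# Every original or fused map gets an unconstrained scalar theta trained
# jointly with the network; its loss weight is softplus(theta), so weights
# stay positive but are never normalized against each other.
thetas = [0.0] * (len(stages) + len(fused))
loss_weights = [softplus(t) for t in thetas]
assert all(w > 0 for w in loss_weights)
```

In the full method, each fusion operator (add, mul, concat, AWF) is applied over these subsets and each resulting map contributes one Softplus-weighted term to the training loss; at inference, all of this is discarded.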

The benefit of LoMix is most evident in limited‑data scenarios (see Fig. 2 and Supplementary Table S3), where supervision is the bottleneck: by implicitly ensembling stage predictions during training, LoMix provides richer, more diverse signals and yields consistent DICE gains across various networks and datasets. In other words, rather than contributing yet another UNet variant, we offer a new, general, architecture‑agnostic supervision strategy that the community can adopt with minimal code changes and no inference overhead.


W3. From what I can tell the leap from [18] doesn’t seem to technically be very high. However, the work is thorough and the results show improvement over the previous results in [18].

Response: LoMix makes a significant advancement over MUTATION [18] by turning supervision into an automatic, learnable, and architecture‑agnostic process that enables:

  • Richer fusion space for diverse logit synthesis: Instead of a single, fixed combinatorial mutation, LoMix defines a library of operators (Add, Mult, Concat, AWF, etc.) and systematically considers all non‑empty subsets of stage logits. This yields a broad spectrum of intermediate predictions, capturing complementary spatial cues that a single operator (i.e., Add), as in [18], cannot capture alone.
  • Fully automatic supervision via NAS‑inspired Softplus weighting: We replace the uniform weights (i.e., 1.0) in [18] with a Softplus weighting scheme that learns in an end‑to‑end fashion how much each original or fused logit should contribute to the loss. As such, no heuristic tuning and no per‑dataset tweaking are needed; the weighting adapts itself during training, making supervision fully automatic.
  • Training‑only, plug‑and‑play module with zero inference overhead: LoMix operates purely at the logit/loss level. We do not modify the encoder/decoder blocks, we do not add stages, and we do not keep any extra heads at inference time. The deployed network remains identical: no extra parameters and no additional latency, removing practical barriers to adoption.
  • Targets data‑limited scenarios: Our largest gains appear when data are limited, precisely the scenario where supervision quality (not the network architecture) is the bottleneck. By implicitly ensembling multi‑stage predictions during training, LoMix delivers consistent DICE improvements across all backbones and datasets. This trend is not observed in [18].
  • Rigorous analysis and robustness checks: We go beyond headline numbers: operator and subset ablations (Fig. 3 and Supplementary Table S6), weight‑dynamics curves over training (Supplementary Figs. S1-S8), cross‑architecture evaluations (Fig. 5 and Supplementary Table S8), and data‑efficiency assessments (Fig. 2 and Supplementary Table S3) all substantiate why and how LoMix works. This breadth of experimental evidence is absent in [18].

To sum up, LoMix introduces a new, general, learnable, and fully automatic supervision method; this is a clear technical and practical contribution beyond the fixed, single-operation mutation strategy of [18].

[18] Rahman, M.M. and Marculescu, R. Multi-scale hierarchical vision transformer with cascaded attention decoding for medical image segmentation. MIDL, 2023.

Comment

I want to thank the authors for a very thorough rebuttal. Regarding Q2, I think the answer was informative, and I very much appreciate the included experiments, but I have some follow-up questions.

  • Is the number of extra logit maps (2^L-L-1) for each class or constant?
  • The example taken with 96x96x96 for computing the extra memory requirements - what was the architecture considered?
  • How much extra memory would be required for back-propagation with the same setup if 128x128x128 was considered instead?
  • Was the input size used in the reported experiments 96x96x96?
Comment

We sincerely thank the reviewer for the thoughtful follow-up questions. Our responses are given below.

FQ1: Is the number of extra logit maps (2^L-L-1) for each class or constant?

Response: 2^L − L − 1 counts all possible multi-stage fusions of the L decoder stages (all 2^L − 1 non-empty subsets minus the L singletons), and it is therefore independent of the number of classes C. In other words, each fused logit map has C channels, but the number of fused logit maps (i.e., 2^L − L − 1) does not grow with C.
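The count can be verified with a short enumeration. This is an illustrative check, not our training code; the function names are ours:

```python
from itertools import combinations

def num_fused_maps(L):
    # All 2^L - 1 non-empty subsets of the L stage outputs,
    # minus the L singletons (the original, unfused maps).
    return 2**L - L - 1

def num_fused_maps_enum(L):
    # Explicit enumeration over subset sizes 2..L gives the same count.
    return sum(1 for r in range(2, L + 1) for _ in combinations(range(L), r))

for L in range(2, 8):
    assert num_fused_maps(L) == num_fused_maps_enum(L)

# For the L = 4 decoder stages used in our profiling, this gives 11 fused
# maps, each of shape (C, H, W); the count does not depend on C.
assert num_fused_maps(4) == 11
```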


FQ2. The example taken with 96x96x96 for computing the extra memory requirements - what was the architecture considered?

Response: We used MedNeXt-M (kernel size = 3) [11] to compute the extra memory requirements and to produce all new results in Table R1.

[11] Roy, S., et al. Mednext: transformer-driven scaling of convnets for medical image segmentation. MICCAI, 2023.


FQ3. How much extra memory would be required for back-propagation with the same setup if 128x128x128 was considered instead?

Response: For 128×128×128 input patches, LoMix would require 16.85 GiB extra memory over the last-layer (final-stage only) supervision baseline. The per-operator contributions at 128×128×128 are as follows: add 2.53 GiB, mul 6.06 GiB, awf 5.73 GiB, concat 2.53 GiB. Please see Table R7 below for additional details.

Table R7: Memory requirements for different supervision methods. MedNeXt-M (kernel size = 3) with the last L = 4 decoder stages and C = 9 output classes is used for this profiling.

| Method | 96×96×96 input | 128×128×128 input | Extra memory at 128×128×128 vs. Last-Layer |
|---|---|---|---|
| Last-Layer | 10.07 GiB | 18.61 GiB | (baseline) |
| Deep Supervision | 10.07 GiB | 18.61 GiB | 0 GiB |
| MUTATION | 11.16 GiB | 21.14 GiB | +2.53 GiB |
| LoMix (Ours) | 18.74 GiB | 35.46 GiB | +16.85 GiB |

This is worst-case profiling that considers only gradient checkpointing. Further reductions are feasible (e.g., chunk-wise backward passes or selective CPU off-loading), but they were not used in our current experiments. Of note, since 96³ and 128³ are the most common 3D patch sizes, these overheads are reasonable for training purposes.


FQ4: Was the input size used in the reported experiments 96×96×96?

Response: Yes, we used 96×96×96 input patches for the results reported in Table R1.

Comment

I have no more questions and want to thank the authors again for answering all of my questions so thoroughly.

Comment

Dear Reviewer vjqb,

Thank you once again for your time in reviewing our paper. We are pleased to have addressed all of your concerns and questions. We would be happy to provide any additional information that might help you reassess the initial scores.

Sincerely,

The Authors

Final Decision

The paper presents a novel training-time supervision method to improve U-shaped networks for medical image segmentation. It addresses the limitations of traditional supervision, which either focuses only on the final layer or uses equally weighted deep supervision on multi-scale logits.

The most important reason for the decision to accept the paper was the authors' successful rebuttal, which directly and convincingly addressed the main weaknesses. While the paper was initially seen as "borderline," the authors provided crucial additional experiments and clarifications that demonstrated the method's feasibility and robustness. The authors’ detailed explanations about the computational requirements for a 3D setting and their provision of preliminary 3D results alleviated a major reviewer concern.