Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy
Abstract
Reviews and Discussion
This paper investigates the instability of cascaded image restoration and object detection under adverse conditions, and proposes a smooth Lipschitz continuity framework, namely LR-YOLO, which integrates restoration into the detection network and adds regularization in both the input and parameter spaces. The paper attributes the instability to a mismatch in Lipschitz continuity: restoration networks are smooth, while detection networks are sensitive to small changes. LR-YOLO improves detection robustness and stability at low computational cost, outperforming prior approaches on haze and low-light benchmarks.
Strengths and Weaknesses
Strengths:
- This paper's motivation is quite clear and straightforward.
- The implementation looks simple and effective.
Weaknesses:
- The low-light images are synthesized by gamma correction. However, in real scenes, low-light images are typically accompanied by noise.
- In Line 206, it seems that the detection loss can also suppress the Lipschitz constant. Since the main task is detection, the gradient of the detection loss is generally aligned with the Jacobian norm, so the detection loss also satisfies the assumption in Line 204. Overall, suppressing the Lipschitz constant does not seem to be driven solely by the restoration loss. I am also concerned whether the gradient-norm penalty alone is enough to harmonize the instability.
Presentation:
- Figure 2 (a) does not clearly explain why the purple dot cannot be seen in 'ours'.
- In Row 203, what does 'small' mean?
Experiments:
- In the Experiment Section, it looks like the image enhancement methods are only trained on the synthetic image pairs. I think the authors should also compare with pretrained image enhancement methods (for example, using the image enhancement methods to yield clear images for training and testing).
Questions
Please see my review.
Limitations
No potential negative societal impact.
Final Justification
Thanks to the author's detailed answer, which addressed my concerns, I chose to raise my score to 5: Accept.
Formatting Issues
No
We sincerely appreciate your insightful comments. In response, we clarify the low-light setting and evaluate robustness to noise (Q1), clarify the role of regularization and gradient dynamics (Q2), revise visualizations and expressions for improved clarity (Q3–Q4), and compare with pretrained enhancement methods (Q5). Detailed responses are provided below.
Q1: The low-light images are synthesized by gamma correction. However, in real scenes, low-light images are typically accompanied by noise.
A1: Thanks for the insightful comment. We follow standard protocols [a, b, c] using gamma correction to ensure consistency with existing benchmarks.
We agree that real-world low-light images often contain noise. To evaluate robustness under such conditions, we introduce Poissonian-Gaussian noise during testing. As shown in the table below, our method maintains strong performance, indicating good generalization to more realistic low-light scenarios.
| Method | VOC_Dark_Noise_Val | ExDark |
|---|---|---|
| YOLOv8 | 49.7 | 50.0 |
| RetinexformerYOLOv8 | 64.3 | 48.7 |
| IAYOLOv8 | 63.1 | 50.1 |
| GDIPYOLOv8 | 65.8 | 51.5 |
| FeatEnHancerYOLOv8 | 64.9 | 51.9 |
| LR-YOLOv8 (Ours) | 67.8 | 54.8 |
[a] Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions, AAAI 2022
[b] Rethinking Image Restoration for Object Detection, NeurIPS 2022
[c] Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks, ECCV 2024
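For reproducibility, below is a minimal sketch of the test-time corruption, assuming the standard Poissonian-Gaussian model (signal-dependent shot noise plus additive read noise) on top of gamma darkening; the parameter values shown are illustrative, not the exact settings used in our experiments.

```python
import numpy as np

def synthesize_low_light(img, gamma=2.5, k_poisson=0.01, sigma_gauss=0.01, seed=0):
    """Darken a clean image with gamma correction, then add signal-dependent
    Poissonian-Gaussian noise. `img` is float32 in [0, 1], HxWx3."""
    rng = np.random.default_rng(seed)
    dark = np.power(img, gamma)                       # gamma darkening, gamma > 1
    # Poisson (shot) noise: variance scales with signal intensity
    shot = rng.poisson(dark / k_poisson) * k_poisson
    # Gaussian (read) noise: signal-independent
    noisy = shot + rng.normal(0.0, sigma_gauss, size=dark.shape)
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)
```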
Q2-1: It seems that the detection loss can also suppress the Lipschitz constant, as its gradient is generally aligned with the Jacobian norm. Suppressing the Lipschitz constant doesn't seem to be solely driven by the restoration loss.
A2-1: The effect of the detection loss on Lipschitz suppression is inconsistent. While it can occasionally reduce the Lipschitz constant when its gradient happens to align with the direction that shrinks the Jacobian norm, object detection involves non-smooth components such as bounding-box matching, which often introduce sharp gradient changes. These can instead increase the Lipschitz constant and destabilize training.
In contrast, the restoration loss plays a primary and consistent role in suppressing the Lipschitz constant by providing smooth, stable gradients. As shown in Figure 5, its inclusion lowers and stabilizes both Jacobian and gradient norms.
We will add this discussion to Line 206 to clarify the effect of the detection loss.
Q2-2: I am concerned whether the gradient-norm penalty alone is enough to harmonize the instability.
A2-2: We clarify that the gradient-norm penalty enforces Lipschitz regularization by promoting smoothness in the parameter space. When applied alone, it already improves performance over the baseline; for example, YOLOv10 improves from 46.0 to 47.2 mAP on RTTS.
Furthermore, when combined with input-space regularization via the restoration loss, performance further increases to 49.2 mAP, a further gain of 2.0 points. This indicates the complementarity of the two regularizers: the gradient-norm penalty stabilizes the parameter space, while the restoration loss smooths the input-space supervision.
Figure 5 also supports this: LR-YOLOv8 (ours) achieves lower Jacobian norms than LR-YOLOv8* (which uses only the restoration loss), leading to more stable training.
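For clarity, here is a minimal PyTorch-style sketch of how the two regularizers combine in one training step. All module and function names (backbone, det_head, res_head, the loss functions) are illustrative placeholders rather than our exact implementation; the default coefficients follow the best-performing values in our coefficient ablation (10 and 0.01).

```python
import torch

def training_step(backbone, det_head, res_head, images, targets, clean,
                  det_loss_fn, res_loss_fn, lam_res=10.0, lam_grad=0.01):
    """One step with both regularizers (illustrative sketch)."""
    feats = backbone(images)                        # shared features
    loss_det = det_loss_fn(det_head(feats), targets)
    loss_res = res_loss_fn(res_head(feats), clean)  # input-space smoothing

    # Parameter-space smoothing: penalize the detection-loss gradient norm.
    params = [p for p in backbone.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss_det, params, create_graph=True)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))

    return loss_det + lam_res * loss_res + lam_grad * grad_norm
```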
Q3: Figure 2 (a) does not clearly explain why the purple dot cannot be seen in 'ours'.
A3: The purple and yellow points are nearly overlapping, appearing as dark yellow. To improve clarity, we will update Figure 2 with distinct marker shapes to better distinguish the two points.
Q4: In Row 203, what does 'small' mean?
A4: Thanks for your meticulous comment. We have fixed this typo. By 'small' we refer to the gradient norm of the restoration loss being smaller than that of the detection loss.
Q5: Using the pretrained image enhancement methods to yield clear images for training and testing.
A5: Our method achieves consistently better performance than the pretrained image enhancement methods. As suggested, we include the Transformer-based pretrained method OneRestore [a] and the Diffusion-based pretrained method PromptFix [b] for evaluation. Please see the results in the table below.
| Method | VOC_Haze_Val | RTTS | VOC_Dark_Val | ExDark |
|---|---|---|---|---|
| YOLOv8 | 54.3 | 45.3 | 63.4 | 50.0 |
| OneRestoreYOLOv8 | 62.1 | 38.4 | 68.9 | 51.3 |
| PromptFixYOLOv8 | 78.2 | 49.3 | 63.6 | 50.1 |
| LR-YOLOv8 (Ours) | 83.3 | 53.2 | 71.7 | 54.5 |
[a] OneRestore: A Universal Restoration Framework for Composite Degradation, ECCV 2024
[b] PromptFix: You Prompt and We Fix the Photo, NeurIPS 2024
Q6: No potential negative societal impact.
A6: We discuss potential societal impacts in Appendix G: Broader Impacts. Specifically:
- Improved detection capabilities in low-visibility conditions may enhance public safety but also raise privacy concerns in surveillance applications.
- The proposed method’s effectiveness in challenging environments may attract interest from security-related applications, which, while unintended, highlights the importance of responsible deployment.
Dear Reviewer KmkB,
We appreciate your valuable feedback, which has greatly contributed to improving our work.
Best regards,
Authors
This paper proposes a Lipschitz-regularized object detection framework (LROD) to address the functional mismatch between image restoration and detection networks in adverse conditions. The key innovation lies in analyzing the Lipschitz continuity disparity between restoration (low-Lipschitz) and detection (high-Lipschitz) tasks, which causes instability in traditional cascade frameworks.
Strengths and Weaknesses
Strengths: The Lipschitz continuity analysis in both input and parameter spaces provides theoretical grounding for the instability issues in cascade frameworks. The dual regularization approach is well-motivated and demonstrates superior performance over existing methods. Extensive experiments on multiple datasets with detailed per-class results validate the method's robustness.
Weaknesses:
- Limited Generalization Tests: While haze and low-light conditions are covered, other adverse conditions (rain, snow, mixed weather) mentioned in Limitations are not experimentally validated.
- The claim of "extending to diverse detection architectures" lacks supporting evidence.
Questions
- Limited Generalization Tests: While haze and low-light conditions are covered, other adverse conditions (rain, snow, mixed weather) mentioned in Limitations are not experimentally validated.
- The claim of "extending to diverse detection architectures" lacks supporting evidence.
- I am excited to see this theoretical support between restoration and detection. Actually, I wonder whether some interesting findings also arise in Camouflaged Object Detection. These relevant references are encouraged to be discussed; [2] also mentions that high resolution is beneficial for object detection:
[1] Camouflaged Object Detection;
[2] High-resolution Iterative Feedback Network for Camouflaged Object Detection
Limitations
yes
Final Justification
I lean toward accepting this paper, as it addresses all my concerns.
Formatting Issues
N/A
Thank you very much for the insightful comments. In response, we extend our evaluation to diverse degradation types (Q1), validate our framework on additional detection architectures (Q2), and discuss its relevance to camouflaged object detection (Q3). Detailed responses are provided below.
Q1: Generalization to other adverse conditions (rain, snow, mixed weather)
A1: Our method naturally extends to a range of adverse conditions. To validate this, we extended our evaluation to motion blur, rain, snow, and mixed degradations (e.g., haze + rain).
Across all settings, our approach consistently outperforms existing baselines. Please refer to the results provided in the tables below for detailed comparisons.
We construct new training and validation sets under each degradation setting. All models are retrained on the respective training sets and evaluated on the corresponding validation sets.
| Method | Motion Blur | Rain | Snow | Haze + Rain |
|---|---|---|---|---|
| YOLOv8 | 50.8 | 53.1 | 60.8 | 50.1 |
| ConvIRYOLOv8 | 80.1 | 79.9 | 80.5 | 79.2 |
| IAYOLOv8 | 79.6 | 79.9 | 80.3 | 78.0 |
| GDIPYOLOv8 | 80.1 | 79.6 | 80.4 | 78.3 |
| FeatEnHancerYOLOv8 | 80.2 | 79.6 | 79.6 | 78.9 |
| LR-YOLOv8 (Ours) | 82.3 | 82.5 | 83.0 | 81.6 |
The above results will be included in Section 5.3.
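As an illustration of how such degraded sets can be constructed, the sketch below synthesizes haze with the standard atmospheric scattering model, the formulation commonly used for VOC_Haze-style benchmarks; the beta and airlight values are illustrative, and this is a generic sketch rather than our exact pipeline.

```python
import numpy as np

def add_haze(img, depth, beta=1.0, airlight=0.9):
    """Atmospheric scattering model I = J * t + A * (1 - t), with
    transmission t = exp(-beta * depth). `img` is float in [0, 1] (HxWx3),
    `depth` a per-pixel depth map; beta and airlight are illustrative."""
    t = np.exp(-beta * depth)[..., None]          # HxW -> HxWx1
    hazy = img * t + airlight * (1.0 - t)
    return np.clip(hazy, 0.0, 1.0).astype(np.float32)
```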
Q2: Extend to diverse detection architectures.
A2: Our LROD framework is both compatible with and effective across different detection paradigms. During the rebuttal, we applied it to the shared backbone of both the transformer-based RT-DETR [a] and the two-stage Faster R-CNN [b]. In both cases, LROD remains consistently effective.
Please refer to the detailed results presented in the tables below.
All models are trained on VOC_Haze_Train and evaluated on both the synthetic dataset VOC_Haze_Val and the real-world dataset RTTS.
- 1. Transformer-based detector RT-DETR:
| Method | VOC_Haze_Val | RTTS |
|---|---|---|
| RTDETR | 51.5 | 43.7 |
| ConvIRRTDETR | 76.0 | 43.7 |
| IARTDETR | 76.8 | 43.6 |
| GDIPRTDETR | 72.6 | 43.5 |
| FeatEnHancerRTDETR | 73.3 | 42.4 |
| LR-RTDETR | 78.9 | 45.1 |
- 2. Two-stage detector Faster R-CNN Results:
| Method | VOC_Haze_Val | RTTS |
|---|---|---|
| FasterRCNN | 69.1 | 43.2 |
| ConvIRFasterRCNN | 78.5 | 44.1 |
| IAFasterRCNN | 78.6 | 41.3 |
| GDIPFasterRCNN | 76.6 | 44.5 |
| FeatEnHancerFasterRCNN | 77.7 | 39.4 |
| LR-FasterRCNN | 80.2 | 45.9 |
We will include the above experiments in Section 5.3.
[a] DETRs Beat YOLOs on Real-time Object Detection, CVPR 2024
[b] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NeurIPS 2015
Q3: Whether some interesting findings also arise in Camouflaged Object Detection [1] ... high resolution is beneficial for object detection [2]
A3: Thanks for raising this discussion. We believe that our analysis on the Lipschitz continuity disparities between restoration and detection networks extends naturally to camouflaged object detection (COD), where similar challenges in boundary stability arise.
In particular, COD involves detecting objects with ambiguous, low-contrast boundaries, posing challenges similar to those in cascaded systems under adverse conditions. Our findings suggest that instability around object boundaries in COD may stem from the non-smooth behavior of detection networks. The high-resolution preservation enabled by restoration modules can help address this, as fine edge and texture details are essential for accurate localization. This observation is consistent with [2], which highlights the importance of high-resolution features in improving COD performance.
We will incorporate the above discussion and include the related works [1, 2] in Lines 320-325.
[1] Camouflaged Object Detection
[2] High-resolution Iterative Feedback Network for Camouflaged Object Detection
Thanks for the authors' feedback. I lean toward accepting this paper and hope it will bring more theoretical analysis to our community.
Dear Reviewer AYoA,
Thank you for your positive feedback and support. We appreciate your suggestions for improving our paper.
Kind regards,
The Authors
This paper investigates the instability in cascaded image restoration and object detection pipelines under adverse conditions (e.g., haze, low light), attributing the issue to a mismatch in Lipschitz continuity between the two tasks. Restoration networks exhibit smooth, continuous mappings, whereas detection networks are highly sensitive and discontinuous. To address this, the authors propose Lipschitz-Regularized Object Detection (LROD), a framework that integrates low-Lipschitz restoration directly into the detection backbone and applies parameter-space smoothing to stabilize the training process. Implemented as LR-YOLO on top of YOLO detectors, the method improves detection robustness, stability, and accuracy across challenging benchmarks, demonstrating its effectiveness and efficiency over traditional cascaded or jointly trained approaches.
Strengths and Weaknesses
Strengths: The paper presents a well-motivated and original perspective on the instability of cascaded image restoration and object detection systems by analyzing their functional mismatch through the lens of Lipschitz continuity. The theoretical analysis is thorough and clearly articulated, spanning both input and parameter spaces, which enhances the paper's clarity and depth. The proposed solution—Lipschitz-Regularized Object Detection (LROD)—is simple yet effective, demonstrating strong empirical results across multiple benchmarks and outperforming existing cascaded and joint training baselines. The integration into YOLO detectors also highlights the practical significance and deployability of the method in real-time applications.
Weaknesses: While the paper provides a compelling analysis and effective solution, it focuses exclusively on YOLO-based detectors, thereby limiting its generalizability to other detection paradigms, such as Transformer-based models (e.g., DETR). Additionally, the paper lacks an ablation on alternative regularization strategies or deeper architectural variants. The experimental scope, while strong in haze and low-light conditions, could be broadened to include other forms of degradation (e.g., motion blur, rain, snow, or a combination of these), further validating the robustness and versatility of the approach.
Questions
- Generality Beyond YOLO-based Detectors: The proposed method is closely connected to the YOLO architecture. Have you tried or considered applying LROD to other object detection frameworks, such as Transformer-based models (e.g., DETR, RT-DETR) or two-stage detectors like Faster R-CNN? Are there any gradient flow or optimization issues with the models mentioned above?
- Clarification of Restoration Loss Contribution During Inference: Is the restoration module used during inference, or is it only for training regularization? If it’s discarded after training, how does its supervision affect feature stability during inference?
- Ablation on Alternative Regularization Strategies: What are the primary advantages of this method compared to other regularization techniques or some adversarial training methods?
Limitations
yes
Final Justification
Thank you for the strong submission and the well-prepared rebuttal. The paper addresses an important and challenging problem with a clear motivation, novel methodology, and good experimental validation. I appreciate the work and the clarity in both the paper and the rebuttal.
Formatting Issues
No
Thank you very much for the constructive comments. We validate the effectiveness of our framework across different detector paradigms (Q1-1) and degradation scenarios (Q3), and conduct ablation studies on regularization strategies (Q2-1) and architectural designs (Q2-2). Below, we provide our point-by-point responses.
Q1-1: Generalizability to other detection paradigms (e.g., DETR, RT-DETR) or two-stage detectors like Faster R-CNN?
A1-1: Our LROD framework is both compatible with and effective across other detection paradigms. Following the suggestion, we applied it to the shared backbone of both the transformer-based RT-DETR and the two-stage Faster R-CNN. In both cases, LROD remains effective, as demonstrated by the results reported in the tables below.
All models are trained on VOC_Haze_Train and evaluated on both the synthetic dataset VOC_Haze_Val and the real-world dataset RTTS.
- 1. Transformer-based detector RT-DETR:
| Method | VOC_Haze_Val | RTTS |
|---|---|---|
| RTDETR | 51.5 | 43.7 |
| ConvIRRTDETR | 76.0 | 43.7 |
| IARTDETR | 76.8 | 43.6 |
| GDIPRTDETR | 72.6 | 43.5 |
| FeatEnHancerRTDETR | 73.3 | 42.4 |
| LR-RTDETR | 78.9 | 45.1 |
- 2. Two-stage detector Faster R-CNN Results:
| Method | VOC_Haze_Val | RTTS |
|---|---|---|
| FasterRCNN | 69.1 | 43.2 |
| ConvIRFasterRCNN | 78.5 | 44.1 |
| IAFasterRCNN | 78.6 | 41.3 |
| GDIPFasterRCNN | 76.6 | 44.5 |
| FeatEnHancerFasterRCNN | 77.7 | 39.4 |
| LR-FasterRCNN | 80.2 | 45.9 |
We will include the above experiments in Section 5.3.
Q1-2: Any gradient flow or optimization issues with the models mentioned above?
A1-2: We observe that RT-DETR and Faster R-CNN exhibit similar gradient flow and optimization issues to those in YOLO. Both models suffer from sharp gradient transitions and unstable optimization behavior. Our proposed regularization facilitates smoother gradient propagation, and the overall training process is more stable.
This is evidenced by the upper bounds on the Lipschitz constant, as presented in the table below.
| Method | FasterRCNN | ConvIRFasterRCNN | LR-FasterRCNN | RTDETR | ConvIRRTDETR | LR-RTDETR |
|---|---|---|---|---|---|---|
| Upper Bound on Lipschitz Constant | 287.6 | 691.7 | 174.3 | 246.3 | 619.4 | 155.6 |
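For reference, a simpler, looser alternative to certified estimation is the classical product-of-spectral-norms bound, sketched below for intuition; certified methods such as LipSDP (cited in our response to Reviewer 419p) give tighter bounds. The sketch assumes a chain of linear/conv layers with 1-Lipschitz activations and treats each kernel as a flattened matrix, as in spectral-normalization practice.

```python
import torch
import torch.nn.functional as F

def spectral_norm(weight, n_iter=50):
    """Largest singular value of a weight tensor flattened to 2-D,
    estimated by power iteration."""
    w = weight.detach().reshape(weight.shape[0], -1)
    v = torch.randn(w.shape[1], device=w.device)
    for _ in range(n_iter):
        u = F.normalize(w @ v, dim=0)
        v = F.normalize(w.T @ u, dim=0)
    return torch.dot(u, w @ v).item()

def lipschitz_upper_bound(model):
    """Classical (loose) upper bound: the product of per-layer spectral norms."""
    bound = 1.0
    for m in model.modules():
        if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            bound *= spectral_norm(m.weight)
    return bound
```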
Q2-1: Ablation on alternative regularization strategies.
A2-1: During the rebuttal, we conducted an ablation study comparing our method with two alternative regularization strategies: Spectral Norm Regularization (SNR) [a] and the adversarial training method Projected Gradient Descent (PGD) [b].
Our approach consistently outperforms both SNR and PGD, as shown in the table below.
All models are trained on VOC_Haze_Train and evaluated on RTTS to assess out-of-domain generalization.
| Method | Baseline | SNR [a] | PGD [b] | Ours |
|---|---|---|---|---|
| RTTS | 49.3 | 50.1 | 40.8 | 53.2 |
[a] Spectral Normalization for Generative Adversarial Networks, ICLR 2018
[b] Towards Deep Learning Models Resistant to Adversarial Attacks, ICLR 2018
We further discuss the advantages over the two regularization techniques as follows,
First, compared to SNR, which constrains weights globally, our method penalizes the gradient norm in parameter space, reducing output sensitivity to parameter changes and enabling input-aware smoothness. Further, the restoration network promotes feature smoothness aligned with detection.
Second, compared to adversarial training (e.g., PGD), which requires generating perturbed inputs and increases training cost, our approach achieves implicit robustness without adversarial examples, resulting in more stable and efficient training. Moreover, while PGD often compromises performance on clean images, our method maintains accuracy on both clean and degraded inputs.
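For context, a minimal sketch of the standard PGD inner loop [b] illustrates the extra cost: each training batch requires several additional forward/backward passes just to craft the perturbed input, whereas our regularizer needs none. Parameter values here are the common defaults, not tuned settings.

```python
import torch

def pgd_perturb(model, loss_fn, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard PGD attack used in adversarial training [b]: `steps` extra
    forward/backward passes per batch, unlike implicit regularization."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()
```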
Q2-2: Ablation on deeper architectural variants.
A2-2: To explore the impact of backbone sharing depth, we conducted an ablation study using different configurations of shared encoder stages between the detection and restoration heads.
Our method shares the first three stages (F1–F3), aiming to strike a balance between feature smoothness and task-specific specialization. As shown in the table below, shallower sharing (F1–F2) limits regularization, while deeper sharing (F1–F4) introduces task interference, supporting our design choice of the first three stages (F1–F3).
All strategies are trained on VOC_Haze_Train and evaluated on RTTS to assess out-of-domain generalization.
| Stage Sharing | Baseline | F1,F2 | F1,F2,F3 (Ours) | F1,F2,F3,F4 |
|---|---|---|---|---|
| RTTS | 49.3 | 51.9 | 53.2 | 52.8 |
The above two ablation studies will be included in Section 5.3.
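For concreteness, the layout below sketches the F1-F3 sharing scheme ablated above; every submodule is a placeholder (the real stages are the detector's backbone blocks), so this illustrates the wiring rather than our exact architecture.

```python
import torch.nn as nn

class SharedBackboneLROD(nn.Module):
    """Restoration and detection share stages F1-F3; F4 stays
    detection-specific. All submodules are illustrative placeholders."""
    def __init__(self, f1, f2, f3, f4, det_head, res_head):
        super().__init__()
        self.f1, self.f2, self.f3, self.f4 = f1, f2, f3, f4
        self.det_head = det_head      # consumes detection-specific F4 features
        self.res_head = res_head      # decodes a restored image from shared F3

    def forward(self, x):
        s3 = self.f3(self.f2(self.f1(x)))    # last shared stage (F1-F3)
        detections = self.det_head(self.f4(s3))
        restored = self.res_head(s3)         # restoration branch off shared features
        return detections, restored
```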
Q3: Include other forms of degradation (e.g., motion blur, rain, snow, or a combination of these).
A3: As suggested, we evaluated our method on additional degradation types, including motion blur, rain, snow, and haze–rain mixtures.
Across all scenarios, our method consistently outperforms existing approaches, demonstrating its versatility and robustness under diverse adverse conditions, as shown in the tables below.
We construct new training and validation sets under each degradation setting. All models are retrained on the respective training sets and evaluated on the corresponding validation sets.
| Method | Motion Blur | Rain | Snow | Haze + Rain |
|---|---|---|---|---|
| YOLOv8 | 50.8 | 53.1 | 60.8 | 50.1 |
| ConvIRYOLOv8 | 80.1 | 79.9 | 80.5 | 79.2 |
| IAYOLOv8 | 79.6 | 79.9 | 80.3 | 78.0 |
| GDIPYOLOv8 | 80.1 | 79.6 | 80.4 | 78.3 |
| FeatEnHancerYOLOv8 | 80.2 | 79.6 | 79.6 | 78.9 |
| LR-YOLOv8 (Ours) | 82.3 | 82.5 | 83.0 | 81.6 |
The above results will be included in Section 5.3.
Q4: Clarification of Restoration Loss Contribution During Inference: Is the restoration module used during inference, or is it only for training regularization? If it’s discarded after training, how does its supervision affect feature stability during inference?
A4: The restoration and detection networks share a common backbone. During inference, since the primary task is object detection, the restoration head can be removed without affecting detection performance. Furthermore, if both detection results and restored images are desired, the restoration head can be retained to generate the restored outputs.
We clarify that the primary role of restoration supervision is during training, which helps suppress the model’s sensitivity to input perturbations, thereby constraining the Lipschitz constant of the detection network (Figure 5) and enhancing both its robustness and feature stability at inference time (Table 4).
Dear Reviewer te96,
Thank you for your service in reviewing for NeurIPS 2025. What do you think of the authors' rebuttal? Did they address your concerns? Could you please kindly share your further opinions? Thank you.
Best regards, Your AC
Dear authors,
I reviewed your rebuttal several days ago and apologize for the delay in providing confirmation. The rebuttal has addressed my questions well, and I appreciate your thorough efforts in preparing it. Everything looks good from my side—thank you again for the thoughtful response. I will maintain my positive ratings for your submission.
Best, Reviewer
Dear Reviewer te96,
Thank you for your positive feedback and constructive suggestions. We are pleased that our responses addressed your concerns.
Best regards,
Authors
Generally, in adverse visual conditions (like haze or low light), it is common to apply image restoration before object detection. But this cascade often suffers from instability, where small artifacts from restoration get amplified in detection. The paper investigates this issue through Lipschitz continuity, showing that the mismatch in smoothness between restoration and detection leads to poor performance. The authors had the following key observations: i) Restoration networks are low-Lipschitz (smooth). ii) Detection networks are high-Lipschitz (non-smooth). iii) Cascading the two causes gradient amplification and instability in both input and parameter space. Therefore, they proposed a unified framework called LROD (Lipschitz-Regularized Object Detection) that integrates image restoration within the detector’s feature extraction stage and adds parameter-space regularization to smooth training dynamics. This results in LR-YOLO, an enhanced YOLO-based detector that outperforms traditional cascades under adverse conditions. The authors argue that LR-YOLO achieves higher mAP scores, better training stability, and lower Lipschitz constants.
Strengths and Weaknesses
Strengths: The paper analyzes restoration-detection instability through Lipschitz continuity, which is both novel and mathematically sound. According to the authors, the LROD framework is a lightweight modification with low computational overhead. Moreover, the authors provide analytical derivations in both input and parameter spaces, including smoothness bounds and optimization stability. Personally, I like its clear architectural design, and that it is designed to be plug-and-play for most YOLO variants.
Weaknesses: i) It is nice that the authors have shown extensive evaluation on synthetic and real-world datasets. However, the focus is only on haze and low-light scenarios; other degradation types (rain, motion blur, snow, etc.) are mentioned only as future work. It would be interesting to see how well this framework generalizes to other conditions and whether the differences between these condition types are clearly stated. ii) Empirical Jacobian norms are used to estimate Lipschitz continuity, which may be noisy or dataset-specific. A discussion of estimation variance or robustness would be preferred. iii) Although the two-part regularization is evaluated, the backbone-sharing choice and the stage-wise contribution of restoration to detection aren't dissected in detail. iv) While Figure 6 helps, more clarity on how gradients flow jointly through the detection and restoration heads would help readers understand the training interaction.
Questions
i) Can there be a discussion of how the framework extends to more adverse scenarios, as mentioned in the future directions? If it is not extendable, what are the limitations? ii) An analysis of adversarial robustness is preferred, e.g., in the presence of Gaussian noise. iii) For the regularization coefficients, it would be better to include a sensitivity analysis showing how they affect mAP. iv) It would be good to also include non-YOLO detectors as baselines in the experiments, even though they are not fully studied.
Limitations
The authors have listed a few limitations of their work; however, I don't see a discussion of potential negative societal impact.
Final Justification
The authors have provided clear answers to address the concerns I raised earlier about the sensitivity of Jacobian, and have added more experiments/baselines to further address the robustness of the model performance. Therefore, I agreed to increase my rating by 1.
Formatting Issues
No.
We sincerely appreciate your constructive comments. In response, we extend our evaluation to diverse degradations (Q1), provide analyses on Lipschitz estimation and robustness (Q2), conduct ablations on regularization and architecture (Q3), and validate generalizability across detectors (Q4). Our detailed point-by-point responses are provided below.
Q1: Extend to other degradation types (rain, motion blur, snow, etc.).
A1: Our method effectively extends to a variety of adverse conditions, including motion blur, rain, and snow. As shown in the table below, our method consistently outperforms existing approaches across all these scenarios.
We construct new training and validation sets under each degradation setting. All models are retrained on the respective training sets and evaluated on the corresponding validation sets.
| Method | Motion Blur | Rain | Snow | Haze + Rain |
|---|---|---|---|---|
| YOLOv8 | 50.8 | 53.1 | 60.8 | 50.1 |
| ConvIRYOLOv8 | 80.1 | 79.9 | 80.5 | 79.2 |
| IAYOLOv8 | 79.6 | 79.9 | 80.3 | 78.0 |
| GDIPYOLOv8 | 80.1 | 79.6 | 80.4 | 78.3 |
| FeatEnHancerYOLOv8 | 80.2 | 79.6 | 79.6 | 78.9 |
| LR-YOLOv8 (Ours) | 82.3 | 82.5 | 83.0 | 81.6 |
The above results will be included in Section 5.3.
Q2-1: Empirical Jacobian norms estimate Lipschitz continuity, which may be noisy or dataset-specific ... Discussion on the Lipschitz constant estimation variance.
A2-1: We agree that empirical Jacobian norms can be sensitive to noise and data distribution. To reduce this effect, we follow prior work [a, b] and compute norms over many randomly sampled validation inputs, which helps stabilize estimates and limit dataset-specific bias.
[a] Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach, ICLR 2018
[b] Some Fundamental Aspects about Lipschitz Continuity of Neural Networks, ICLR 2024
Furthermore, we include an analysis of the estimation variance and report confidence intervals to assess robustness in the table below. To complement the empirical approach, we also compute certified upper bounds using the dataset-independent LipSDP method [c]. Together, these results validate the use of Jacobian norms as a stable proxy for estimating Lipschitz continuity while accounting for variance and distribution sensitivity.
| Method | Mean | Variance | Confidence Interval | Upper Bounds on Lipschitz Constant |
|---|---|---|---|---|
| YOLOv8 | 19.5 | 18.9 | [17.4,21.7] | 215.9 |
| ConvIRYOLOv8 | 43.4 | 46.2 | [38.3,48.7] | 530.8 |
| LR-YOLOv8 (Ours) | 8.9 | 9.8 | [8.3,9.7] | 136.4 |
[c] Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks, NeurIPS 2019
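For reference, a minimal sketch of the empirical estimation procedure, assuming the model returns a single output tensor; the random-direction probe below lower-bounds the per-sample Jacobian spectral norm, and the function names and batch counts are illustrative.

```python
import torch

@torch.enable_grad()
def jacobian_norm_probe(model, x):
    """Randomized probe of the input-Jacobian norm: ||J^T v|| for a unit
    random direction v lower-bounds the spectral norm ||J||."""
    x = x.clone().requires_grad_(True)
    y = model(x)                                   # assumes a single tensor output
    v = torch.randn_like(y)
    v = v / v.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, *([1] * (y.dim() - 1)))
    (vjp,) = torch.autograd.grad(y, x, grad_outputs=v)
    return vjp.flatten(1).norm(dim=1)              # one estimate per sample

def estimate_lipschitz(model, loader, n_batches=50, device="cuda"):
    """Largest probe over many randomly sampled validation inputs, which
    stabilizes the estimate and limits dataset-specific bias."""
    model.eval()
    worst = 0.0
    for i, (x, _) in enumerate(loader):
        if i >= n_batches:
            break
        worst = max(worst, jacobian_norm_probe(model, x.to(device)).max().item())
    return worst
```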
Q2-2: Analysis about adversarial robustness is preferred.
A2-2: As suggested, we analyze adversarial robustness by evaluating performance under varying levels of Gaussian noise.
Our method maintains higher absolute performance and exhibits the smallest drop from the clean setting, demonstrating stronger resilience to input perturbations. Results under four increasing noise levels (σ1 < σ2 < σ3 < σ4) are provided in the table below.
| Method | Clean | σ1 | σ2 | σ3 | σ4 |
|---|---|---|---|---|---|
| YOLOv8 | 45.3 | 45.1 (-0.2) | 44.1 (-1.2) | 42.3 (-3.0) | 40.1 (-5.2) |
| ConvIRYOLOv8 | 49.3 | 48.7 (-0.6) | 47.5 (-1.8) | 45.3 (-4.0) | 43.3 (-6.0) |
| LR-YOLOv8 (Ours) | 53.2 | 53.1 (-0.1) | 52.2 (-1.0) | 50.4 (-2.8) | 48.5 (-4.7) |
Q3-1: Ablation Study on the Backbone-Sharing Strategy.
A3-1: As suggested, we conducted an ablation study on backbone sharing and found that sharing the first three stages (F1–F3) yields the best balance between stability and task synergy. Shallower sharing (F1–F2) limits regularization, while deeper sharing (F1–F4) introduces interference. This supports our design choice of using the first three stages.
Please refer to the detailed results presented in the tables below.
All strategies are trained on VOC_Haze_Train and evaluated on RTTS to assess out-of-domain generalization.
| Stage Sharing | Baseline | F1,F2 | F1,F2,F3 (Ours) | F1,F2,F3,F4 |
|---|---|---|---|---|
| RTTS | 49.3 | 51.9 | 53.2 | 52.8 |
Q3-2: Ablation Study on the Regularization Coefficients.
A3-2: As suggested, we conducted an ablation study on the regularization coefficients for the input space (restoration loss weight) and the parameter space (gradient-norm penalty weight). The results show that our method consistently outperforms the baseline across a range of coefficient values, with only minor variation in performance.
All settings are trained on VOC_Haze_Train and evaluated on RTTS.
| Setting | Baseline | (1) | (2) | (3) Ours | (4) | (5) |
|---|---|---|---|---|---|---|
| Input-space coefficient | 0 | 10 | 10 | 10 | 20 | 5 |
| Parameter-space coefficient | 0 | 0.005 | 0.02 | 0.01 | 0.01 | 0.01 |
| RTTS | 49.3 | 52.9 | 53.0 | 53.2 | 53.1 | 52.8 |
The above two ablation studies will be included in Section 5.3.
Q4: Include non-YOLO detectors as baselines.
A4: Our LROD framework is both compatible with and effective across other detection paradigms. During the rebuttal, we integrated it into the shared backbone of both the transformer-based RT-DETR [a] and the two-stage Faster R-CNN [b]. In both settings, it consistently enhances performance, as evidenced by the results in the tables below.
All models are trained on VOC_Haze_Train and evaluated on both the synthetic dataset VOC_Haze_Val and the real-world dataset RTTS.
- 1. Transformer-based detector RT-DETR:
| Method | VOC_Haze_Val | RTTS |
|---|---|---|
| RTDETR | 51.5 | 43.7 |
| ConvIRRTDETR | 76.0 | 43.7 |
| IARTDETR | 76.8 | 43.6 |
| GDIPRTDETR | 72.6 | 43.5 |
| FeatEnHancerRTDETR | 73.3 | 42.4 |
| LR-RTDETR | 78.9 | 45.1 |
- 2. Two-stage detector Faster R-CNN Results:
| Method | VOC_Haze_Val | RTTS |
|---|---|---|
| FasterRCNN | 69.1 | 43.2 |
| ConvIRFasterRCNN | 78.5 | 44.1 |
| IAFasterRCNN | 78.6 | 41.3 |
| GDIPFasterRCNN | 76.6 | 44.5 |
| FeatEnHancerFasterRCNN | 77.7 | 39.4 |
| LR-FasterRCNN | 80.2 | 45.9 |
We will include the above experiments in Section 5.3.
[a] DETRs Beat YOLOs on Real-time Object Detection, CVPR 2024
[b] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NeurIPS 2015
Q5: Clarity on how gradients flow jointly through detection and restoration heads.
A5: We clarify that the detection and restoration heads share the first three backbone stages, and their losses jointly update these parameters. The detection loss provides task-specific supervision, while the restoration loss contributes smooth gradients that regularize training. As shown in Figure 5, incorporating the restoration loss (LR-YOLOv8*) stabilizes both Jacobian and gradient norms, highlighting the benefit of joint gradient flow.
We will also include a gradient flow diagram in Figure 6 for clarity.
Q6: The authors have listed a few limitations of their work, however I don't see discussion about potential negative societal impact.
A6: We discuss potential societal impacts in Appendix G: Broader Impacts. Specifically:
- Improved detection capabilities in low-visibility conditions may enhance public safety but also raise privacy concerns in surveillance applications.
- The proposed method’s effectiveness in challenging environments may attract interest from security-related applications, which, while unintended, highlights the importance of responsible deployment.
Thanks for the very detailed response from the authors, which took me some time to read before making my decision. I appreciate that the authors added the suggested experimental results and baselines to further address the model's robustness, which makes the paper more technically solid. I would increase my rating to 5.
Dear Reviewer 419p,
Thank you for your thoughtful feedback and for taking the time to review our responses. We appreciate your updated rating and support.
Best,
The Authors
The paper received all positive recommendations. After the rebuttal, all reviewers acknowledged the strengths of the paper and agreed that the authors had addressed their concerns, reaching a final consensus on a positive rating. The AC agrees with this recommendation and is therefore happy to accept the paper. The authors are required to incorporate the rebuttal and discussion content into the camera-ready version of the paper to address the concerns raised.