User-Instructed Disparity-aware Defocus Control
disparity-aware joint deblurring and reblurring as a proxy of refocusing
Abstract
Reviews and Discussion
This paper introduces the User-Instructed DoF Control (UID) framework, a novel method for interactive depth-of-field manipulation in out-of-focus (OOF) images. The core innovation is enabling users to specify a desired in-focus region by leveraging the Segment Anything Model (SAM) for intuitive object selection. To accurately estimate the spatially varying blur, the framework utilizes dual-pixel data, which provides crucial sub-pixel disparity information for depth inference. The proposed method unifies two related tasks—OOF deblurring and all-in-focus generation—into a cohesive two-stage architecture that shares a common latent representation, allowing for efficient and effective refocusing and image restoration. Experiments validate the framework's design, demonstrating the significant benefits of dual-pixel inputs and the joint-task approach.
Strengths and Weaknesses
Strengths
- Novel User Interaction: The integration of the Segment Anything Model (SAM) for defining the region of interest is a modern and highly effective approach to user interaction. It significantly improves usability over traditional methods like manual scribbles or mask drawing.
- Elegant Unified Framework: The two-stage architecture that jointly handles OOF deblurring and refocusing is a logical and efficient design. Sharing latent information between these highly related tasks is a promising direction.
Weaknesses
- Clarity, Notation, and Figure Legibility: The paper's clarity is significantly hampered by notational inconsistencies and confusing figures. For instance, there is a discrepancy between the notation used in Eqs. 5-7 and that shown in the architecture diagram (Figure 2). Furthermore, the annotations used to describe the methodology in Section 3.1 do not align with those presented in Figure 2. These issues make it exceptionally difficult for the reader to follow the model's architecture and data flow.
- Justification of Core Methodology (Decoder vs. Physics-Based Model): The paper's methodology relies on a learned decoder for the final refocusing step, but the motivation for this choice over a more traditional, physics-based approach is unclear. Given that the framework estimates disparity, this information could be used to model a Point Spread Function (PSF) for refocusing. A PSF-based method offers greater interpretability and may be less prone to the hallucination of textures, a known risk in purely generative models.
- Unclear Design Rationale: The manuscript lacks a clear narrative justifying its architectural and component-level design choices. While the ablation studies demonstrate that each component contributes to the final performance, the paper does not adequately explain why these specific components were chosen over other well-established alternatives in the literature.
- Absence of a Limitations Section: The paper fails to include a discussion of its limitations. The authors should discuss potential failure cases or negative impacts, such as performance on scenes with extreme blur, subjects where SAM segmentation might be ambiguous, or the method's sensitivity to the quality of the dual-pixel input.
- Inadequate Citation Practices: Citations for baseline methods are missing from several tables and figure captions. This forces the reader to search through the text to identify the compared methods, significantly harming readability. For clarity and proper academic credit, all baseline methods must be cited directly within those elements.
- Fairness of Experimental Comparison: The experimental evaluation compares against single-image (SI) and dual-pixel (DP) defocus deblurring methods. Comparing a method that utilizes DP data against those that use only a single image is not a fair assessment of state-of-the-art performance. The paper's central claims would be much stronger if the primary comparison were against other existing methods that also leverage dual-pixel data. Simply cite the SI methods properly in related work and focus more on the ablation studies.
There are significant and concerning discrepancies between the manuscript and the authors' responses in the NeurIPS checklist. This raises serious questions about the submission process.
- On Limitations: Checklist item 2 asserts that limitations are addressed in the future work section of the supplementary material. However, the manuscript lacks a dedicated limitations section, and the "Future Work" section in the main paper is not an adequate substitute for a critical analysis of the current method's weaknesses and failure modes.
- On Data and Code Access: The response to checklist item 5 regarding the public availability of code and data appears inconsistent. The authors should have stated "No" if they intend to publish the code and data only after acceptance.
- On Computational Resources: Regarding checklist item 8, the submission does not detail the computational resources used. While implementation details are provided, information on the hardware (e.g., GPU model), training times, and inference speeds is absent, which is critical for assessing reproducibility and efficiency.
Questions
Regarding Clarity & Methodology
- Could the authors please provide a revised version of Figure 2? The current diagram is difficult to parse due to notational inconsistencies and an unclear depiction of the framework's data flow.
- What is the specific rationale for using a learned decoder over a more interpretable, physics-based Point Spread Function (PSF) model? Could the authors demonstrate or elaborate on the distinct advantages of their chosen approach?
- Could the authors elaborate on the design rationale for their architectural components? For instance, what motivated the choice of specific modules over other common alternatives in the literature?
Regarding Robustness & Limitations
- How robust is the framework to challenging inputs? We ask the authors to provide analysis on failure cases, specifically regarding (a) inaccurate segmentations from SAM, and (b) images with very large amounts of defocus blur.
- Could the authors add the "Limitations" section as indicated in their submission checklist? This section should detail the method's specific failure modes and boundary conditions.
Regarding Submission Compliance
- Could the authors provide the missing details on computational resources (GPU model, training/inference times) as stated in the checklist? This information is essential for reproducibility.
- Could the authors update the checklist item for "Open access to data and code"?
Limitations
No, stated in Weakness 4.
Justification for Final Rating
I would like to confirm that the following points have been effectively addressed and have greatly strengthened the manuscript:
- Methodology & Rationale: The explanation of the invertible network and how it functions within a PSF-based paradigm was a crucial clarification. This resolved my primary misunderstanding regarding the technical foundation of the method. It is now clear that the core contribution is the insightful synthesis of these techniques, motivated by the inverse relationship between deblurring and reblurring.
- Robustness & Reproducibility: The new quantitative experiments on SAM's performance are excellent and provide convincing evidence of the framework's robustness. Furthermore, providing the specific computational details has fully addressed my concerns about reproducibility.
- Promised Revisions: I acknowledge the commitment to revise the paper's clarity, add the missing citations, include a dedicated limitations section, and update the submission checklist. I trust these changes will be incorporated into the final version.
Formatting Issues
No
1. Clarity and Notation.
Thank you for the feedback. We will revise the data flow and notation in the final version to improve readability.
2. Rationale of Method.
Actually, our network still adheres to the PSF-based framework. The refocusing process does not employ an additional decoder; instead, it follows the workflow indicated by the black arrows and orange arrows in Fig. 2.
Black arrows: Given a deblurred image and a user-specified mask of the region of interest, the modified disparity is first estimated for blur kernel retrieval.
Orange arrows: Then, the blur kernel is used to achieve spatially-variant defocus control via the inverse mapping of our invertible network.
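For clarity, a toy sketch of this two-arrow workflow is given below. Every function here is a hypothetical stand-in rather than our actual API, stubbed only so the data flow reads end-to-end and the snippet runs:

```python
import numpy as np

# Hypothetical stubs standing in for the real modules (not our actual code).
def deblur(dp_pair):           return dp_pair.mean(axis=0)
def segment(image, prompt):    return image > image.mean()
def estimate_disparity(pair):  return pair[0] - pair[1]
def retrieve_kernel(disp):     return np.clip(np.abs(disp), 0.0, 1.0)
def inverse_map(image, k):     return image * (1.0 - k)

def refocus(dp_pair, prompt):
    sharp = deblur(dp_pair)                # black arrows: deblurred image...
    mask = segment(sharp, prompt)          # ...plus the user-specified mask
    disp = estimate_disparity(dp_pair)
    disp = np.where(mask, 0.0, disp)       # modified disparity: zero (in focus)
    kernel = retrieve_kernel(disp)         #   inside the region of interest
    return inverse_map(sharp, kernel)      # orange arrows: spatially-variant
                                           #   reblurring via inverse mapping

out = refocus(np.random.rand(2, 64, 64), "a brown dog on the chair")
```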
3. Architectural and Component-level Explanation.
Thank you for your suggestion. Here, we clarify the design motivation behind our architecture and core components.
Architectural Explanation: In essence, our goal is to design a framework for refocusing, a process that can typically be decomposed into two steps: (1) deblurring to obtain a sharp image, and (2) reblurring specific regions according to user needs to achieve the desired defocus effect.
Mathematically, the deblurring process can be modeled as $I = B \ast k^{-1}$, where $B$ is the blurred image, $I$ is the restored image, and $k^{-1}$ is the inverse blur kernel. According to the convolution theorem, we have $\mathcal{F}(I) = \mathcal{F}(B) \cdot \mathcal{F}(k)^{-1}$. Its inverse process can be represented as the reblurring process, $B = I \ast k = \mathcal{F}^{-1}\big(\mathcal{F}(I) \cdot \mathcal{F}(k)\big)$.
We observe that the kernels used in these two processes, $k$ and its inverse, satisfy a reversibility property. This insight motivates us to learn an invertible network with reversible kernels. Intuitively, an effective blur kernel should be capable of both restoring an image (deblurring) and blurring it (reblurring). Therefore, we supervise the network's joint learning using both the reblurring loss and the deblurring loss. By co-training these two tasks within a single network, we achieve refocusing efficiently, reducing training time and lowering the model parameter count.
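This reversibility can be checked numerically. The following minimal NumPy sketch, a simplified single-kernel illustration rather than our spatially-variant implementation, blurs an image in the Fourier domain and recovers it with the inverse kernel:

```python
import numpy as np

def gaussian_kernel(size=9, sigma=1.0):
    """Normalized 2-D Gaussian blur kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

rng = np.random.default_rng(0)
I = rng.random((128, 128))                     # stand-in sharp image
K = np.fft.fft2(gaussian_kernel(), s=I.shape)  # F(k), zero-padded to image size

B = np.real(np.fft.ifft2(np.fft.fft2(I) * K))  # reblurring: F^-1(F(I) . F(k))
eps = 1e-8                                     # guards near-zero frequencies
I_hat = np.real(np.fft.ifft2(np.fft.fft2(B) / (K + eps)))  # deblurring

print(np.abs(I_hat - I).max())                 # tiny residual from eps
```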
Component-level Explanation: As illustrated in Fig. 2, our model follows a common architecture comprising an encoder, a refinement component, and a decoder. Our primary contribution lies within the refinement module, which consists of:
(i) Disparity-aware Feature Convolution: This component retrieves the corresponding kernel $k$ from the kernel dictionary using the estimated disparity. The forward process (deblurring) uses $k$ for modulation, while the inverse process (reblurring) employs its corresponding inverse kernel for modulation.
(ii) Invertible Blocks: To further enhance the representational capacity of the network features and strengthen the network's inherent reversibility, we design invertible blocks, elaborated in Tab. 1 in the appendix. This design ensures strict invertibility of the network (enabling lossless, reversible information flow [3]), where the Jacobian of the output with respect to the input is invertible.
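As a simplified illustration of why such blocks are exactly invertible, consider an additive coupling layer in the spirit of Real NVP [3]; this is a toy sketch, not our actual block design:

```python
import numpy as np

def mlp(x, w, b):
    """Tiny stand-in for the learned transform inside a coupling layer."""
    return np.tanh(x @ w + b)

def coupling_forward(x, w, b):
    x1, x2 = np.split(x, 2, axis=-1)
    y2 = x2 + mlp(x1, w, b)        # only half the channels are updated...
    return np.concatenate([x1, y2], axis=-1)

def coupling_inverse(y, w, b):
    y1, y2 = np.split(y, 2, axis=-1)
    x2 = y2 - mlp(y1, w, b)        # ...so the update can be subtracted exactly
    return np.concatenate([y1, x2], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w, b = rng.standard_normal((4, 4)), rng.standard_normal(4)
y = coupling_forward(x, w, b)
assert np.allclose(coupling_inverse(y, w, b), x)  # lossless round trip
```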
4. Absence of a Limitations Section.
Thank you for the suggestion. We will include a limitations section in the final version.
5. Citation.
Thank you. We will add the missing citations in the tables.
6. Experimental Comparison.
Regarding this point, our experimental table setup follows the conventions established in most prior works [1, 2]. We will revise the formatting according to your recommendations.
7. Could the authors please provide a revised version of Figure 2? The current diagram is difficult to parse due to notational inconsistencies and an unclear depiction of the framework's data flow.
Please refer to Response 1.
8. What is the specific rationale for using a learned decoder over a more interpretable, physics-based Point Spread Function (PSF) model? Could the authors demonstrate or elaborate on the distinct advantages of their chosen approach?
Please refer to Response 2.
9. Could the authors elaborate on the design rationale for their architectural components? For instance, what motivated the choice of specific modules over other common alternatives in the literature?
Please refer to Response 3.
10. How robust is the framework to challenging inputs? We ask the authors to provide analysis on failure cases, specifically regarding (a) inaccurate segmentations from SAM, and (b) images with very large amounts of defocus blur. Could the authors add the "Limitations" section as indicated in their submission checklist? This section should detail the method's specific failure modes and boundary conditions.
(a) Inaccurate segmentations from SAM: The language-based SAM, when guided by specially designed prompts, can generate sufficiently robust masks even on deblurred images. Given that the original SAM model was trained with diverse data augmentation, it exhibits inherent robustness to mild noise or blur. To further improve mask quality in our implementation:
- For the region of interest, we use language-style prompts: number + object color + object name + spatial position (e.g., a brown dog on the chair).
- For masks that remain imperfect, we further apply post-processing such as dilation and erosion operations for refinement (a minimal sketch follows).
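A minimal sketch of this post-processing step, assuming SciPy morphology with illustrative parameter values:

```python
import numpy as np
from scipy import ndimage

def refine_mask(mask, iters=2):
    """Morphological closing: fill small holes and smooth ragged boundaries."""
    m = ndimage.binary_dilation(mask, iterations=iters)  # grow to bridge gaps
    m = ndimage.binary_erosion(m, iterations=iters)      # shrink back to size
    return ndimage.binary_fill_holes(m)                  # fill interior holes

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True
mask[28:31, 28:31] = False        # simulate a small hole in the SAM output
refined = refine_mask(mask)
assert refined[29, 29]            # the hole is filled after refinement
```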
To evaluate the mask quality from our adopted SAM, we manually annotate two objects per image in our collected dataset as ground truth. We compare the average IoU (%):
| 3-click | language (only with object name) | language (number + object color + object name + spatial position) |
|---|---|---|
| 93.6 | 91.8 | 87.5 |
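For reference, the numbers above follow the standard intersection-over-union computation; a minimal sketch:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union between two binary masks, as a percentage."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 100.0                       # both masks empty: perfect match
    return 100.0 * np.logical_and(pred, gt).sum() / union
```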
We will provide failure visualization cases in the limitations section of the final version.
(b) Images with very large amounts of defocus blur: We empirically observe that our adopted SAM functions well on images with large defocus blur (i.e., f/1.8). We sample two groups from DP5K-test, each of which has 5 cases at different f-numbers: f/1.8, f/2.8, f/4.0, and f/5.6. We use the language-based SAM to generate the masks and manually annotate the regions of interest as ground truth.
We calculate their respective IoU (%):
| f/1.8 | f/2.8 | f/4.0 | f/5.6 |
|---|---|---|---|
| 91.3 | 91.6 | 91.8 | 91.8 |
We observe marginal mask degradation. We will include the above analysis and provide more visualizations in our final version. Thanks.
11. Could the authors provide the missing details on computational resources (GPU model, training/inference times) as stated in the checklist? This information is essential for reproducibility.
Our model is trained on one A100 80GB GPU and tested on an RTX A6000.
| dataset | training time (h) | latency (s) |
|---|---|---|
| DP5K-model | 5.02 | 0.64 |
| DPD-disp-model | 20.58 | 0.56 |
12. Could the authors update the checklist item for "Open access to data and code"?
Yes, we will update it. Thanks.
Reference:
[1] Restormer: Efficient Transformer for High-Resolution Image Restoration, CVPR 22
[2] K3DN: Disparity-aware Kernel Estimation for Dual-Pixel Defocus Deblurring, CVPR 23
[3] Density estimation using real NVP, ICLR 2017
Dear Authors,
Thank you for your detailed and thoughtful rebuttal. Your responses have successfully addressed the vast majority of my concerns and have significantly clarified the technical contributions and positioning of your work.
Confirmation of Resolved Issues
I would like to confirm that the following points have been effectively addressed and have greatly strengthened the manuscript:
- Methodology & Rationale: Your explanation of the invertible network and how it functions within a PSF-based paradigm was a crucial clarification. This resolved my primary misunderstanding regarding the technical foundation of your method. It is now clear that the core contribution is the insightful synthesis of these techniques, motivated by the inverse relationship between deblurring and reblurring. To further strengthen the paper, I encourage you to ensure this well-articulated rationale is clearly presented in the final manuscript.
- Robustness & Reproducibility: The new quantitative experiments on SAM's performance are excellent and provide convincing evidence of the framework's robustness. Furthermore, providing the specific computational details has fully addressed my concerns about reproducibility.
- Experimental Comparisons: Thank you for the clarification regarding the experimental setup. My initial suggestion was aimed at improving the focus of the results by de-emphasizing the Single-Image baselines. I acknowledge your commitment to reformat the results based on my suggestion, and I now consider this point resolved.
- Promised Revisions: I acknowledge your commitment to revise the paper's clarity, add the missing citations, include a dedicated limitations section, and update the submission checklist. I trust these changes will be incorporated into the final version.
Overall, you have provided an excellent rebuttal that successfully addresses the critical issues. Based on these significant improvements, I will update my score to borderline accept and look forward to seeing the final version of the manuscript.
Many thanks for all the helpful comments and positive assessments.
We really appreciate reviewer 2GJ1's upgrading of the score. Thank you again for your valuable comments and time. We will ensure that the well-articulated rationale of the method, the missing citations, and a dedicated limitations section are clearly presented in the final version to enhance clarity and completeness.
The paper introduces User-instructed Disparity-aware Defocus control, a framework that lets everyday photographers refocus still images after capture by simply pointing, boxing, or describing the region they want sharp. Leveraging the tiny stereo baseline of dual-pixel sensors found in many DSLR-style and smartphone cameras, UiD learns an invertible network that couples deblurring with reblurring in a single model. During training, it establishes a disparity-aware feature space and a set of spatially varying blur kernels so that the same kernels can be inverted to go from blurred to sharp or vice-versa. At test time, a user-prompted mask modulates the learned disparity features, enabling seamless refocusing that keeps selected regions crisp while realistically blurring others. Extensive experiments on public DP datasets and a new real-world DP benchmark show UiD achieves state-of-the-art deblurring quality, flexible and visually convincing refocus effects, and lower computation than prior DP or single-image methods.
Strengths and Weaknesses
[Strengths]
- UiD accepts text, box, or point prompts and turns them into masks with Segment Anything, so anyone can selectively keep subjects sharp while blurring the rest after capture—no depth maps or manual masking needed.
- Its invertible disparity-aware network couples deblurring and reblurring with shared spatially varying kernels, delivering state-of-the-art PSNR/SSIM on DP5K and DPD-blur while using the smallest compute budget among dual-pixel methods.
[Weaknesses]
- The paper's main weakness is that its contribution is confined to dual-pixel imagery. There are also deblurring and defocus models [1, 2, 3, 4] that operate on stereo images, which are closely related. Whereas dual-pixel sensors supply only an ultra-small parallax of roughly ±0.5 px, a stereo camera can adjust its baseline to achieve disparities ranging from tens to hundreds of pixels, an advantage that improves both depth accuracy and motion estimation. While the paper rightly targets the typical mobile-device scenario, additional analysis is needed before the method can be extended to settings such as robotics or autonomous driving.
- The method turns user clicks/boxes into masks with Segment Anything; any over- or under-segmentation directly skews the disparity modulation and produces visible artifacts, yet the paper offers no quantitative analysis of this failure mode.
[1] Pan, Liyuan, et al. "Joint stereo video deblurring, scene flow estimation and moving object segmentation." IEEE Transactions on Image Processing 29 (2019): 1748-1761.
[2] Zhou, Shangchen, et al. "Davanet: Stereo deblurring with view aggregation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[3] Sellent, Anita, Carsten Rother, and Stefan Roth. "Stereo video deblurring." Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II. Springer International Publishing, 2016.
Questions
- The paper's main practical limitation is its DP-only scope. Showing either positive evidence of transfer or a principled explanation of the failure modes would greatly improve confidence that the approach is not narrowly engineered. If possible, could you show the results on a more general stereo dataset with larger disparities?
- For an interactive refocus tool the user experience hinges on latency. Concrete numbers would substantiate the "smallest compute budget" claim. The paper highlights low FLOPs, but latency on mobile hardware is not reported. Can you provide inference time on a smartphone SoC or Apple CPU?
Limitations
"yes" However, the main text does not include a separate “Limitations” section; instead, potential improvements (such as the application of a diffusion framework in future research) are indirectly addressed in the Conclusion/Future Work.
Justification for Final Rating
I have thoroughly reviewed the comments from all reviewers and the authors' rebuttal. I appreciate the authors for presenting various experimental results in such a short period, which has addressed most of my concerns.
Formatting Issues
No major formatting issues noted.
1. Confined to Dual-Pixel Imagery
Our method is primarily designed for defocus control in static scenes. It is worth noting that our focus is not limited to deblurring and reblurring separately, but rather on unifying both processes to serve the purpose of refocusing. The works you mentioned, such as [1, 2], primarily address motion deblurring, which is indeed a different task from ours. It is well known that the crux of refocusing lies in accurately estimating the target disparity. In theory, our method can be extended from dual-pixel to stereo matching, as the disparity estimation network in our approach (mentioned in line 155) essentially functions as a stereo matching network. This network shares a similar architecture with those in [3, 4] without pretraining, and this structural prior would theoretically enable it to cover the larger disparities produced by stereo data. Considering that there is currently no dataset specifically designed for stereo image refocusing, we will consider exploring this direction in our future work.
2. Mobile Efficiency
Thank you for your question. Currently, we mainly focus on algorithm design and have not yet supported deployment on mobile devices. We recorded the average inference time on a single A6000 GPU using our self-collected dual-pixel image pair dataset, with an image resolution of 1152 × 1792. The average time is SAM + refocusing = 0.45 s + 0.13 s = 0.58 s.
3. Evaluation on SAM
The language-based SAM, when guided by specially designed prompts, can generate sufficiently robust masks even on deblurred images. Given that the original SAM model was trained with diverse data augmentation, it exhibits inherent robustness to mild noise or blur. To further improve mask quality in our implementation:
- For the region of interest, we use language-style prompts: number + object color + object name + spatial position (e.g., a brown dog on the chair).
- For masks that remain imperfect, we further apply post-processing such as dilation and erosion operations for refinement.
To evaluate the mask quality from our adopted SAM, we manually annotate two objects per image in our collected dataset as ground truth. We compare the average IoU (%):
| 3-click (point and box) | language (only with object name) | language (number + object color + object name + spatial position) |
|---|---|---|
| 93.6 | 91.8 | 87.5 |
4. Limitations Section
Thanks. We will add a limitations section and discuss potential improvements with a diffusion framework in our final version.
Reference:
[1] Joint Stereo Video Deblurring, Scene Flow Estimation and Moving Object Segmentation, TIP 2019
[2] Stereo Video Deblurring, ECCV 2016
[3] Hierarchical Neural Architecture Search for Deep Stereo Matching, NeurIPS 20
[4] Dual Pixel Exploration: Simultaneous Depth Estimation and Image Restoration, CVPR 21
I would like to thank the authors for their response to my questions and I will take the response into account when making the final rating. I do not have any further questions at the moment.
Thank you once again for your time and constructive comments.
We would be happy to provide further explanations or clarifications if you have any further questions. Thanks.
The authors present UiD, a novel framework for user-instructed depth-of-field (DoF) manipulation. Users can use the system's point, box, or text prompts to refocus images. The main concept is to use disparity and defocus cues taken from DP image pairs to jointly model invertible deblurring and reblurring processes. Realistic refocusing using user-specified masks derived from SAM is made possible by the method's estimation of disparity-aware defocus kernels. When compared to multiple baselines, the paper shows better results in both synthetic refocusing and defocus deblurring.
Strengths and Weaknesses
Strengths
- Novel formulation: Treats refocusing as an invertible problem that connects deblurring and reblurring in a mathematically grounded framework.
- Dual-pixel leverage: Effectively utilizes stereo disparity from DP sensors, offering better depth-aware refocusing.
- User control: Enables intuitive control using natural prompts, supported by prompt-based segmentation model.
Weaknesses:
- The framework requires dual-pixel (DP) input, which restricts deployment to DP-enabled cameras. It is unclear how the method performs on monocular or stereo-only inputs.
- UiD uses SAM for segmentation masks but does not evaluate the impact of mask quality or conduct robustness tests on prompt errors or segmentation failures.
- While the method emphasizes user control and artistic intent, it lacks perceptual or user experience studies to validate subjective quality or usability.
Questions
- What happens if the mask produced by SAM is imprecise or unclear? Does the system generate invalid defocus regions, or does it degrade gracefully?
- How robust is your method to different DP hardware calibrations or disparities with high noise?
- What is the need for invertibility? When compared to standard encoding-decoding, can you measure the improvement in results?
- Is the method fragile to segmentation boundary errors or small masks?
- Is performance degraded with imprecise or noisy user prompts?
Limitations
- Assumes access to DP sensor data — not all phones/cameras offer this.
- Refocusing depends on SAM segmentation, but SAM is not evaluated as part of the system.
Justification for Final Rating
I would like to thank the authors for their response to my questions, and I maintain my initial rating.
Formatting Issues
Some references are duplicated (e.g., [3] and [2] appear to refer to the same work).
We sincerely thank your comments, and we give point-to-point below.
1. Consideration on Monocular or Stereo-only Inputs
Thanks. As the adopted disparity estimation network in our approach shares a similar architecture with [1, 2] (stereo setting), its structural prior could theoretically handle stereo-only inputs that share a similar pattern with DP inputs. Regarding the monocular setting, where depth or disparity is hard to estimate, a possible solution is to leverage powerful pretrained depth/disparity models such as DepthAnything [3]. This can be further discussed and explored in our future work.
2. Evaluation on SAM
The language-based SAM, when guided by specially designed prompts, can generate sufficiently robust masks even on deblurred images. Given that the original SAM model was trained with diverse data augmentation, it exhibits inherent robustness to mild noise or blur. To further improve mask quality in our implementation:
- For the region of interest, we use language-style prompts: number + object color + object name + spatial position (e.g., a brown dog on the chair).
- For masks that remain imperfect, we further apply post-processing such as dilation and erosion operations for refinement.
To evaluate the mask quality from our adopted SAM, we manually annotate two objects per image in our collected dataset as ground truth. We compare the average IoU (%):
| 3-click | language (only with object name) | language (number + object color + object name + spatial position) |
|---|---|---|
| 93.6 | 91.8 | 87.5 |
We will provide failure visualization cases in our final version.
3. User Experience Studies.
Thanks for your suggestion. We will include this consideration in our final version.
4. What happens if the mask produced by SAM is imprecise or unclear? Does the system generate invalid defocus regions, or does it degrade gracefully?
It would degrade gracefully.
5. How robust is your method to different DP hardware calibrations or disparities with high noise?
Our method leverages the structural prior of the stereo matching network to automatically estimate disparity from the DP image pair (Eq. 7), guided by both deblurring and reblurring losses. This dual supervision encourages the model to focus on the downstream task, thereby enhancing its robustness to inherent noise.
6. What is the need for invertibility? When compared to standard encoding-decoding, can you measure the improvement in results?
The network's invertibility ensures that reblurring and deblurring can be jointly trained while sharing the blur kernel $k$, better establishing a one-to-one correspondence between the disparity feature and the blur kernel. Intuitively, an effective disparity-aware blur kernel should be capable of both restoring an image (deblurring) and blurring it (reblurring). This paves a crucial path for the subsequent refocusing, enabling accurate kernel retrieval for defocus control by modifying the disparity (refer to Fig. 7 for illustration). When we empirically adopt a traditional encoder-decoder architecture, the correspondence between the kernel and the disparity becomes difficult to learn effectively, leading to unintended control effects (if possible, we will provide results during the rolling discussion stage).
7. Is the method fragile to segmentation boundary errors or small masks?
Segmentation boundary errors would affect the edges of the refocused image, but the results generally remain reasonable. Our model exhibits good robustness to small masks.
8. Is performance degraded with imprecise or noisy user prompts?
Yes. SAM may generate imprecise masks in response to noisy user prompts, which could have a certain impact on our refocusing process.
Reference:
[1] Hierarchical Neural Architecture Search for Deep Stereo Matching, NeurIPS 20
[2] Dual Pixel Exploration: Simultaneous Depth Estimation and Image Restoration, CVPR 21
[3] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, CVPR 24.
This paper proposes UiD, a User-Instructed defocus control framework that enables flexible Depth of Field (DoF) manipulation in images captured by dual-pixel (DP) sensors. The system allows users to specify refocusing regions via text, box, or point prompts, which are converted to masks using SAM. UiD features an invertible deblurring-reblurring architecture that learns spatially varying defocus kernels and disparity features from DP pairs, establishing a closed-form mapping between blur and disparity. This enables both restoration of all-in-focus images from blurred inputs and controlled reblurring for user-specified refocusing. The approach is validated through extensive experiments on standard datasets and a new self-collected DP dataset, demonstrating superior flexibility and quality in DoF manipulation.
Strengths and Weaknesses
The strengths of this article are described as follows: Proposes and validates a mathematically-driven invertible deblurring-reblurring framework that achieves SOTA deblurring performance while significantly reducing computational costs, without relying on depth annotations. Beyond SAM-powered interactive refocusing, the work contributes a novel real-world DP dataset addressing data scarcity and demonstrates robust out-of-distribution generalization.
The weaknesses of this article are described as follows: The core framework heavily relies on existing "deblur-and-reblur" paradigms (e.g., RefocusGAN, DC²) without introducing significant methodological advancements. Furthermore, the user interaction mechanism depends entirely on off-the-shelf SAM for mask generation, with no task-specific modifications. This unmodified pipeline integration substantially dilutes the claimed contribution of "user-instructed" control. The paper claims to support text, box, and point prompts for specifying refocusing regions, yet fails to provide essential implementation details (e.g., text-to-mask conversion, point/box coordinate handling in defocused images) or experimental validation of these interactive features. The self-collected DP dataset lacks a detailed description, including its size and characteristics. In addition, the ablation studies are insufficient: variants of the loss function are not fully explored, and an exploration of the weights λ is missing. There are also many formatting errors in the paper; for example, the expression in Eq. (13) does not match the text above, a period is missing at the end of the caption of Table 2, there is a spelling error in "caluclate", and there is a redundant period in "Figure. 6".
Questions
The framework presented in this paper lacks originality; to demonstrate the novelty of the article, it is necessary to elaborate on the differences from existing methods and to fully describe the implementation details. Although some quantitative results are provided, the qualitative results are not sufficiently presented to visually assess the performance of the UiD framework. In addition, the necessary ablation experiments and detailed implementation details should be added to show the effectiveness of the individual components. There are multiple formatting errors in the article that cause overall confusion; the entire text needs to be scrutinized and corrected to improve readability.
Limitations
The author of this article has fully elaborated on the advantages of the proposed method, but lacks the necessary exploration of its limitations and any potential negative social impacts, which is not conducive to the development of subsequent work. Therefore, it is necessary to supplement the content on limitations to provide direction for future research.
Justification for Final Rating
The rebuttal has addressed my main concerns, including novelty, method details, and hyperparameter settings. Compared to existing deblur-and-reblur frameworks (e.g., RefocusGAN, DC²), the proposed method introduces the Reversible Block with Shared Kernel that strictly follows the mathematical formulation (Eq. 10), and implements a Unified Network Architecture to reduce the total number of parameters and improve training efficiency. This highlights the novelty. Thus, I update my score to borderline accept.
Formatting Issues
There are indeed many formatting errors in the paper, including but not limited to the following: the expression in Eq. (13) does not match the text above, a period is missing at the end of the caption of Table 2, there is a spelling error in "caluclate", and there is a redundant period in "Figure. 6".
We sincerely thank your comments, and we give point-to-point below.
1. Novelty Highlight
Please kindly note lines 56-68. The core innovation of our work lies in the adoption of a unified invertible network that decomposes the defocus control task into reversible deblurring and reblurring subtasks. Since deblurring and reblurring are inverse processes, we establish an essential connection between them by constraining these two subtasks to share the same blur kernel, used in different forms (as formulated in Eq. 10). Specifically:
- The forward process (deblurring) uses the disparity feature to retrieve the kernel $k$ for convolution.
- The reverse process (reblurring) uses the disparity feature to retrieve the kernel $k$ and adopts its inverse kernel $\mathcal{F}^{-1}\big(\mathcal{F}(k)^{-1}\big)$ for convolution, where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, respectively.
We also employ invertible blocks (elaborated in Tab. 1 of the supplementary material) to better refine the representation. These two processes are trained collaboratively using the loss in Eq. (13).
Compared to existing deblur-and-reblur frameworks (e.g., RefocusGAN, DC²), our key improvements are:
- Reversible Block with Shared Kernel: We establish an invertible connection between deblurring and reblurring with shared kernel learning, which strictly follows the mathematical formulation (Eq. 10).
- Unified Network Architecture: We reuse one set of network parameters to achieve the refocusing task (deblur-and-reblur), significantly reducing the total number of parameters and improving training efficiency.
2. Off-the-Shelf SAM
- Note that our method does not require masks during training. Masks are only provided at test time, where the user inputs a binary mask (single-channel, the same size as the input image) to specify the region of interest for refocusing. Therefore, the absence of task-specific modifications does not affect the dynamics and robustness of our model training (trained only once).
- The language-based SAM, when guided by specially designed prompts, can generate sufficiently robust masks even on deblurred images. Given that the original SAM model was trained with diverse data augmentation, it exhibits inherent robustness to mild noise or blur. To further improve mask quality in our implementation:
- For the region of interest, we use language-style prompts: number + object color + object name + spatial position (e.g., a brown dog on the chair).
- For masks that remain imperfect, we further apply post-processing such as dilation and erosion operations for refinement.
To evaluate the mask quality from our adopted SAM, we manually annotate two objects per image in our collected dataset as ground truth. We compare the average IoU (%):
| 3-click | language (only with object name) | language (number + object color + object name + spatial position) |
|---|---|---|
| 93.6 | 91.8 | 87.5 |
We acknowledge that the quality of the refocusing results may be influenced by the quality of the input mask; we will provide failure cases in our final version.
3. Implementation Details for SAM
For language-based mask generation:
- The LLM tokenizer converts user textual prompts into tokens.
- The global token `[seg]` computes similarity with image features to generate the segmentation mask.
For click-based interactive mask generation (point or box inputs):
- Each click coordinate is transformed into a heatmap (the clicked region receives a high score).
- This heatmap is fused with the previously generated mask to iteratively refine the segmentation result. More clicks further improve the quality of the generated mask (a toy sketch follows).
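A toy sketch of this click-to-heatmap step, using a Gaussian bump at the click fused with the running mask; the exact encoding inside SAM differs:

```python
import numpy as np

def click_heatmap(shape, click_yx, sigma=10.0):
    """Gaussian bump centred on the click: high score at the clicked region."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (yy - click_yx[0]) ** 2 + (xx - click_yx[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def fuse(prev_mask, heatmap, thresh=0.5):
    """Fuse new click evidence with the running mask, then re-threshold."""
    return np.maximum(prev_mask.astype(float), heatmap) > thresh

mask = np.zeros((128, 128), dtype=bool)
mask = fuse(mask, click_heatmap(mask.shape, (40, 60)))   # first click
mask = fuse(mask, click_heatmap(mask.shape, (46, 66)))   # second click refines
```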
We will provide further technical details in the appendix of the final submission.
4. Detailed Descriptions of Self-Collected DP Dataset
Please refer to the Appendix. Additionally, our self-collected DP dataset consists of images with a resolution of 6720 × 4480 pixels. Additional characteristics will be added in the supplementary material of the final version.
5. Hyperparameter Settings
Thanks for your suggestion. Considering the reversible nature of deblurring and reblurring, we assign equal importance to the deblurring loss and the reblurring loss. The gradient regularization term is a widely used auxiliary loss in image restoration tasks [1] and is not a primary contribution of our work; hence, we did not emphasize its analysis initially.
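Schematically, the overall objective then takes the form below (a simplified rendering on our part; see Eq. 13 for the exact formulation):

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{deblur}} + \lambda_{1}\,\mathcal{L}_{\text{reblur}} + \lambda_{2}\,\mathcal{L}_{\text{grad}},
$$

where $\lambda_{1}$ and $\lambda_{2}$ are the two weights ablated below.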
In response to your suggestion, we conduct additional ablation studies on the impact of different hyperparameters. Results on PSNR (dB) are summarized below:
Ablation on $\lambda_{1}$ (reblurring loss weight):
| $\lambda_{1}$ | deblur | reblur |
|---|---|---|
| 0.3 | 26.84 | 29.11 |
| 0.5 | 26.89 | 29.18 |
| 1.0 | 26.82 | 29.13 |
Ablation on $\lambda_{2}$ (gradient regularization weight):
| $\lambda_{2}$ | deblur | reblur |
|---|---|---|
| 0.3 | 26.86 | 29.16 |
| 0.5 | 26.89 | 29.18 |
| 0.7 | 26.74 | 28.86 |
| 1.0 | 26.73 | 28.86 |
We will include these ablation studies in the experimental section of the final manuscript.
6. Formatting Errors
We appreciate your feedback and will carefully revise all formatting errors in the manuscript accordingly.
Reference:
[1] Structure-Preserving Super Resolution with Gradient Guidance, CVPR 2020
Thank you once again for your valuable feedback and your time.
We would like to kindly inquire whether our responses have adequately addressed your concerns. Should you have any further questions or require additional clarification, we would be more than happy to provide further explanation.
All four reviewers recommend acceptance (three borderline). The paper presents a user-friendly framework for Depth of Field manipulation in images from dual-pixel sensors. The strengths include the novel user interaction via the Segment Anything Model, the mathematically-driven invertible deblurring-reblurring framework, and the creation of a new real-world dataset. The main weaknesses are the reliance on existing paradigms and off-the-shelf components (like SAM), and the limitation to dual-pixel imagery. The authors' rebuttals were well-received and addressed the reviewers' concerns.