PaperHub
Overall rating: 5.0 / 10 · Rejected (4 reviewers)
Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.5
Correctness: 2.8 · Contribution: 2.0 · Presentation: 2.8
ICLR 2025

SeeThruAnything: Learning to Remove Any Obstructions Across Distributions

OpenReview · PDF
Submitted: 2024-09-13 · Updated: 2025-02-05
TL;DR

A universal model that can remove any obstructions.

Abstract

Keywords
Obstruction Removal, Zero-shot, Prompts

Reviews and Discussion

Review
Rating: 5

The paper introduces a zero-shot obstruction removal framework to handle both seen and unseen obstructions in images. The idea is to formulate obstruction removal as a soft-hard mask restoration task, leveraging multi-modal prompts to enhance generalization. The framework incorporates a tunable mask adapter that dynamically refines inaccurate masks during the restoration process. The authors show that their method achieves superior performance over state-of-the-art techniques across a wide range of both in-distribution and out-of-distribution obstructions, demonstrating its flexibility and robustness across diverse occlusions.

Strengths

  1. The introduction of the soft-hard mask prediction task intuitively enhances the model’s generalization ability, making it more adaptable to various obstruction types.
  2. The paper conducts thorough experiments to validate the effectiveness of the multiple components of the framework.

Weaknesses

  1. While the generalization largely stems from the mask prediction process, the paper lacks a detailed analysis of the quality and generalizability of the predicted masks. Are there any quantitative metrics to evaluate mask quality on both seen and unseen objects?
  2. There is no comparison with in-painting methods in experiments. It would be valuable to see a comparison with more recent diffusion-based in-painting methods. For example, [1,2].
  3. The performance on seen categories is not consistently superior to prior works.

[1] Grechka, Asya, Guillaume Couairon, and Matthieu Cord. "GradPaint: Gradient-guided inpainting with diffusion models." Computer Vision and Image Understanding 240 (2024): 103928.

[2] Lugmayr, Andreas, et al. "Repaint: Inpainting using denoising diffusion probabilistic models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

Questions

  1. In cases where an image contains multiple obstructions (e.g., raindrop and power cable), how does the model handle prompts to remove only one type of obstruction? Can it selectively remove the specified obstruction without affecting others?
  2. In Sec 5.1, the patch sizes mentioned (128, 160, 192, 256) seem unreasonably large. Should this refer to the image resolution instead?
  3. Is the model capable of handling obstructions that exhibit significant differences from seen obstructions, such as in the case of reflection elimination? Can the authors provide some visualizations with existing datasets or real-world photos?
  4. How do the authors initialize the model? Are pre-trained weights beneficial?
  5. What is the model size, and how does it perform in terms of inference speed?
Comment

We sincerely appreciate Reviewer tqQw's valuable feedback. To provide a comprehensive overview and visual evidence of the improvements, we have included the updated manuscript in the supplementary materials. We kindly invite you to review this revised version, which thoroughly documents all new comparisons and analyses.

Response-W1: Thank you for your comment. We believe there may be a misunderstanding regarding the source of our method’s generalization capability. The generalization performance does not stem primarily from the quality of the predicted masks but from our proposed formulation, which uniformly represents the complex task of obstacle removal. Even when provided with ground truth obstacle masks, many existing methods fail to remove obstructions, especially those unseen during training. This is evidenced by the visual comparison of rare obstacle removal in Figure 6 and the results of inpainting-based methods using masks, as detailed in Appendix C.3.

The effectiveness of our model lies in its ability to handle diverse obstructions through the introduction of soft and hard recovery strategies, which address both opaque and semi-transparent obstacles. Additionally, the use of multi-modal prompts allows for precise representation of obstacles, further enhancing the model’s capabilities. The mask detector we use is an off-the-shelf component and not the primary focus of this paper. However, our model is plug-and-play, meaning it can seamlessly integrate with more advanced mask detectors in the future to further improve performance.
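To make the soft/hard distinction concrete, below is a generic illustration of how a binary (hard) mask and a continuous (soft) mask could gate the corrupted input before restoration. This is a hedged sketch of the general idea only, not the authors' exact formulation; all shapes, thresholds, and the gating rule are assumptions.

```python
import torch

def build_masked_input(image: torch.Tensor, mask: torch.Tensor, opaque: bool) -> torch.Tensor:
    """Gate the corrupted image with a hard or soft obstruction mask.

    image: (B, 3, H, W) corrupted image in [0, 1].
    mask:  (B, 1, H, W) obstruction confidence in [0, 1].
    opaque: True -> hard (binary) masking; False -> soft masking that keeps
            attenuated content visible beneath semi-transparent obstructions.
    """
    if opaque:
        m = (mask > 0.5).float()   # hard mask: region fully erased, must be hallucinated
    else:
        m = mask                   # soft mask: partial transparency, content is recoverable
    return image * (1.0 - m)       # restoration network receives the gated image (plus the mask)
```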

Response-W2: Thank you for your comment. We initially included a comparison with inpainting-based methods in Figure 1 of the original draft. However, as discussed in the Introduction, inpainting-based methods, while capable of producing visually plausible results, often fail to faithfully reconstruct the original image content. This lack of fidelity was the primary reason we did not include a detailed comparison with inpainting methods in the initial draft, as our focus is on methods that preserve the authenticity of the original scene.

To address this concern and provide a more comprehensive evaluation, we expanded the comparison in Appendix C.2, focusing on zero-shot recovery performance for unseen obstacles. Since the code for GradPaint [1] is not publicly available, we included LaMa [3] and Repaint [2] as competitors. The results, presented in Table 5 and Figure 12, demonstrate that our method outperforms these inpainting-based methods both quantitatively—across three classic obstacle removal tasks—and visually, particularly for less common obstacles.

Table 5: PSNR and SSIM comparisons of our method with inpainting-based methods on unseen obstructions. The best results are highlighted in bold.

Method | Venue | Rain Streak PSNR | Rain Streak SSIM | Snow PSNR | Snow SSIM | Stroke PSNR | Stroke SSIM | Average PSNR | Average SSIM
LaMa | WACV22 | 29.07 | 0.8858 | 32.32 | 0.9108 | 28.10 | 0.8728 | 29.83 | 0.8898
RePaint | CVPR22 | 28.78 | 0.8865 | 32.20 | 0.9064 | 23.78 | 0.8059 | 28.25 | 0.8662
SeeThruAnything | Ours | 29.82 | 0.8907 | 34.85 | 0.9283 | 29.45 | 0.9067 | 31.37 | 0.9086
  • [1] Grechka A, Couairon G, Cord M. GradPaint: Gradient-guided inpainting with diffusion models, CVIU, 2024.
  • [2] Lugmayr A, Danelljan M, Romero A, et al. Repaint: Inpainting using denoising diffusion probabilistic models, CVPR 2022.
  • [3] Suvorov, Roman, et al. Resolution-robust large mask inpainting with Fourier convolutions, WACV, 2022.
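For reference, PSNR/SSIM numbers such as those in Table 5 are typically computed along the following lines. This is a sketch using scikit-image; the exact evaluation settings (color space, data range, crops) are assumptions, not the authors' released evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored: np.ndarray, ground_truth: np.ndarray) -> tuple[float, float]:
    # Both inputs: HxWx3 uint8 RGB images of the same size.
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    ssim = structural_similarity(ground_truth, restored, channel_axis=-1, data_range=255)
    return psnr, ssim

# Per-task scores are then averaged over the test set; the "Average" column
# would be the mean over the three unseen obstruction types.
```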

Response-W3: Thank you for your observation. Our method is specifically designed to address the challenges posed by unseen obstructions, which may explain why its advantages are less pronounced for known obstructions. Nonetheless, comprehensive comparisons across the three datasets demonstrate that our method achieves consistently superior overall performance.

Moreover, the comparable performance on both seen and unseen categories highlights a key strength of our approach: the distribution-agnostic obstruction formulation. By treating all obstructions uniformly as a transparency recovery problem, our method ensures consistent and robust performance across diverse scenarios, regardless of whether the obstructions were encountered during training.

Comment

Response-Q1: Thank you for your insightful question. To address this, we have included results on multi-type obstruction removal in Appendix C.2 to demonstrate the model's capability. Leveraging the multi-modal prompts in our method, we can accurately target and remove the specified obstruction (e.g., raindrops or power cables) without affecting other image content. This selective removal is clearly illustrated in Figure 11. For further details and analysis, we invite you to review the revised manuscript.

Response-Q2: You are correct—"patch size" in Sec 5.1 refers to the dimensions of the training image blocks, a term commonly used in traditional practices. It can indeed be understood as the resolution of these blocks.

Response-Q3: Thank you for your question. As noted in the limitations section, our method is currently not designed for obstacles covering large areas, such as specular reflections, which lack sufficient contextual information. Addressing this issue is one of our future research directions.

However, we have demonstrated the model’s strong zero-shot generalization ability on various obstacles significantly different from the training samples, such as spots, scratches, and power cables in Figure 6, rain streaks, snow, and strokes in Figure 9, and shadows and watermarks in Figure 10. These visualizations highlight the robustness of our approach in handling diverse unseen obstructions.

Response-Q4: Since our model and the baseline model differ in terms of the task and input format, using pre-trained weights yields results comparable to our initialization method. Therefore, we directly initialize the model using Kaiming normal distribution.
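As a reference for this initialization choice, here is a minimal PyTorch sketch of Kaiming-normal initialization; which layers it is applied to and the fan mode are assumptions, not details taken from the paper.

```python
import torch.nn as nn

def kaiming_init(module: nn.Module) -> None:
    # Kaiming normal initialization for conv/linear weights; biases set to zero.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(kaiming_init) initializes every submodule recursively.
```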

Response-Q5: Our model has 56.69 M parameters and an inference time of 84.28±0.61 ms. Please see Appendix F for a more detailed analysis.

Table 7: Comparisons of parameters, FLOPs, and runtime across models.

Model | Venue | Parameters (M) | FLOPs (G) | Runtime (ms)
Restormer | CVPR22 | 26.13 | 118.60 | 49.37±0.46
TransWeather | CVPR22 | 38.06 | 3.57 | 19.64±0.05
PromptIR | NeurIPS23 | 35.59 | 132.26 | 53.95±0.47
WGWSNet | CVPR23 | 4.70 | 96.65 | 88.39±0.35
Histoformer | ECCV24 | 16.62 | 86.79 | 83.13±0.82
XRestromer | ECCV24 | 22.34 | 155.49 | 100.67±0.44
SeeThruAnything | Ours | 56.69 | 146.23 | 84.28±0.61
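Numbers like those in Table 7 are commonly measured along these lines (a sketch, not the authors' benchmarking script; the input resolution, warm-up count, CUDA device, and single-tensor forward signature are assumptions, since the real model also takes masks and prompts as input):

```python
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    # Trainable parameter count in millions.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def time_inference_ms(model: torch.nn.Module, shape=(1, 3, 256, 256),
                      runs: int = 100, device: str = "cuda") -> float:
    model = model.eval().to(device)
    x = torch.randn(*shape, device=device)
    for _ in range(10):                      # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3   # mean runtime in ms
```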
Comment

Thank the authors for their detailed response. While some of my concerns have been effectively addressed, I have one additional question. Regarding W1, I would expect to see a comparison highlighting the zero-shot ability of the mask detector. For instance, how does the performance (e.g., IoU) of training on obstacle set A and testing on obstacle set B compare to direct training and testing on obstacle set B? Such results would help reinforce the zero-shot claim. Additionally, a follow-up question arises: if we change set A, how sensitive is the model to the choice of A? There is no need for an exhaustive grid search; one or two representative results would suffice to provide valuable insight.

Comment

Thank you for your follow-up question. We understand your concerns regarding mask detectors; however, we would like to clarify that designing a zero-shot mask detector is not the focus of our work. This is a distinct research area explored in studies such as referring segmentation, with recent advancements like Grounded SAM 2 showcasing powerful zero-shot capabilities. Our emphasis lies in understanding transparent obstructions and recovering the occluded transparent context beneath them. To ensure a fair evaluation of recovery capabilities, all methods in our comparisons use the same masks.

Except for the three basic obstructions used for training, all our experiments are designed to demonstrate zero-shot performance, effectively generalizing to diverse unseen scenarios. Simply put, our approach trains on obstructions A, B, and C, and tests on obstructions D through Z. This highlights the robustness of our distribution-agnostic obstruction formulation and its capability to handle recovery tasks across a wide range of obstructions. We hope this clarification underscores the focus of our work and addresses your concern.

Review
Rating: 5

This paper proposes a new obstruction removal method, SeeThruAnything, to reconstruct a clear original image given a degraded image and the estimated occlusion mask as input. To deal with different obstructions with or without ambiguous boundaries, SeeThruAnything utilizes a transformer-based tunable adapter to convert hard masking to soft masking and uses different maskings for different obstructions during inference. To better recover the original clean image, the paper also utilizes CLIP to extract multi-modal information from corrupted images and text commands like "remove semi-transparent obstructions" as conditioning for the network. Cross-attention is used to inject this multi-modal information into the model. The proposed method obtains competitive performance compared to SOTA on seen obstructions and SOTA performance on unseen obstructions.
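As a rough illustration of the injection mechanism described above, the sketch below shows a cross-attention unit in which flattened image features attend to CLIP prompt tokens. The embedding dimension, head count, and residual/normalization layout are assumptions, not the authors' released architecture.

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Image tokens (queries) attend to multi-modal prompt tokens (keys/values)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, H*W, C) flattened spatial features.
        # prompt_tokens: (B, M, C) projected CLIP text/visual embeddings.
        attended, _ = self.attn(query=img_tokens, key=prompt_tokens, value=prompt_tokens)
        return self.norm(img_tokens + attended)   # residual injection of prompt information
```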

Strengths

  1. The proposed method is quite simple and well motivated, and has the potential to become a common baseline for future works in this field.
  2. The proposed method obtains competitive performance on seen obstructions and SOTA performance on unseen obstructions compared to previous methods.

Weaknesses

  1. The proposed tunable mask detector seems to be heavy. It would be best to mention the number of parameters and the FLOPs for your method and the compared methods, so that we can distinguish the performance improvement brought by the increased parameters.
  2. The proposed method uses corrupted images with obstructions removed as input. The obstructions are removed according to inaccurately estimated obstruction masks. However, previous works mainly take degraded images with unremoved obstructions as input. There is no ablation study to prove the advantage of your design.
  3. The images and text in Figure 1 might be too small. It is difficult to make out the comparison in Figure 1.

Questions

  1. As the method is mainly tested on synthetic corrupted data, how does it perform for images with multi-type obstructions?
  2. What mask detector is used? Is it the same one for compared methods?
  3. What is the exact text command used in your method? Do you use different text commands for different obstructions? How does it perform when only using a consistent text command like "remove obstructions"?
Comment

We sincerely thank Reviewer AeWH for their valuable comments. To provide a comprehensive overview and visual evidence of the improvements, we have included the updated manuscript in the supplementary materials. We kindly invite you to review this revised version, which thoroughly documents all new comparisons and analyses.

Response-W1: Thank you for your observation. To address this, we have included a detailed comparison of model parameters, FLOPs, and runtime in Appendix F. While our model has a slightly higher number of parameters compared to the baselines, the FLOPs and runtime—which are more reflective of computational efficiency—remain within a practical range. Notably, the increase in parameters primarily arises from the introduction of the cross-attention module for integrating multi-modal prompts, not from the tunable adapter itself. Specifically, the adapter contributes only 0.2176 million additional parameters and 11.08 giga FLOPs, ensuring that its impact on computational cost is minimal.

Table 7: Comparisons of parameters, FLOPs, and runtime across models.

Model | Venue | Parameters (M) | FLOPs (G) | Runtime (ms)
Restormer | CVPR22 | 26.13 | 118.60 | 49.37±0.46
TransWeather | CVPR22 | 38.06 | 3.57 | 19.64±0.05
PromptIR | NeurIPS23 | 35.59 | 132.26 | 53.95±0.47
WGWSNet | CVPR23 | 4.70 | 96.65 | 88.39±0.35
Histoformer | ECCV24 | 16.62 | 86.79 | 83.13±0.82
XRestromer | ECCV24 | 22.34 | 155.49 | 100.67±0.44
SeeThruAnything | Ours | 56.69 | 146.23 | 84.28±0.61

Response-W2: Thank you for your comment. We would like to clarify that we conducted this ablation study in our original submission, and the results are presented in Table 3. To make this clearer, we have highlighted this section in the revised manuscript and added explanatory footnotes to improve readability and ensure the results of our ablation study are easily understood.

Table 3: PSNR and SSIM comparisons of integrating different modules.

Mask | CA | Adapter | PSNR | SSIM
     |    |         | 27.05 | 0.8920
     |    |         | 28.05 | 0.9004
     |    |         | 30.00 | 0.9117
     |    |         | 30.93 | 0.9250

Response-W3: Thank you for your suggestion; we have reorganized Figure 1 for better comparison.

Response-Q1: Thank you for your insightful question. We have conducted experiments on images with multiple types of obstructions, as detailed in Appendix C.2 and illustrated in Figure 11. By leveraging multi-modal prompts to represent various obstructions, SeeThruAnything effectively removes the specified obstacles with high accuracy. Additional results and in-depth analysis are provided in the revised manuscript for your review.

Response-Q2: Thank you for your question. The mask detector used in our method is described in detail in Appendix A.1 of the initial draft. In the revised version, we have highlighted this section in blue for clarity. To ensure a fair comparison, the same mask inputs are used across all experiments for both our method and the comparative methods.

Response-Q3: Thank you for your question. Examples of the text commands we use are provided in Figure 8 and Figure 14, ranging from simple descriptive phrases to more complex sentences. We typically use different text commands to specify whether an obstacle is opaque or semi-transparent. A generic command like "remove obstruction" may not yield optimal results, as our method interprets it as referring to opaque obstacles and performs hard mask recovery accordingly. However, by including transparency-specific descriptions in the input text, such as "opaque yarn" or "semi-transparent obstacles," our method achieves significantly better results.

Comment

I acknowledge that I have read the responses from the authors and the discussions among the other reviewers. Most of my concerns are addressed. For Q2, how SAM2 is used to detect obstruction masks is not mentioned. For Q3, the contrastive fine-tuning of the CLIP text encoder may need further justification; how the fine-tuning benefits performance is not mentioned. The authors could also consider using GPT-4 to generate more command samples for fine-tuning. I am also concerned about the generalizability of the proposed method: the limitation in handling large areas of occlusion may indicate limited generalizability of soft masking.

Comment

Response:

Thank you for your thoughtful comments. We are glad to hear that most of your concerns have been addressed. Below are our responses to the remaining points you raised:

  1. SAM 2 Usage: We use SAM 2 in two modes. In automatic mode, GroundingDINO detects obstructions, and SAM 2 generates masks. In manual mode, masks are created based on user-provided input points or bounding boxes. This detail was not initially included because all competitors use the same masks, and our primary focus has been on evaluating removal performance. However, we will include these details in the revised manuscript for clarity.

  2. CLIP Text Encoder Fine-tuning: We utilized ChatGPT-4o to generate 3,984 diverse text commands for fine-tuning, equal to the number of our training images, ensuring robust training. Fine-tuning provides two key benefits: (1) it reduces the semantic distance between user-generated commands and the core commands, enabling the model to better interpret user intent; (2) it enhances the model's ability to identify when to apply soft masking, improving semi-transparent obstruction removal. For instance, before fine-tuning, the cosine similarity (with softmax) between a user command like "There are raindrops in the image, please remove them" and the core commands ("remove opaque obstructions" and "remove transparent obstructions") was 0.5374 and 0.4626, respectively. After fine-tuning, these values improved to 0.00004 and 0.99996, demonstrating significantly improved semantic alignment (a sketch of this similarity computation is given after this list).

  3. Soft Masking Generalizability: We acknowledge that handling large-area occlusions remains challenging due to the significant loss of underlying visual semantics, which are critical to our recovery formulation. This limitation is also shared by inpainting-based methods. However, our approach represents a novel attempt at unseen obstruction removal, with extensive experiments demonstrating strong generalizability across diverse unseen obstructions. The category-agnostic obstruction formulation and unified model design establish a promising foundation for addressing such challenges. We appreciate your suggestion and will explore ways to enhance performance for large-area occlusions in future work.
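A sketch of the similarity computation quoted in point 2, assuming the open-source OpenAI CLIP package and a ViT-B/32 backbone (both assumptions); the fine-tuned encoder weights themselves are not public, so the commented numbers are the authors' reported values rather than reproducible outputs.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # backbone choice is an assumption

core_commands = ["remove opaque obstructions", "remove transparent obstructions"]
user_command = "There are raindrops in the image, please remove them"

with torch.no_grad():
    tokens = clip.tokenize([user_command] + core_commands).to(device)
    emb = model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)    # unit-normalize embeddings
    cos = emb[0] @ emb[1:].T                      # cosine similarity to each core command
    probs = cos.softmax(dim=-1)                   # softmax over the two core commands

# Reported by the authors: probs of roughly (0.5374, 0.4626) before fine-tuning,
# and (0.00004, 0.99996) after fine-tuning the text encoder.
```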

We hope these responses clarify your concerns. If they do, we kindly request you consider adjusting your score to reflect this. Thank you again for your constructive feedback.

Comment

Sorry, I just have another point requiring clarification. For the compared methods, do you use the same masked images as input? As you mention in the paper, previous methods are not trained with masked images as input. Do you retrain all compared methods? I notice Histoformer (ECCV 2024) claims a raindrop-removal performance of 33.060 PSNR and 0.9441 SSIM, whereas in your Table 1 its performance is 31.59 PSNR and 0.9614 SSIM. Can you clarify this difference?

Comment

Response:

Thank you for your comments. To ensure a fair comparison, we provided the same mask inputs for all competing methods and retrained them on the same dataset we used, thereby granting them zero-shot capability (discussed on Page 9, footnote 1).

Regarding the concern about the raindrop dataset, we utilized a dataset different from Histoformer. Specifically, we employed the VRDS dataset [1], a more recent and advanced dataset for raindrop removal, while Histoformer used the Raindrop dataset [2]. Additionally, Histoformer trains separate weights for each obstacle type, whereas we retrained all competing methods as single general models to evaluate their universality. These differences account for the variations in performance metrics. We hope this explanation clarifies our methodology. Thank you again for your valuable feedback.

  • [1] Wu H, Yang Y, Chen H, et al. Mask-Guided Progressive Network for Joint Raindrop and Rain Streak Removal in Videos. MM, 2023.

  • [2] Qian R, Tan R T, Yang W, et al. Attentive Generative Adversarial Network for Raindrop Removal from a Single Image. CVPR, 2018.

Comment

Thanks for your quick response! I have a follow-up question regarding my initial W2. When you retrain the compared methods, does your masked input also produce better results for them? As you have shown that the masked input works for your method, it would be interesting to see whether this masking strategy also works for the existing methods.

Comment

Thank you for your follow-up question. Our distribution-agnostic obstruction formulation is inherently tied to the use of multi-modal prompts, which work together to enable the effectiveness of our method. Without these components, our method essentially reduces to a Restormer, as demonstrated in our ablation studies.

This observation also answers your question: incorporating our components—such as the distribution-agnostic formulation and multi-modal prompts—can potentially improve the performance of other baseline methods as well. We will explore this aspect further in the final revision.

Thank you again for your insightful question!

Review
Rating: 5

The article presents SeeThruAnything, a novel zero-shot framework designed to effectively remove various types of obstructions in images. SeeThruAnything employs multi-modal prompts—combining visual and textual inputs—processed through a cross-attention unit for enhanced contextual understanding. It also features a tunable adapter for mask adjustments. Extensive experiments demonstrate that SeeThruAnything excels in both familiar and unfamiliar obstacle scenarios, showcasing strong performance and generalization capabilities in obstruction removal tasks.

Strengths

  1. The paper is well-written and easy to follow.
  2. The paper demonstrates that SeeThruAnything is highly effective in removing obstacles, particularly in generalizing to unseen obstacles outside the training distribution.
  3. The paper conceptualizes obstacle removal as a problem of soft and hard mask recovery, offering significant insights into the future research directions of this field. By integrating visual tokens with text tokens, the model’s capacity for generalization in open-world scenarios is substantially enhanced.

Weaknesses

  1. The technical contribution of the paper is limited. The use of multi-modal prompts and mask recovery techniques, although effective, may not significantly depart from established methodologies, suggesting a reliance on existing concepts rather than groundbreaking innovations.
  2. Generalization Limitation. While SeeThruAnything demonstrates the capability to remove unseen obstacles, these obstacles are often fundamentally similar in nature (e.g., raindrops and rain streaks, fences and yarn). This is underscored by the observation that the performance of WGWSNet and PromptIR on rain streaks and strokes is nearly comparable to, or even surpasses, that of SeeThruAnything.

Questions

As you mentioned, the original configuration of the other methods cannot achieve zero-shot tasks. How do you give them this ability?

Comment

We sincerely appreciate Reviewer UqLU's valuable feedback. To provide a comprehensive overview and visual evidence of the improvements, we have included the updated manuscript in the supplementary materials. We kindly invite you to review this revised version, which thoroughly documents all new comparisons and analyses.

Response-W1: Thank you for your feedback. We believe this concern may stem from a partial or shallow understanding of some aspects of our techniques. Our method introduces a novel distribution-agnostic obstruction formulation that unifies obstruction removal as a single framework, treating all obstructions as a soft masking problem. This formulation focuses on understanding the underlying context of transparent obstructions, which is critical for effective removal.

We demonstrate that even with a ground truth mask, the inherent distribution of different obstructions cannot be easily handled using traditional approaches. Our approach addresses this limitation by enabling zero-shot removal through a flexible soft mask recovery strategy, which generalizes across diverse obstruction types, including unseen scenarios. This perspective has not been explored in prior work and significantly advances both the theoretical foundation and practical application of obstruction removal.

Response-W2&Q1: Thank you for your comments and questions. We believe some aspects of your concerns might stem from a misunderstanding. The original versions of WGWSNet and PromptIR are not designed for zero-shot tasks and are largely ineffective for unseen obstructions. To ensure a fair comparison, we standardized the inputs for all methods, including the same degraded images and masks, effectively giving these methods some degree of zero-shot removal capability. While their complex designs allow them to perform comparably to ours in specific unseen scenarios with distributions similar to their training data, our method consistently outperforms them. The PSNR and SSIM comparisons across three unseen scenarios (Table 2) highlight our model's advantages, with a PSNR lead of 1.88 dB over the second-best method.

Regarding the similarity between fences and yarn, this appears to be a misconception. Fences exhibit regular patterns and predictable distributions in the image domain, while rare obstructions like yarn have random distributions and highly irregular patterns, making them much harder to remove. This difference is clearly demonstrated in the results shown in Figure 6.

Our approach addresses these challenges through a novel soft and hard mask strategy that understands the underlying context of diverse obstructions, enabling effective removal even in complex or unseen scenarios. Competing methods struggle in these cases, which underscores the unique generalization capability of our model. We hope this clarification resolves any misunderstanding and highlights the strengths of our proposed method.

Comment

Thank you for your reply, which has helped me understand the author's intention to some extent.

However, I still have some confusion about W2 and Q1. WGWSNet is an image restoration method designed for weather conditions, applicable to both general and specific weather scenarios through a two-stage approach. PromptIR is more closely aligned with this paper; it restores images with a few lightweight prompts. While I know the difference between these methods, the authors do not clarify how these two methods achieve zero-shot removal capabilities. If the original versions of these methods are largely ineffective for unseen obstacles, then merely utilizing the same masks and degraded images would not be sufficient to acquire this ability. The authors should clarify this to ensure a fair comparison.

Moreover, the performance shown in Table 2 does not demonstrate a significant improvement compared to the other methods; in some cases, it is even less effective. Additionally, the proposed method has a larger number of parameters than the other methods, and the performance gains may be largely due to this, which further weakens the claim of "unique generalization ability."

Comment

Response:

Thank you for your comment. We believe there may still be some misunderstanding regarding the comparison and contributions of our work.

While adding masks as input to all competing methods provides limited zero-shot capability, the key difference lies in how unseen obstructions are handled. Methods like PromptIR and WGWSNet are fundamentally not designed for generalization to unseen obstructions. Specifically, PromptIR relies on predefined degradation embeddings (which can be updated during training) and estimates probabilities that enhance feature representations. This approach is inherently limited to predefined degradation types and does not generalize well to unseen scenarios. WGWSNet, on the other hand, employs a two-stage approach: the first stage trains a general model, while the second stage trains multiple parallel pathways tailored to specific degradation types. During inference, each degradation type requires its own pathway in addition to the general model.

This limitation is evident when addressing rare or complex obstructions, as shown in Figure 6, where unseen obstructions differ significantly from those seen during training. In contrast, our distribution-agnostic obstruction formulation enables robust zero-shot generalization across a wide range of scenarios.

We also respectfully disagree with the suggestion that our method does not demonstrate significant improvement. In unseen scenarios, our method achieves an average PSNR gain of 2–7 dB over competing methods, as shown in Table 2. Such performance improvement is considered substantial in the field of image processing, particularly for challenging unseen cases. The closer performance observed for rain streak removal can be attributed to the similar distribution of rain streaks to seen types like raindrops, which aligns more closely with the training data of competing methods. This contrasts with the other two unseen types, where their performance is significantly worse, underscoring their limited ability to generalize effectively.

Lastly, while our model has the highest parameter count due to the inclusion of multi-modal inputs, we emphasize that this is not the primary driver of our performance improvement. As demonstrated in Table 7, FLOPs and runtime comparisons show that our method is computationally comparable to existing models, with inference times remaining within an acceptable range. The performance gains stem from our distribution-agnostic formulation design, not from the increased parameter count.

We hope this clarification addresses your concerns and highlights the strengths of our approach. Thank you again for your valuable feedback.

Review
Rating: 5

The paper introduces an in-painting method to remove real-world obstructions from images. The method uses multimodal prompts from a pretrained CLIP model as conditioning to the in-painting transformer model and shows good improvement over prior art.

Strengths

The paper is well written, and the method and architecture are clearly explained and tested on a wide range of benchmarks. The proposed method shows good improvement over recently published methods in the domain.

Weaknesses

  1. The paper does not include any examples of textual prompts. The only examples are in Figures 1 and 3. The CLIP text encoder is not explicitly trained on instructions like "Remove the semi-transparent obstruction", and the image-text datasets used to train CLIP models typically have captions describing the foreground, which may or may not describe the type of occlusions. It is unclear how the embedding space of CLIP's text encoder is capable of embedding such instructions.
  2. The paper does not provide details on the CLIP model used for generating multimodal prompts.
  3. The ablation in Table 4 does not include a "visual prompt only" setting. An interesting ablation would be to use different text embedding models apart from CLIP.

Questions

My main concern with this work is the use of the CLIP text encoder, as it is typically not trained on textual instructions as depicted in Figures 1 and 3. Is it possible that any text encoder would work in this setup? Also, an ablation on textual prompts would be good to have, i.e., what level of detail is necessary in the prompt to achieve a good in-painting result. I am willing to update the score based on the response to the above questions.

Comment

We sincerely appreciate the insightful comments provided by Reviewer Ms8C. To offer a detailed account and visual evidence of the improvements, we have included the updated manuscript as part of the supplementary materials. We kindly invite you to review this revised version, which thoroughly documents all new comparisons and analyses.

Response-W1: Thank you for raising this important point. Prior to training the SeeThruAnything model, we fine-tuned the CLIP text encoder to adapt its embedding space to our specific task. The fine-tuning process aimed to optimize the embedding space by reducing intra-class distances (i.e., decreasing the cosine distance between embeddings of different textual descriptions for the same task) while increasing inter-class distances (i.e., enhancing the separation between embeddings of textual descriptions for different tasks). This adaptation ensures that the CLIP text encoder can effectively handle task-specific instructions. As this is a relatively small adjustment, it was not included in the initial draft. However, we have provided a detailed explanation of this technique in Appendix B of the updated manuscript for your reference.
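A minimal sketch of this kind of contrastive objective is given below: embeddings of paraphrased commands for the same task are pulled together, while those for different tasks are pushed apart. The supervised-contrastive form, temperature, and label scheme are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def command_contrastive_loss(embeddings: torch.Tensor, task_labels: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    # embeddings: (B, D) CLIP text embeddings of command paraphrases.
    # task_labels: (B,) task id, e.g. 0 = opaque / hard mask, 1 = semi-transparent / soft mask.
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature                                   # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = ((task_labels.unsqueeze(0) == task_labels.unsqueeze(1)) & ~self_mask).float()
    # Pull same-task commands together, push different-task commands apart
    # (supervised-contrastive-style objective).
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1.0)
    return loss.mean()
```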

Response-W2: Thank you for your suggestion. To improve the model's reproducibility and make it easier for readers to understand, the usage details about CLIP have been added to the Appendix B of the updated version.

Response-W3: Thank you for your insightful suggestions. We have added the ablation experiment using only the visual prompt to Table 4 for completeness. Additionally, we explored replacing the CLIP text embedding model with the BLIP model, and the results show comparable performance to our final setup. Detailed metrics and analysis can be found in Appendix D.2 and Table 6.

Initially, we did not include the "visual prompt only" setting because using additional images as input deviates from the primary goal of leveraging text-based instructions, which are more direct, interpretable, and practical in real-world scenarios. Such a setting introduces complexity that may limit the applicability and flexibility of the model. However, we appreciate the opportunity to address this point in the revised manuscript.

Table 4: PSNR and SSIM comparisons of using different prompt strategies.

Textual Prompt | Visual Prompt | PSNR | SSIM
               |               | 28.65 | 0.9063
               |               | 29.73 | 0.9168
               |               | 30.25 | 0.9215
               |               | 30.93 | 0.9250

Table 6: PSNR and SSIM comparisons of using different prompt generation strategies.

Model | PSNR | SSIM
SeeThruAnything + CLIP | 30.93 | 0.9250
SeeThruAnything + BLIP | 31.01 | 0.9235

Response-Q1: Thank you for raising this important concern. We believe this issue stems from a misunderstanding, which we have now clarified in Appendix B. Specifically, the CLIP text encoder was fine-tuned to better adapt to task-specific instructions, and its role within our framework is clearly outlined. To address the question of generalizability, we conducted additional experiments replacing the CLIP text encoder with the BLIP model, as detailed in Appendix D.2. The results confirm that our framework is not restricted to CLIP and can adapt to other commonly used pretrained text encoders.

Regarding textual prompts, their primary purpose in our approach is to specify the type of obstacle to be removed, rather than providing complex or detailed scene descriptions. This allows for robust performance without requiring intricate prompts. Examples illustrating this simplicity and effectiveness can be found in Figure 8 and Figure 14.

We hope this clarification resolves the misunderstanding and highlights the robustness and flexibility of our approach. In light of this additional evidence and explanation, we kindly request the reviewer to reconsider their rating.

Comment

Dear Reviewers,

We sincerely appreciate your valuable feedback and the critical concerns raised. We believe some of these concerns stem from misunderstandings, which we have addressed thoroughly in the revised manuscript. Specifically, our distribution-agnostic obstruction formulation, which unifies obstruction removal tasks, is the key innovation enabling generalizability to both seen and unseen obstructions. This formulation is not only impactful for obstruction removal but also holds significant potential for broader applications in image recovery tasks. We kindly invite you to review the clarifications and improvements in our revised manuscript and hope you will reconsider your ratings in light of these contributions.

Thank you for your time and thoughtful consideration.

Sincerely,

The Authors

Comment

Dear Reviewers,

Thank you for your thoughtful comments and feedback. We have provided detailed responses to address all concerns, including clarifications and revisions. We would greatly appreciate it if you could reconsider your rating to reflect the improvements and explanations provided in our replies.

Thank you for your time and consideration.

Sincerely,
The Authors

AC Meta-Review

All the reviewers provided negative ratings. Although the paper has some merits, e.g., competitive results, the reviewers pointed out a few critical concerns about 1) technical contributions compared to prior work, 2) technical clarity, such as the prompts used and details of CLIP text encoder training, and 3) generalization limitations and performance on seen cases. After taking a close look at the paper, rebuttal, and discussions, the AC agrees with the reviewers' feedback and hence suggests the rejection decision. The authors are encouraged to improve the paper based on the feedback for the next venue.

Additional Comments from Reviewer Discussion

In the rebuttal, some of the concerns, such as technical clarity, were explained by the authors. However, the generalization issue raised by reviewers UqLU, AeWH, and tqQw was not fully addressed in the post-rebuttal period. The AC agrees with the three reviewers that this can still be significantly improved in the manuscript, since one of the paper's main focuses is the zero-shot removal capability.

Final Decision

Reject