PaperHub
ICLR 2024 · Decision: Rejected
Overall rating: 5.5 / 10 from 4 reviewers (individual scores: 8, 6, 3, 5; min 3, max 8, std 1.8)
Average confidence: 4.3

IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

Unlocking multimodal prompting of image inpainting models for computer vision tasks

Abstract

Keywords
visual prompt, image inpainting, in-context learning, multimodal prompt

Reviews and Discussion

Review (Rating: 8)

The paper explores in-context visual learning with multimodal prompting, i.e., automatically completing dense vision tasks according to visual prompts, text prompts, or a combination of both. To achieve this, the authors collected a new dataset of figures from computer vision papers and their associated captions from the Semantic Scholar website. Building on the previous method MAE-VQGAN, the authors add text as input and use cross-attention layers to fuse the text tokens and image tokens. The model achieves higher performance than the previous self-supervised visual prompting method and can follow both textual and visual prompts for novel tasks.
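As a concrete illustration of the text-image fusion this summary describes, below is a minimal, hypothetical PyTorch sketch of injecting CLIP text tokens into a ViT-style inpainting backbone through cross-attention. Module names, dimensions, and block placement are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: cross-attention fusion of CLIP text tokens into image
# tokens of a ViT-based inpainting model (names/dims are illustrative).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, img_dim=1024, txt_dim=768, num_heads=16):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)      # project CLIP text tokens
        self.norm = nn.LayerNorm(img_dim)
        self.cross_attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_img, img_dim) from the ViT encoder/decoder
        # txt_tokens: (B, N_txt, txt_dim) from a frozen text encoder (e.g., CLIP)
        txt = self.txt_proj(txt_tokens)
        q = self.norm(img_tokens)
        attn_out, _ = self.cross_attn(query=q, key=txt, value=txt)
        return img_tokens + attn_out                     # residual fusion

# Such blocks would be interleaved with the usual self-attention blocks, and the
# masked region would then be decoded into VQGAN token indices as in MAE-VQGAN.
```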

Strengths

  • In-context visual learning is a very interesting and significant topic. This paper presents new capabilities of multimodal prompting along this direction.
  • The proposed Semantic Scholar Computer Vision dataset (S2CV) provides new angles to leveraging unlabeled in-context data with both text and images. The dataset can be useful for future research.
  • The method is simple and straightforward, not novel but can be easy to follow and adapt.
  • The paper shows many qualitative cases and comparisons studying the properties of the proposed IMProv model, e.g., the comparison with Stable Diffusion, which provides a lot of insight.

Weaknesses

  • It would be more convincing to have more quantitative experiments on more tasks, e.g., one-shot or few-shot tasks. Although it might be challenging, it will give readers a bigger picture.
  • What about using multiple prompts? For example, multiple images, or even multiple image-text pairs as the prompt. It would be interesting to see what happens and how to further unleash the potential of the model.
  • Failure cases should be analyzed and discussed.

Questions

  • The authors use ViT-L by default. Did you try backbones at different scales?
  • If we want to further scale up, how can we get more data like Semantic Scholar Computer Vision dataset? It would be interesting to have more discussion.
Comment

Thank you for the review and suggestions. We address your comments below.

Q: It would be more convincing to have more quantitative experiments on more tasks, e.g., one-shot or few-shot tasks. Although it might be challenging, it will give readers a bigger picture.

A: We provide results of more Image-to-X tasks in Table 5 and visualization results in Figure 7 and 8.

Q: What about using multiple prompts? For example, multiple images, or even multiple image-text pairs as the prompt. It would be interesting to see what happens and how can further unleash the potential of the model.

A: We provide the results of multiple prompts in Table 8 and Figure 14. It shows our model can handle and benefit from multiple prompts.

Q: Failure cases should be analyzed and discussed.

A: For some of the failure cases in Bar et al., our IMProv can successfully address them with text prompts, e.g., the cat colorization and bottle segmentation under "Task ambiguity". Our model also successfully generates the image that moves the orange to the center, though it fails on the other non-aligned input-output example.

Q: The authors use ViT-L by default. Did you try backbones at different scales?

A: We tried a larger network (ViT-H) as the encoder, but there was no significant improvement. We suspect that scaling up the decoder instead of the encoder may help, but we did not have a sufficient computing budget to try it. We plan to explore this as future work.

Q: If we want to further scale up, how can we get more data like Semantic Scholar Computer Vision dataset? It would be interesting to have more discussion.

A: One potential way to scale up the dataset is by collecting instructional videos and incorporating them during the training process (e.g. videos of academic talks, oral presentations from conferences, and paper summary videos). We will add it to the discussion and explore this direction in future work.

Comment

Hi, I think some person's name appears in your reply.

Review (Rating: 6)

This paper introduces a model for solving computer vision tasks using multi-modal prompts based on inpainting. In addition to visual prompts, this method also utilizes text features encoded by CLIP as prompts, injecting them into the vision transformer via cross-attention. To train the model, the authors collected a new dataset, the Semantic Scholar Computer Vision (S2CV) dataset, which includes structured textual descriptions for precise task delineation, enabling the model to better handle distribution shifts. The model expands the range of tasks it can perform to include images-to-X and X-to-images tasks.

Strengths

  1. The authors' incorporation of text as an additional cue into the practice of visual in-context learning is insightful.
  2. Abundant experimental results demonstrate the effectiveness of the approach. The paper, through a comparison of IMProv trained on CCVF data and MAE-VQGAN, provides evidence that multi-modal prompts consisting of both text and images outperform purely visual prompts. Furthermore, the paper shows that IMProv trained on a mixed dataset of S2CV + LAION, which includes structured textual prompts, can further enhance the model's performance.
  3. The paper extends the model's application scope, enabling it to accomplish image-to-X and X-to-image tasks, contributing to the further advancement of In-context learning in computer vision.

Weaknesses

  1. Using only LPIPS as a quantitative metric for extending applications in image-to-X and X-to-image tasks may not be particularly convincing. While there may be significant variations in metrics for image-to-X tasks, the use of standardized metrics like FID for X-to-image tasks would provide more solid experimental results.
  2. The choice of sample images in Figure 2 of the paper, which showcases the pipeline, does not seem to be well-suited and may not have a strong relevance to the tasks addressed in the paper.

Questions

  1. Do large language models that exclusively handle text, such as T5, outperform models that encode text using CLIP?
  2. In Tables 4 and 7, the models trained with CCVF and LAION exhibit significantly better metrics compared to models trained solely with CCVF. Could the authors provide some analysis, possibly accompanied by visual results, to explain the reasons behind this substantial improvement?

Comment

Thank you for the review and suggestions. We address your comments below.

Q: Metrics other than LPIPS for image-to-X and X-to-image tasks.

A: We generally agree; however, we followed Wang et al. [2*], who used the images from InstructPix2Pix. These images are generated with Stable Diffusion, and there is no ground-truth annotation for the depth/edge/segmentation. Therefore, like Wang et al., we report MSE and LPIPS (see the table below).

Each cell reports LPIPS / MSE.

| Method | Depth -> Image | Image -> Depth | HED -> Image | Image -> HED | Seg -> Image | Image -> Seg | Normal -> Image | Image -> Normal |
|---|---|---|---|---|---|---|---|---|
| Supervised ICL (InstructPix2Pix) | 0.65 / 4.48 | 0.60 / 3.91 | 0.59 / 4.36 | 0.62 / 3.26 | 0.64 / 3.65 | 0.61 / 3.11 | 0.62 / 3.91 | 0.55 / 2.55 |
| IMProv (CCVF+LAION) | 0.61 / 4.60 | 0.52 / 2.69 | 0.51 / 3.83 | 0.46 / 1.54 | 0.59 / 3.35 | 0.50 / 2.61 | 0.56 / 3.63 | 0.48 / 2.03 |
| IMProv (CCVF+LAION+InstructPix2Pix) | 0.55 / 3.67 | 0.43 / 2.63 | 0.47 / 3.26 | 0.37 / 1.59 | 0.54 / 3.16 | 0.46 / 2.38 | 0.51 / 3.05 | 0.44 / 1.90 |
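For context on the two metrics in the table, the following is a minimal sketch (assuming the `lpips` and `torchvision` packages, placeholder file names, and no particular scaling of the reported MSE values) of how LPIPS and MSE can be computed for a predicted/ground-truth image pair; it is not the authors' evaluation code.

```python
# Minimal sketch: LPIPS and MSE between a generated image and a reference.
# File paths and preprocessing are placeholders, not the authors' pipeline.
import torch
import lpips
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),              # scales pixels to [0, 1]
])

def load(path):
    x = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)
    return x * 2 - 1                    # LPIPS expects inputs in [-1, 1]

loss_fn = lpips.LPIPS(net="alex")       # AlexNet-backed perceptual distance

pred, gt = load("prediction.png"), load("ground_truth.png")
print("LPIPS:", loss_fn(pred, gt).item())
print("MSE:  ", torch.mean((pred - gt) ** 2).item())
```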

Q: The choice of sample images in Figure 2 of the paper, which showcases the pipeline, does not seem to be well-suited and may not have a strong relevance to the tasks addressed in the paper.

A: The sample image in Figure 2 is a random sample from our dataset, which comes from one of the computer vision papers, and represents the common image pattern of our dataset.

Q: Do large language models that exclusively handle text, such as T5, outperform models that encode text using CLIP?

A: We tried both CLIP and T5 text encoders and found that the larger T5 text encoder does not improve performance: under the "Random Class" visual prompt setting, T5 reaches 33.26 mIoU versus 36.29 mIoU for CLIP. We therefore use CLIP for all other experiments to save computation.
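For illustration, a minimal sketch of extracting CLIP text-token embeddings with Hugging Face `transformers` is shown below; the checkpoint name and prompt are assumptions, since the reply does not specify which CLIP variant is used.

```python
# Hypothetical sketch: encoding a text prompt with a CLIP text encoder.
# The checkpoint and prompt are illustrative placeholders.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModel.from_pretrained(name).eval()

prompt = "Left: input image. Right: foreground segmentation mask."
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    txt_tokens = text_model(**tokens).last_hidden_state   # (1, 77, 768)
# These token embeddings would be fed to the model's cross-attention layers;
# a T5 encoder could be swapped in the same way for the comparison above.
```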

Q: In Tables 4 and 7, the models trained with CCVF and LAION exhibit significantly better metrics compared to models trained solely with CCVF. Could the authors provide some analysis, possibly accompanied by visual results, to explain the reasons behind this substantial improvement?

A: As reported in the "Dataset effect." section from Bar et al., pre-training on a larger and more diverse dataset helps the model to generalize better.

Comment

Dear reviewer, we would greatly appreciate it if you could review our recent response. We believe that we have effectively addressed all of your previous concerns. We actively stand by for the last hours of the discussion phase.

Review (Rating: 3)

This paper proposes a visual in-context learning framework termed IMProv. Based on an inpainting pipeline, IMProv incorporates both visual examples and task descriptions to prompt the model for different vision tasks. To train IMProv, a large-scale image-text dataset containing paired figures and captions from computer vision papers is collected by the authors. Experiments on different tasks are conducted to indicate the validity of IMProv to perform in-context inference.

Strengths

  1. Integrating text prompts into the visual in-context learning framework is interesting and the experimental results demonstrate its effectiveness.

  2. This paper proposes a large-scale image-text dataset, the Semantic Scholar Computer Vision dataset (S2CV), which is beneficial for the multimodal learning community.

Weaknesses

  1. Obviously, there is a large difference between figure captions from papers and the prompts used at inference time. For example, the text prompt may not appropriately describe the task during training. In the paper, I do not see any discussions regarding this issue.

  2. In Table 3, methods are trained with different datasets, thus leading to an unfair comparison. The authors could report the performance of IMProv trained on CCVF.

  3. For the tasks of X-to-images and images-to-X, the evaluation is not rigorous. For example, LPIPS cannot assess the performance of semantic segmentation. The reported results do not support the claim that IMProv can well generalize to these standard computer vision tasks. Commonly used metrics for these tasks should be adopted for evaluation. In addition, the results of X-to-images look poor, where the generated images are blurry and lack details.

  4. For the ablation study, I have a few concerns.

    a) Dataset ablation: This is not a valid ablation. IMProv trained with different datasets can indicate the benefits brought from larger datasets, not different methods trained with different datasets.

    b) The authors claim that Figure 4 suggests a trade-off between visual and text prompts. However, the visual prompts are the same for a query in Figure 4.

    c) In Figure 5, what are the specific settings for these variants? “w/o text prompt” means no text prompt during 1) training and inference or 2) just inference? I think only the former could showcase the importance of incorporating text prompts.

  5. There are no comparisons with other SOTA in-context learning methods such as Painter [1*] (only a small comparison for the task of colorization is given in the appendix which is far less than enough) and Prompt Diffusion [2*].

    Overall, I think the paper presented an interesting idea for multimodal in-context learning. However, significant flaws in terms of experiments (see above points 2-5) make the claims of this paper very unconvincing.

    [1*] Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023.

    [2*] In-context learning unlocked for diffusion models. arXiv, 2023.

Questions

See weaknesses. Questions are embedded into weakness points.

Comment

Thank you for the review and suggestions. We address your comments below.

Q: Discussion on the difference between figure captions from papers and the prompts used at inference time.

A: Although the training and testing prompts are different, the model is able to generalize to the testing prompts by training on large-scale datasets. During pre-training, the model learns to extract meaningful representations from the noisy data. For example, CLIP's image and text encoders are also pre-trained on noisy data, yet they transfer zero-shot to different images and texts at inference time. Moreover, according to Figure 17, the captions in our dataset include terms like "segmentation", "sketch", "colorization", etc., so the gap between training and inference is not large.

Q: In Table 3, methods are trained with different datasets, thus leading to an unfair comparison. The authors could report the performance of IMProv trained on CCVF.

A: In Table 4, we report the accuracy of MAE-VQGAN and IMProv trained on the same dataset. CCVF simply extends CVF by extracting the captions of the figures; CCVF and CVF share the same set of images. MAE-VQGAN only uses images, while IMProv is trained on both text and images. So rows 1 and 2 in Table 4 are a fair comparison between MAE-VQGAN and IMProv, indicating that the text prompt can improve model performance.

Q: For the tasks of X-to-images and images-to-X, the evaluation is not rigorous. For example, LPIPS cannot assess the performance of semantic segmentation. The reported results do not support the claim that IMProv can well generalize to these standard computer vision tasks. Commonly used metrics for these tasks should be adopted for evaluation. In addition, the results of X-to-images look poor, where the generated images are blurry and lack details.

A: In [2*], Wang et al. only reported MSE for all the vision tasks, and we follow them in using a single metric for all tasks. The images from InstructPix2Pix are generated with Stable Diffusion, so there is no ground-truth annotation for the depth/edge/segmentation. We will report both LPIPS and MSE in the updated version.

Q: Dataset ablation: IMProv trained with different datasets can indicate the benefits brought from larger datasets, not different methods trained with different datasets.

A: We report the results of IMProv (S2CV+LAION) in the table below under the same "Random Class" setting. CCVF simply extends CVF by extracting the captions of the figures; CCVF and CVF share the same set of images. MAE-VQGAN only uses images, while IMProv is trained on both text and images. IMProv (CCVF) improves over MAE-VQGAN (CVF) by more than 2 points, which indicates that the text prompt helps when the visual prompt is randomly sampled from different classes. IMProv (S2CV + LAION) outperforms IMProv (CCVF + LAION) by 1.7 points, which justifies the effectiveness of our proposed S2CV dataset.

| Model | Avg. mIoU |
|---|---|
| MAE-VQGAN (CVF) | 23.52 |
| IMProv (CCVF) | 26.13 |
| IMProv (CCVF + LAION) | 36.29 |
| IMProv (S2CV + LAION) | 38.07 |
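For reference, the mIoU numbers above are averages of per-example foreground IoU; a minimal sketch of that computation (the thresholding and averaging are assumptions about the protocol, not the authors' code) is:

```python
# Minimal sketch: foreground IoU and its mean over a set of predictions.
import numpy as np

def binary_iou(pred, gt, thresh=0.5):
    """pred: HxW soft mask in [0, 1]; gt: HxW binary mask."""
    p, g = pred > thresh, gt > 0.5
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else 1.0

def mean_iou(preds, gts):
    return float(np.mean([binary_iou(p, g) for p, g in zip(preds, gts)]))

# Toy example: predicted 2x2 patch vs. ground-truth 2x3 patch -> IoU = 4/6
pred = np.zeros((4, 4)); pred[:2, :2] = 0.9
gt = np.zeros((4, 4));   gt[:2, :3] = 1.0
print(round(binary_iou(pred, gt), 3))   # 0.667
```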

Q: The authors claim that Figure 4 suggests a trade-off between visual and text prompts. However, the visual prompts are the same for a query in Figure 4.

A: In Figure 4, we keep the query the same, and change the support from “without support” (first 2 columns) to “with support” (last 4 columns). We provide more examples of the trade-off between visual and text prompts in Figure 9, with different class random sample prompts, same class random sample prompts, and nearest neighbor prompts.

Q: In Figure 5, what are the specific settings for these variants? “w/o text prompt” means no text prompt during 1) training and inference or 2) just inference? I think only the former could showcase the importance of incorporating text prompts.

A: We mean no text prompt (an empty string) during inference. Following MUSE, we randomly drop 10% of the text prompts during training, so our model is able to generalize to the "w/o text prompt" setting at inference time.
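The dropout scheme mentioned above amounts to replacing a fraction of captions with an empty string during training; a minimal sketch (with a hypothetical training loop and the 10% rate from the reply) is:

```python
# Minimal sketch: random text-prompt dropout during training, so the model also
# learns an unconditional ("no text prompt") mode. Names are illustrative.
import random

TEXT_DROP_PROB = 0.1   # 10% of captions replaced by an empty string

def maybe_drop_caption(caption: str) -> str:
    return "" if random.random() < TEXT_DROP_PROB else caption

# for image, caption in loader:                  # hypothetical training loop
#     caption = maybe_drop_caption(caption)
#     text_emb = text_encoder(caption)           # empty string -> null prompt
#     loss = inpainting_model(image, text_emb)
```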

Q: There are no comparisons with other SOTA in-context learning methods such as Painter [1*] (only a small comparison for the task of colorization is given in the appendix which is far less than enough) and Prompt Diffusion [2*]

A: We acknowledge that our approach performs worse than these methods on their benchmarks because they are trained in a fully supervised way with ground-truth annotations. In contrast, our model is trained on noisy data from papers. We plan to explore the combination of supervised and unsupervised data in future work.

Comment

Dear Reviewer wsst,

Thank you so much again for the detailed feedback. We're approaching the end of the author-reviewer discussion period. However, there are no responses yet to our rebuttal. Please do not hesitate to let us know if there is any further information or clarification we can provide. We hope to deliver all the information in time before the deadline.

Thank you!

Comment

Thanks for the response. I still have some concerns as follows:

  1. For the tasks of images-to-X, since there is no ground truth for the dataset created by PromptDiffusion, why not adopt standard benchmarks (e.g., NYUv2 Dataset for depth estimation)? Current results evaluated via LPIPS or MSE do not make much sense to me.

  2. For the tasks of X-to-images, at least FID is an available and suitable metric to leverage. In addition, the authors claimed that they followed PromptDiffusion to only report MSE for all the vision tasks, but no comparison between this work and PromptDiffusion was given.

Given the unconvincing results and insufficient comparisons with existing baselines, I am still inclined to reject the current version of this paper.

Comment

Thanks for your reply! We address your concerns as follows.

Q1: Since it is not trivial to convert the RGB outputs to standard benchmark formats within the rebuttal time limit, we follow PromptDiffusion and report only MSE and perceptual scores. We will evaluate on standard benchmarks in a future version.

Q2: We report the comparison with PromptDiffusion in the table below.

Each cell reports LPIPS / MSE.

| Method | Depth -> Image | Image -> Depth | HED -> Image | Image -> HED | Seg -> Image | Image -> Seg | Normal -> Image | Image -> Normal |
|---|---|---|---|---|---|---|---|---|
| IMProv | 0.61 / 4.60 | 0.52 / 2.69 | 0.51 / 3.83 | 0.46 / 1.54 | 0.59 / 3.35 | 0.50 / 2.61 | 0.56 / 3.63 | 0.48 / 2.03 |
| PromptDiffusion | 0.36 / 1.76 | 0.31 / 1.13 | 0.30 / 1.41 | 0.20 / 0.59 | 0.39 / 2.11 | 0.51 / 2.10 | 0.44 / 2.42 | 0.74 / 2.17 |

Although IMProv underperforms PromptDiffusion on most tasks (except Image -> Normal), we would like to emphasize that PromptDiffusion explicitly uses paired datasets to train in a fully supervised way, including depth, HED, and segmentation annotations, while ours only uses noisy data from Semantic Scholar and LAION. As mentioned among the failure cases in the PromptDiffusion paper, their model fails to generalize to some unseen computer vision tasks, e.g. Image -> Normal, since there are no such training data pairs in their training set. However, even though our model is trained with noisy data from the web, it still generalizes to output normal maps. Moreover, PromptDiffusion initializes its weights from ControlNet (Stable Diffusion 1.5), which is already a strong model for image generation, while our IMProv is trained from scratch without annotated data.

Review (Rating: 5)

In this paper, the authors expand upon Vision prompting via image inpainting by introducing a new textual prompt modality. Their method, IMProv, is capable of conducting image-to-image translation tasks based on a textual task description and a few input-output visual examples. The authors introduce two datasets to support their approach. They demonstrate that incorporating a text prompt and utilizing larger datasets results in improved performance.

Strengths

  • The method is straightforward without unnecessary complexity. The model architecture closely resembles Bar et al. to minimize additional variables.
  • The authors conduct a quantitative analysis of various factors influencing the performance of this method, such as training data size and source (including more data and diverse standard data), different prompt designs, and more.
  • The proposed method offers flexibility as it can accept either a textual prompt, a visual prompt, or both.

Weaknesses

  • This paper exhibits limited novelty since the proposed method is primarily a straightforward extension of Bar et al.'s work.
  • Qualitative analyses are relatively scarce within the paper.
  • It appears that there may be a trade-off between textual and visual modalities, with the advantages of a text prompt being less pronounced when paired with better visual prompts.

Questions

Questions:

  • The authors explore various visual prompts, such as switching from a random class to a nearest neighbor (NN) class. It raises the question of how performance would be affected if the visual prompt is incorrect (for example, an example from a different task), and how this compares to using a textual prompt alone.
  • In the appendix, the authors compare their method to SD. However, it may not be a fair comparison. It would be more equitable if the authors also fine-tuned SD on the CCVF and S2CV datasets to assess whether the SD model can correctly inpaint in those scenarios.
  • Bar et al. discussed the limitations of their methods, including task ambiguity, non-aligned input-output, and out-of-distribution decoding. It would be interesting to explore whether these issues can be mitigated by using a textual prompt (for addressing task ambiguity and non-aligned input-output) and by employing more data (for handling out-of-distribution decoding).
  • Have the authors considered how different text encoders might affect the model's performance?
  • The authors examine the quality of visual prompts versus textual prompts. Both the authors and Bar et al. have investigated how performance changes with the number of support pairs. It would be valuable to understand the trade-off between the number of support pairs and the use of textual prompts.
  • It appears that the authors conclude from Table 3 that S2CV+LAION results in better performance. Although I think the conclusion is likely true, it is not rigorous to say so, since IMProv benefits from both textual prompts and more data. An interesting experiment could involve retraining MAE-VQGAN with CCVF+LAION data and S2CV+LAION data for a fair comparison.

Minor:

  • The “textual prompts ablation” section is not only about textual prompts but both textual and visual.
  • It's worth mentioning that the loss used is similar to MRM (Masked Region Modeling) in UNITER. A citation could provide relevant context: Chen, Yen-Chun, et al. "UNITER: Universal image-text representation learning." European Conference on Computer Vision, Springer, 2020.

Final rating: I maintain my original rating. While the authors provide numerous qualitative examples, which is good, the paper still lacks both qualitative and quantitative analyses, as I pointed out in my questions. Strengthening these aspects would enhance the paper's overall quality and impact.

Comment

Thank you for the review and suggestions. We address your comments below.

Q: This paper exhibits limited novelty since the proposed method is primarily a straightforward extension of Bar et al.'s work.

A: In this work, our goal is to explore the trade-off between text and image prompts, building on Bar et al.'s work. Beyond that, we show that text prompts can be used to improve model performance, and we also collect a large dataset for multimodal in-context learning.

Q: Qualitative analyses are relatively scarce within the paper.

A: We provide extensive qualitative analysis in Figures 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, and 14, both in the main paper and in the supplementary material.

Q: It raises the question of how performance would be affected if the visual prompt is incorrect, and how this compares to using a textual prompt alone.

A: There are no guarantees whether the visual or text prompt would affect the model more. In Figure 6, we provide an example of inconsistent visual and text prompts, and in this example, the model is more likely to follow the text prompt (generate a letter "T").

Q: In the appendix, the authors compare their method to SD. However, it may not be a fair comparison. It would be more equitable if the authors also fine-tuned on CCVF and S2CV datasets to assess whether the SD model can correctly inpaint in those scenarios.

A: We agree that off-the-shelf SD is not directly comparable with our method. We include its failures to motivate the need to collect S2CV for large-scale multimodal in-context learning.

Q: Whether the limitation of Bar et al. can be mitigated by using a textual prompt.

A: We show the results on these failure cases in Figure 16. For some of the failure cases in Bar et al., our IMProv can successfully address them with text prompts, e.g., the cat colorization and bottle segmentation under "Task ambiguity". Our model also successfully generates the image that moves the orange to the center, though it fails on the other non-aligned input-output example.

Q: Have the authors considered how different text encoders might affect the model's performance?

A: We tried both CLIP and T5 text encoders and found that the larger T5 text encoder does not improve performance: under the "Random Class" visual prompt setting, T5 reaches 33.26 mIoU versus 36.29 mIoU for CLIP. We therefore use CLIP for all other experiments to save computation.

Q: The trade-off between the number of support pairs and the use of textual prompts.

A: We plot the mIoU with respect to the number of supports in Figure 15. We run the experiments under the 4x4 grid setting of Table 8, with "Random Class" support images. Similar to Bar et al., mIoU goes up as we increase the number of visual support examples. It is worth noting that when there are only 1 or 2 examples, the text prompt does not help; this is because the model cannot generate meaningful segmentation with so few visual supports due to the resolution (see Figure 13 for details). When the number of visual supports is greater than 3, the text prompt consistently improves the results, which implies that the benefit of the text prompt is orthogonal to the number of visual support pairs.
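To make the grid setting concrete, here is a hypothetical sketch of how support input-output pairs and a query might be tiled into a 4x4 visual prompt with the answer cell left for inpainting; the layout and cell size are assumptions, not the paper's exact construction.

```python
# Hypothetical sketch: assembling a 4x4 in-context grid from support pairs and
# a query; the bottom-right cell is the region the model must inpaint.
import numpy as np

def make_grid_prompt(supports, query, cell=112, rows=4, cols=4):
    """supports: list of (input, output) uint8 arrays of shape (cell, cell, 3);
    query: uint8 array of shape (cell, cell, 3)."""
    canvas = np.zeros((rows * cell, cols * cell, 3), dtype=np.uint8)

    def paste(idx, img):
        r, c = divmod(idx, cols)
        canvas[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = img

    flat = [img for pair in supports for img in pair]   # support inputs/outputs
    for idx, img in enumerate(flat[: rows * cols - 2]):
        paste(idx, img)
    paste(rows * cols - 2, query)                       # query next to the blank cell
    mask = np.zeros((rows * cell, cols * cell), dtype=bool)
    mask[-cell:, -cell:] = True                         # cell to be inpainted
    return canvas, mask

# With few support pairs, most of the canvas stays black, which is consistent
# with the low-mIoU behaviour discussed in this thread.
```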

Q: Fair comparison for S2CV+LAION results.

A: In Table 4, we report the accuracy of MAE-VQGAN and IMProv trained on the same dataset. CCVF simply extends CVF by extracting the captions of the figures; CCVF and CVF share the same set of images. MAE-VQGAN only uses images, while IMProv is trained on both text and images. So rows 1 and 2 in Table 4 are a fair comparison between MAE-VQGAN and IMProv, indicating that the text prompt can improve model performance. To compare S2CV with CCVF, we additionally report the results of IMProv (S2CV+LAION) in the table below (Table 10 in the supplementary material) under the same "Random Class" setting, averaging mIoU over 4 splits. IMProv (S2CV + LAION) outperforms IMProv (CCVF + LAION) by 1.7 points, which justifies the effectiveness of our proposed S2CV dataset.

| Model | Avg. mIoU |
|---|---|
| IMProv (CCVF + LAION) | 36.29 |
| IMProv (S2CV + LAION) | 38.07 |
Comment

I appreciate the authors' reply: failure cases in Bar et al; the tradeoff between visual prompts and text prompts. Here are my followups.

  1. Visual prompts vs. text prompts: I don't fully understand the explanation that "the model couldn't generate meaningful segmentation with a small number of visual supports due to the resolution". One would expect that with a small number of visual supports, text prompts could compensate somewhat; instead it appears that, counterintuitively, they may have a detrimental effect.
  2. Additional analyses: Thanks to the authors for their comprehensive examination of failure cases from Bar et al. However, I would like the authors to delve deeper into the unique and unforeseen properties of their model. Exploring these aspects could provide valuable insights.
  3. Incorrect visual prompts: I appreciate the insights presented in Figure 6, which is why I raised the question. It would be beneficial if the authors could furnish additional quantitative results to shed light on how the model chooses between visual prompts and textual prompts.
Comment

Thanks for your reply. We address the follow-up concerns as below.

Q1: We would like to clarify that, as Table 8 and Figure 15 show, the mIoU improves as the number of support examples increases, both with and without the text prompt. Moreover, adding the text prompt further improves the mIoU by 4 points on top of 7 visual prompts. When there is only one visual prompt, the results with and without the text prompt are very similar. One reason for the low mIoU in that regime is that most input pixels are simply black, so the model tends to predict black pixels and ignore the support and query images.

Q2: We would like to emphasize that, by conditioning on text, IMProv could address the ambiguity issue of the visual prompt. Specifically, as shown in Table 4 and Figure 5, when the input visual prompt is sampled from a random class, we show that more relevant text prompts could significantly improve performance.

Q3: Thanks for the suggestions. We will provide more quantitative results on how the model chooses between visual and textual prompts in a future version. To systematically study the effect of prompts from different modalities, we plan to construct a test set with contradicting visual and text prompt pairs and measure the success rate of each prompt type.

Comment

Dear Reviewer yR1X,

Thank you so much again for the detailed feedback. We're approaching the end of the author-reviewer discussion period. However, there are no responses yet to our rebuttal. Please do not hesitate to let us know if there is any further information or clarification we can provide. We hope to deliver all the information in time before the deadline.

Thank you!

Comment

Dear reviewer, we would greatly appreciate it if you could review our new response. We believe that we have effectively addressed all of your previous concerns. We actively stand by for the last hours of the discussion phase.

Comment

Dear reviewers,

Towards the end of the discussion phase, we trust that our response has successfully addressed your inquiries. We look forward to receiving your feedback regarding whether our reply sufficiently resolves any concerns you may have, or if further clarification is needed.

Thank you,

Authors

AC Meta-Review

This paper explores visual inpainting-based in-context learning when additional text prompts in the form of captions are available. The authors propose to train an inpainting model conditioned on visual inputs and text data. Together with a simple method for training text-conditioned image inpainting models, the authors propose a dataset (S2CV) composed of figures taken from computer vision research papers and captions.

Having read the paper, the reviews, and the rebuttal, I recommend rejecting this paper. The topic tackled in the paper is very interesting, the experimental setup is very challenging (using data in the wild, as opposed to methods such as Painter), but the presentation of the work could largely be improved. From the current version, the hypothesis the paper is trying to test needs to be clarified: Is it the importance of text data to solve in-context tasks? Is it the quality of the proposed conditioning model? Is it the quality and reliability of the web-crawled dataset (S2CV)? Too many confounding variables make the presentation too entangled. Starting the experimental section with an apples-to-oranges comparison with IMProv (S2CV+LAION) versus MAE-VQGAN (CVF) sets the tone.

I acknowledge that more ablations are available, given the other experimental results and some results provided in the rebuttal. Still, it requires significant effort from the reader. The effort required to consolidate and clarify the paper’s presentation justifies recommending rejection. I encourage the authors to improve the presentation and submit this work to another venue.

Why Not a Higher Score

The presentation of the paper needs to be improved. The contributions are intertwined and confusing. The first result in the paper is an unfair comparison between results obtained with another architecture, on another dataset, with different modalities. As stated in the meta-review, I recommend improving the presentation by more carefully testing the different contributions of this work. If the goal of this work were to set a new absolute state of the art in that context, then all bets would be off and it would be OK to compare that way; but that is not the case, as the paper does not compare to state-of-the-art systems such as Painter.

Why Not a Lower Score

N/A

Final Decision

Reject