FlexCap: Describe Anything in Images in Controllable Detail
New image captioning model that can generate short or detailed descriptions of specific image areas.
Abstract
Reviews and Discussion
This paper introduces a versatile flexible-captioning vision-language model called FlexCap, capable of generating region-specific descriptions of varying lengths. FlexCap uses caption length to control the information density of the generated sentences. The paper also introduces a large-scale dataset of image-text-box triplets for this flexible region-captioning task. The proposed model outperforms baselines and achieves leading performance in this flexible captioning scenario.
Strengths
- This paper proposes a novel mechanism to control the complexity of region captions.
- It introduces a large-scale image-text-box dataset.
- It achieves SOTA VQA performance on several benchmarks.
Weaknesses
- About the dataset construction: 1) It seems that there is no human involvement in the data construction process. I worry about the correctness and diversity of the generated sentences. 2) One of the key contributions of the data is the different lengths of descriptions for a region proposal. There should be at least a statistic about the length distributions. 3) Based on Table 1(b) and the cases presented in Figure 5, is the max length 8, and is it enough for a complex region/object?
- About the baselines: I think it is important to list the backbone types and parameter scale for each baseline in the result tables. For example, OWL-ViT has ViT-H/14, ViT-L/14, ViT-B/32, etc., and the CLIP model could be applied to different backbones. The parameter scale (especially of the LLMs) will significantly affect performance.
- About the evaluation: 1) Using CLIP similarity to map captions to objects is plausible but one-sided. Have you considered adopting some rule-based verbalization methods to map captions to class names? 2) Going back to the motivation, I wonder how to evaluate the information density of each generated caption? Do the authors have any insights?
- Overall, the proposed framework seems conventional. Can the authors discuss what makes the proposed model outperform current similar architectures? How and to what extent does the prefix token influence the performance?
Questions
Please see my weakness comments.
Limitations
Please see my weakness comments.
We thank the reviewer for their time and effort in reviewing our work.
We address the concerns raised below:
About the dataset construction: 1) It seems that there is no human involvement in the data construction process. I worry about the correctness and diversity of the generated sentences.
We were also curious whether a dataset generated in a fully automated manner would be useful. Regarding correctness, we checked whether the produced captions are useful for downstream tasks and found that they benefit a variety of VQA tasks as well as region classification. Regarding diversity, we draw the n-grams from the alt-text of the images rather than pre-selecting words, so the vocabulary is as diverse as the phrases found in the alt-text. Please see examples of the dataset in Figure 11.
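For concreteness, here is a minimal sketch of the automated triplet-generation recipe described above (n-grams from alt-text localized with an open-vocabulary detector). The helper names, the toy stopword filter, and the detector interface `detect_fn` are our own illustrative assumptions, not the exact pipeline from the paper:

```python
# Sketch of image-text-box triplet construction from alt-text (illustrative, not the paper's exact code).
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]
STOPWORDS = {"a", "an", "the", "of", "and", "with"}  # toy filter; the real filtering rules differ

def extract_ngrams(alt_text: str, max_n: int = 8) -> List[str]:
    """All n-grams up to max_n words, dropping candidates made only of stopwords."""
    words = alt_text.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not all(w in STOPWORDS for w in gram):
                ngrams.append(" ".join(gram))
    return ngrams

def build_triplets(image,
                   alt_text: str,
                   detect_fn: Callable[[object, str], List[Tuple[Box, float]]],
                   score_thresh: float = 0.3) -> List[Tuple[Box, str]]:
    """detect_fn stands in for OWL-ViT: given (image, phrase) it returns (box, score) pairs.

    The same box can end up with several captions of different lengths, which is what
    makes length conditioning possible at training time.
    """
    triplets = []
    for phrase in extract_ngrams(alt_text):
        for box, score in detect_fn(image, phrase):
            if score >= score_thresh:
                triplets.append((box, phrase))
    return triplets
```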
One of the key contributions of the data is the different lengths of descriptions for a region proposal. There should be at least a statistic about the length distributions.
These statistics are given in Figure 10 in the Appendix. We agree that this is important.
Based on Table 1(b) and the cases presented in Figure 5, is the max length 8 and is it enough for a complex region/object?
We have empirically found this length to be sufficient for tasks like region classification and VQA across different datasets, where our model outperforms baselines with a max length of 8. We also found that a model pre-trained with a max length of 8 words can be fine-tuned to produce longer descriptions for regions, such as those present in the Visual Genome dataset. In the Supplementary material (flexcap-spatial.html), we show that the model produces a mix of short and long captions in the style of Visual Genome; across the 40 images shown, 50 captions are longer than 8 words. Finally, it is possible to use lengths longer than 8 words for pre-training.
About the baselines: I think it is important to list the backbone types and parameter scale for each baseline in the result tables. For example, OWL-ViT has ViT-H/14, ViT-L/14, ViT-B/32, etc., and the CLIP model could be applied to different backbones. The parameter scale (especially of the LLMs) will significantly affect performance.
Yes, we agree. We will add this information to the final version of the paper.
About the evaluation: 1) Using CLIP similarity to map captions to objects is plausible but one-sided. Have you considered adopting some rule-based verbalization methods to map captions to class names?
It is possible to build a more sophisticated mapping than CLIP similarity, but this evaluation technique allows us to show that the captions generated by our model are correct. We therefore use text-image embedding matching, which is commonly used to evaluate zero-shot recognition (e.g., in CLIP). Even with this simple approach, we significantly outperform the baselines. We will add an ablation using descriptive class names in the final version of the paper.
2) Going back to the motivation, I wonder how to evaluate the information density of each generated caption? Do the authors have any insights?
We are not directly addressing information density in captions. Instead, we use caption length as a proxy for information content. We observe that captions incrementally add more information as length increases, and we expose this ability to the user as a control. A length-1 caption typically mentions just the object class name, a length-4 caption adds information about attributes, and a length-8 caption adds information about context as well as attributes. This can be seen in the examples shown in Figure 5 and in the Supplementary material (flexcap-length.html). It is possible to parse the produced captions to show that length-1 captions are mostly nouns and length 2-4 captions are adjectives plus nouns, but for longer captions that include image context we would need a more sophisticated measure of information density than parsing alone.
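As a rough illustration of the parsing-based analysis mentioned above, one could tally part-of-speech composition per target length. The snippet below is our own sketch using spaCy as an example tagger (the captions shown are made up, and the model download is an assumed prerequisite):

```python
# Sketch: tally part-of-speech composition of generated captions by target length.
# Uses spaCy purely as an example tagger (requires `python -m spacy download en_core_web_sm`).
from collections import Counter, defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_profile_by_length(captions_by_length):
    """captions_by_length: dict mapping target length -> list of caption strings."""
    profiles = defaultdict(Counter)
    for length, captions in captions_by_length.items():
        for caption in captions:
            for token in nlp(caption):
                profiles[length][token.pos_] += 1
    return profiles

# Hypothetical captions for illustration only.
profiles = pos_profile_by_length({
    1: ["dog", "frisbee"],
    4: ["brown dog playing outside"],
    8: ["a brown dog playing with a red frisbee"],
})
for length, counts in sorted(profiles.items()):
    total = sum(counts.values())
    print(length, {pos: round(c / total, 2) for pos, c in counts.most_common(3)})
```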
Overall, the proposed framework seems conventional. Can the author discuss what makes the proposed model outperform current similar architectures?
Our contributions are: endowing vision-language models with the new capability of producing length-controlled localized captions, and producing a large-scale dataset of image-box-caption triplets that can be used to train such a model. We are not proposing architectural changes to achieve length-controlled localized captions.
How and to what extent does the prefix token influence the performance?
The prefix token enables a new capability for the captioning models. The same model can be used for producing short and long captions in a controllable manner. We further show our dataset creation technique allows us to generate length-controlled captions for both small and large regions in an image. Also, in the Length Conditioning subsection in Section 2, we describe how the length token helps when there are many captions for the same box.
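To make the mechanism concrete, here is a minimal sketch of how a length prefix can be folded into the training targets and inference queries; the token spellings `<len=K>` and `<e>` are illustrative placeholders rather than the paper's exact vocabulary:

```python
# Sketch: formatting a length-conditioned caption target for next-word prediction.
# The special tokens "<len=K>" and "<e>" are illustrative; the actual vocabulary may differ.
def make_target(caption: str) -> str:
    """Training target: prepend the ground-truth length, append the end token."""
    words = caption.split()
    return f"<len={len(words)}> " + " ".join(words) + " <e>"

def make_query(desired_length: int) -> str:
    """At inference, the user picks the length; the decoder completes the rest."""
    return f"<len={desired_length}>"

print(make_target("dog playing with a frisbee"))  # <len=5> dog playing with a frisbee <e>
print(make_query(1))                              # <len=1>  -> model should emit e.g. "dog <e>"
```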
This paper proposes a vision-language model termed FlexCap, which, given a specific region in the image represented as a bounding box, outputs a description of that region in a length-controllable fashion, where the exact length of the generated description is controlled via a prefix token. First, the authors harvest a dataset of region-text pairs from existing large-scale image-caption datasets such as YFCC100M and WebLI. Specifically, they extract n-grams from the caption, apply a filtering process, and each of those n-grams is then fed to OWL-ViT to obtain a bounding box. This allows for multiple variable-length captions per region. They then train FlexCap on that dataset. Generating a large set of descriptions for many possible regions in the image allows for a complete, detailed understanding of the image, and those generated descriptions can be fed to an LLM to perform VQA by reasoning over them. Experiments are conducted on dense captioning (without localisation), zero-shot VQA (image and video), and other tasks such as image labeling, object attribute recognition, and visual dialog.
Strengths
- The idea of using controllable caption generation for specific image regions is interesting and can be beneficial for dense captioning, since some regions are more detailed than others and need more words to be described, while other regions can be described sufficiently with 1-2 words. Having this controllability can therefore be helpful and enhance efficiency, i.e., generating 3 words per region is much faster and more efficient than generating 20 words per region, especially if those descriptions are to be fed to an LLM, where the required context length decreases with fewer words per description.
- The authors show how to make use of an existing VLM to serve as an annotator. This is very helpful, especially in scenarios where we don't have a specific dataset tailored to our needs.
- Evaluation is performed on different tasks, with impressive zero-shot performance.
Weaknesses
- [W1] The novelty is limited; it only involves adding a length prefix to condition the generation of length-controllable output. However, the idea of generating region proposals from image-text pairs and then learning to describe those regions has been a hot research topic lately. Some related works:
- Osprey [R1] generates descriptions for fine-grained objects rather than coarse-grained bounding boxes, and also supports instruction-based training on those fine-grained regions.
- Kosmos-2 [R2], where, given an image-caption pair, the text is processed in a similar way to the authors' approach to extract sentence chunks (of variable length); each text chunk is then associated with a bounding box using the GLIP grounding model (the authors instead use OWL-ViT as the grounding model), and an autoregressive transformer is trained to predict the description of each region. Kosmos-2 is actually also capable of grounding, and does not require an external model to perform grounding at test time.
- The work of [R3] uses MiniGPT to extract a caption and then the nouns from that caption, as well as an OCR detector. These are fed to an LLM to imagine other possible nouns, and a BLIP model is given the cropped region to generate a caption for it, from which noun phrases are extracted.
- [R4] takes a step further and not only generates a caption for a region of interest, but also allows instruction-based training such that a user can chat about a given region of interest.
- There is also a much closer work to the authors': [R5]. However, I will not consider this paper in my decision-making process, as it is only on arXiv and not yet published in a peer-reviewed venue.
Therefore, I think this work is an incremental improvement that plugs the length-control idea (which has also been well studied, as the authors mention in the related work) into region-based captioning.
- [W2] How should one choose the length of the region descriptions? Isn't there a heuristic to estimate which regions need more words than others? What is the point of having a length-controlled method of generating region descriptions if there still needs to be a manual selection of the length?
- [W3] Controllable image captioning itself is not a new concept. The work of [R6] proposes to do this in another way, where the caption (and not the length) can be controlled. Similarly, their method can control the detail and length of the caption. How does the authors' method compare to this work?
- [W4] InstructBLIP [R7] is not compared to, and it seems to perform better than the proposed method (see GQA, for example). Also, the authors don't compare on the standard VQAv2 test-dev benchmark, which makes it hard to tell whether their work outperforms previous works. Most works report on VQAv2.
- [W5] The dataset that the authors generate, as well as the performance of FlexCap, is highly limited by the performance of OWL-ViT (since it serves as an annotation tool to extract regions). Any biases and errors in OWL-ViT will also be transferred to the dataset and FlexCap. Consequently, the model and dataset are heavily influenced by the quality of OWL-ViT.
- [W6] L178, regarding the zero-shot setting: the authors finetune the model on COCO (according to L168), but the VQA dataset (and its variants) is actually built on COCO. While the model is not finetuned on the VQA question-answer annotations, it is still trained on and sees the image, its caption, and its bounding boxes, and I think all three cover VQA well. Therefore, this method cannot be seen as zero-shot.
- [W7] For VQA, the amount of captions generated seems to be huge (128 regions × number of prefixes used); is this therefore a fair comparison with other works? Couldn't similar or better performance be achieved by the compared models by using more generated captions (e.g., by different samplings)? There is no ablation study on this, nor on the performance achieved by varying the number of generated descriptions, which I find important.
Minor/Unclear:
- Table 1a is not cross-referenced.
- Line 137, for each region, do the authors generate 20 captions for each of the (1,2,3,4) lengths? Does that equate to 20×4 = 80 captions per bounding box? Are you then averaging over all 80 captions in the CLIP text encoder?
- Line 160, what language queries are used here? Is the process similar to how the dataset is built, using N-gram processing?
- Line 168, Isn't the model natively built to describe regions/boxes rather than localize them? How is the model adapted to localization?
References:
[R1] Osprey: Pixel Understanding with Visual Instruction Tuning, CVPR 2024
[R2] Grounding Multimodal Large Language Models to the World, ICLR 2024
[R3] Towards Panoptic Visual Recognition and Understanding of the Open World, ICLR 2024
[R4] The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World, ICLR 2024
[R5] GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
[R6] Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions, CVPR 2019
[R7] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, NeurIPS 2023
Questions
Regarding [W1], there are some works [R2, R3, R4] published at ICLR 2024. I understand that they were officially published and presented shortly before the NeurIPS deadline, and therefore I will not consider them as grounds for rejection. However, the authors should take note of them and cite them, as they propose very similar work. How does the proposed method compare to [R1]? Moreover, although the length-controllable concept is incremental, I believe it is important in such systems, and the authors address this problem well.
I would also like to hear a clarification on [W2, W3, W6 and W7].
Limitations
Limitations are discussed. No special issues on negative societal impact.
Table 1a is not cross-referenced.
Thanks for catching this. We will fix this.
Line 137, for each region, do the authors generate 20 captions for each of the (1,2,3,4) lengths? Does that equate to 20×4 = 80 captions per bounding box? Are you then averaging over all 80 captions in the CLIP text encoder?
In Table 1 we report both the non-averaged accuracy, which uses the single highest-scoring caption, and the accuracy obtained by averaging the CLIP text embeddings of the top-20 captions (out of the 80 total captions). The top-k captions are chosen based on the mean log-likelihood of the generated caption. We find that using the top-20 leads to a significant boost in performance without any additional training.
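A minimal sketch of this top-k averaging for zero-shot region classification is shown below; the embedding interface, shapes, and normalization are our assumptions, with any CLIP-style text encoder standing in for `embed_text`:

```python
# Sketch of top-k caption averaging for zero-shot region classification (assumptions noted inline).
import numpy as np

def classify_region(captions, log_likelihoods, class_names, embed_text, k=20):
    """captions: generated strings for one box; log_likelihoods: mean log-likelihood per caption.

    embed_text is assumed to return a 1-D numpy embedding (e.g. from a CLIP text encoder).
    """
    order = np.argsort(log_likelihoods)[::-1][:k]            # keep the k most likely captions
    cap_emb = np.stack([embed_text(captions[i]) for i in order])
    cap_emb /= np.linalg.norm(cap_emb, axis=-1, keepdims=True)
    query = cap_emb.mean(axis=0)                              # average the top-k caption embeddings
    query /= np.linalg.norm(query)

    cls_emb = np.stack([embed_text(c) for c in class_names])
    cls_emb /= np.linalg.norm(cls_emb, axis=-1, keepdims=True)
    return class_names[int(np.argmax(cls_emb @ query))]       # nearest class by cosine similarity
```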
Line 160, what language queries are used here? Is the process similar to how the dataset is built, using N-gram processing?
We do not use language queries for this step. We use the top-128 boxes ranked by the objectness score from OWL-ViT v2, which does not require any text queries to produce an objectness score for a box. Please refer to Figure 15 for details.
Line 168, Isn't the model natively built to describe regions/boxes rather than localize them? How is the model adapted to localization?
The FlexCap model does not do localization. In this work, we are exploring a paradigm different from prior work which has usually followed describe-then-localize. We are looking at localize-then-describe. The localization can be done using object proposal methods. Then FlexCap takes each object of interest and describes them. In particular, in Line 168 we use the detection class name as the ground truth caption for fine-tuning.
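Schematically, the localize-then-describe flow looks like the sketch below; the two callables are placeholders for an object-proposal model and for FlexCap, not actual APIs:

```python
# Schematic of localize-then-describe (placeholder callables; not the actual FlexCap interface).
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]

def describe_image_regions(image,
                           propose_boxes: Callable[[object], List[Box]],
                           flexcap_describe: Callable[[object, Box, int], str],
                           num_boxes: int = 128,
                           length: int = 8) -> List[Tuple[Box, str]]:
    """Propose class-agnostic boxes first, then caption each one at the desired length."""
    boxes = propose_boxes(image)[:num_boxes]   # e.g. objectness-ranked proposals, no text queries
    return [(box, flexcap_describe(image, box, length)) for box in boxes]
```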
We express our gratitude to the reviewer for their insightful feedback and thorough review.
However, how does the proposed method compare to [R1]?
In R1, the training dataset is generated by querying GPT-4V with regions of interest found by SAM. While they do train Osprey on this dataset, the core ability to describe regions was already present in GPT-4V, whose dataset and architecture are not publicly known. In our work, we propose a ground-up way of creating a large-scale region-description dataset from alt-text data with a publicly available object detector (OWL-ViT). Furthermore, R1 shows that a language model (Vicuna) can be trained further with their dataset to obtain a visual-instruction-tuned model. We show that for describing objects we do not need a full-fledged language model: training a reasonably sized text decoder from scratch is sufficient. The other difference is length conditioning, but that was not the focus of R1.
Regarding [W1], there are some works [R2, R3, R4] published in ICLR 2024.
Thank you for sharing these references. We will add them to related work as they all are relevant.
[W2] How should one choose the length of the region descriptions?... What is the point of having a length-controlled method of generating region descriptions, if there still needs to be a manual selection of the length?
The objective is to have one model that can produce long and short descriptions as needed. For example, someone interested in recovering only object names can request length-1 captions from the model, whereas someone interested in a longer description of an object can ask for length-8 captions. In a way, our model merges the label spaces of object detection datasets and dense captioning datasets by using length conditioning.
Although there is a common-sense based length prior that most people use for describing a region, people can also describe the content with a few or a lot more words when requested. Our model has this flexibility to be queried at a desired level of detail. But by fine-tuning the model with common-sense human captions (e.g. Visual Genome), our model adjusts itself to the average human biases in information density. We show examples of this in flexcap-spatial.html in supplementary material. Although some heuristics can be introduced, we believe fine-tuning with a dataset that captures the desired information density is a more configurable approach to the output information density problem.
[W3] Controllable image captioning itself is not a new concept. ... How does the method of the authors compare to this work?
There are several differences from R6. First, R6 only handles image captioning, while we handle objects, regions, and full images through the flexibility of bounding-box conditioning. Second, their approach produces different captions by changing the order of the detected objects used as conditioning: they detect N objects in an image and show the captioner different orderings of those N objects to produce different captions. While this produces grammatically correct captions in different ways, it does not necessarily generate new information in the captions. The objective in our work is to use length conditioning as a proxy for information content. Both are valid but different objectives for conditional image captioning. Finally, our approach allows prefix-conditioned captioning, which can be used to extract attributes of interest: color, material, action, function, and text. This capability emerges from pre-training the model on a large-scale dataset, which is missing from R6.
[W6] L178, regarding the zero-shot setting, the authors finetune the model on COCO (according to L168)... Therefore, this method cannot be seen as zero-shot.
In our study, we used the term "zero-shot" to indicate that the training process did not involve any VQA question-answer annotations. The evaluation was conducted on non-train splits of COCO, ensuring that the evaluation images were not seen in any phase of training. Hence, we maintain that the setting can still be considered zero-shot. To provide greater clarity, we will state this explicitly in the final version of the paper.
[W7] For VQA, the amount of captions generated seem to be huge (128 regions × number of prefixes used), and therefore is this a fair comparision with other works?...
The baseline VQA systems are trained to produce answers using different approaches (mostly end-to-end). They do not have a system similar to ours, where we generate a large chunk of text and then deduce a single answer from it. We will add an ablation on how the number of generated descriptions affects performance.
[W5] The dataset that the authors generate, as well as the performance of FlexCap, is highly limited by the performance of OWL-ViT ...
Yes, we mostly agree with this assessment and have mentioned it in the limitations section. However, just to clarify, OWL-ViT does not define what to caption in images: this information comes from the alt-text associated with the images, in the form of n-grams. OWL-ViT mainly localizes the provided n-grams in the images using box proposals, which arguably cover almost all the interesting regions of an image. Moreover, this automated approach allowed us to scale in both size and diversity, which is often the main driver of performance, as demonstrated by many results in our experiments section.
[W4] InstructBLIP [R7] is not compared to ... Most works report on VQAv2.
To clarify any concerns, we will test our method on VQAv2 and provide the results on this dataset in the final version of the paper. We will also add InstructBLIP to the list of baselines.
I thank the authors for the rebuttal and for clarifying my concerns. Since my concerns have been addressed, I have raised my rating.
The authors should make sure to incorporate my comments, as well as the other reviewers' comments, into the final version, especially the experiments that were promised (W4, W7).
This paper introduces a versatile captioner capable of generating region-specific descriptions with controllable information density. This functionality enables dense captioning tasks and enhances visual question answering (VQA) by integrating with an LLM. The paper also presents a large-scale dataset containing image-text-box triplets, which is valuable for the community to explore region-controllable captioning capabilities. Leveraging this dataset, FlexCap can produce localized visual descriptions with adjustable lengths. By providing various localized textual descriptions of images as input to an LLM, FlexCap-LLM demonstrates strong performance on VQA tasks.
Strengths
- The proposed dataset can promote community research on visually controllable captioning, which is useful for the development of user-friendly vision-language models.
- FlexCap is easy to follow, and its controllable captioning capability, with positional information and varying information density, is beneficial for downstream tasks like VQA.
- The experiments are comprehensive, demonstrating the capabilities of region control and length control (Sec. 4.1, Sec. 4.3). The VQA results generated by FlexCap-LLM show that the generated localized descriptions form a good human-interpretable representation.
Weaknesses
- The architecture of FlexCap and its training setup lack novelty, as it is a typical transformer-based captioning model. However, this does not lead me to reject this paper, as the contributions on the task and dataset are useful.
- The authors should carefully consider their statements in the paper. While this paper achieves region and length control, there are many other controllable signals, such as mask/point control on the visual side and emotion/style control on the text side (as seen in Caption Anything [45]). Referring to FlexCap as a versatile flexible-captioning vision-language model might be an overstatement.
- In lines 66-68, the authors state that "the next-word prediction loss encourages the model to increase its score for the <e> token and decrease the score for the word playing." This statement is intuitive; however, I believe the probabilities of "<e>" and "playing" depend on the frequency of occurrence of "a dog <e>" and "dog playing ..." during training, i.e., on the training set. It would be beneficial to see the statistical probabilities of "<e>" and "playing" following "dog" to support the authors' statement.
Questions
Q1. The abstract is divided into two paragraphs (Line #6 and Line #7).
Q2. Section 4.1 on Correctness lacks a reference to Table 1(a).
Q3. Line 249 lacks a reference to Figure 8.
Q4. Regarding weakness 1, can the authors provide some statistics to support their statement?
Limitations
It is commendable that the authors discuss the biases present in FlexCap. However, FlexCap is quite basic, and FlexCap-LLM is not an end-to-end model. It would be intriguing to see how these components could be combined into a fully integrated vision-language model. As mentioned in Weakness 3, FlexCap lacks sufficient flexibility; incorporating more visual and textual controls is an important area for further development. Still, it does take a step forward.
We thank the reviewer for their feedback and meticulous attention to detail in reviewing our paper.
We address the concerns raised by them below:
Referring to FlexCap as a versatile flexible-captioning vision-language model might be an overstatement.
We used this description to highlight the fact that FlexCap can be conditioned on a bounding box, a desired caption length, and a caption prefix to produce diverse captions of objects and regions in images. We show many examples of this flexibility in Fig. 5, Fig. 8, and Fig. 13 in the paper and in the supplementary webpage. That said, we agree that, when implemented effectively, there could be many more spatial and textual controls for captioning, and with FlexCap we take a significant step towards properly exploring this space. We will also soften the claims on versatility in the abstract by focusing more on the specific controls that we introduce (i.e., bounding boxes, length, and caption prefixes) while providing the larger context on potential spatial and textual controls.
Also, while theoretically the CaptionAnything system can produce outputs of different lengths, we tested it for length conditioning and it does not consistently return outputs of the desired length.
Regarding weakness about “statistical probabilities of "<e>" and "playing" following "dog" to support the authors' statement.”, can the authors provide some statistics to support their statement?
We agree that whether this occurrence is a problem depends on the dataset statistics. To quantify its prevalence, we compute the following statistic: for each image box, we consider all pairs of captions and measure the fraction that share prefix words. For instance, one box has three captions, "dog <e>", "dog playing <e>", and "dog playing with a frisbee <e>", which share the prefix "dog", while another caption for the same box, "brown dog <e>", does not share a prefix with the captions beginning with "dog". Averaging this metric across all images in the dataset, we found that 30.8% of caption pairs share a prefix. The length-conditioning token helps distinguish between captions with the same prefix while also providing the model with a novel capability at inference time. When length conditioning is applied, the probability of prefix sharing decreases from 30.8% to 11.1%. We appreciate the encouragement to explore this issue quantitatively and plan to include this analysis in our paper.
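A minimal sketch of this statistic, as we reconstruct it from the description above (the prefix test and the example captions are illustrative; the "<e>" end token is omitted):

```python
# Sketch of the prefix-sharing statistic (our reconstruction of the computation described above).
from itertools import combinations

def shares_prefix(cap_a: str, cap_b: str) -> bool:
    """True if the shorter caption is a word-level prefix of the longer one, e.g. "dog" / "dog playing"."""
    a, b = cap_a.split(), cap_b.split()
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return longer[:len(shorter)] == shorter

def prefix_share_fraction(captions_per_box):
    """captions_per_box: list of caption lists, one list per image box; returns the mean fraction."""
    fractions = []
    for captions in captions_per_box:
        pairs = list(combinations(captions, 2))
        if pairs:
            fractions.append(sum(shares_prefix(a, b) for a, b in pairs) / len(pairs))
    return sum(fractions) / len(fractions) if fractions else 0.0

# The example from the rebuttal: "dog", "dog playing", "dog playing with a frisbee" all share a
# prefix with each other, while "brown dog" shares a prefix with none of them -> 3/6 pairs = 0.5.
print(prefix_share_fraction([["dog", "dog playing", "dog playing with a frisbee", "brown dog"]]))
```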
We answer the other questions asked by them below:
Q1. The abstract is divided into two paragraphs (Line #6 and Line #7).
We will merge the two paragraphs into one.
Q2. Section 4.1 on Correctness lacks a reference to Table 1(a). Q3. Line 249 lacks a reference to Figure 8.
Thanks for catching this. We will add a reference to Table 1(a) in Section 4.1 and to Figure 8 in Line 249.
The authors address most of my concerns, and the statistics provided in the rebuttal are quite interesting. I look forward to seeing them included in the final paper. After considering the author rebuttal, as well as the feedback from other reviewers and the corresponding author responses, I have decided to raise my rating. It is good to see how the proposed dataset will influence the community.
This paper unanimously receives positive ratings thanks to the novelty of the controllable-caption dataset and the convincing experiments with it. Although the overall reviews are positive, the clarifications in the rebuttal should be reflected in the final draft, including the dataset analyses and additional details of the experimental results.