Average rating: 5.5 / 10 · Rejected · 4 reviewers (min 5, max 6, std 0.5)
Ratings: 6, 5, 6, 5
Average confidence: 3.0
ICLR 2024

FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback

Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

To enable the generation of high-quality figure captions, we introduce FigCaps-HF, a new framework for figure-caption generation that can incorporate domain-expert feedback to generate captions optimized for reader preferences.


Keywords
Figure Caption Generation, Image-to-Text Generation, Reinforcement Learning using Human Feedback, Figure-Caption Benchmark, Human Feedback

Reviews and Discussion

Review (Rating: 6)

The paper proposes a novel framework and benchmark dataset for improving figure caption generation using human feedback. The reinforcement learning with human feedback (RLHF) framework uses a small amount of expert annotations about the quality of figure-caption data and uses them to fine-tune figure-caption generation models. Extensive experiments show that the RLHF framework outperforms standard fine-tuning for various models such as BLIP and ViT+GPT2. For example, BLIP-RLHF achieves a 35.7% gain in BLEU over fine-tuned BLIP. Moreover, the authors release a new benchmark dataset of figure-caption pairs labeled with human feedback dimensions such as helpfulness, takeaway, and OCR to promote further research in this area.

Strengths

Originality

The paper presents a new RLHF framework that utilizes limited human feedback to optimize caption generation models. The technique of learning an "oracle" model to predict feedback scores at scale is creative. Applying offline RL methods like upside-down RL in this context is also novel.

Quality and Clarity

The authors conduct extensive well-designed experiments on multiple models, ablations, and metrics to demonstrate the effectiveness of their approach. The paper is well written and easy to follow.

Significance

The focus on aligning figure captions to human preferences is significant for spurring progress in vision-language models for scientific literature. The paper proposes a feedback-based approach to align captions to human preferences, and the annotated benchmark dataset will significantly help research in this domain.

Weaknesses

Please refer to the questions section for discussion of weaknesses. There is nothing in particular I would like to point out here.

Questions

Questions

  1. From Figure 2 (and also intuitively), it seems like captions that are short are often uninformative and hence unhelpful. Would it make sense to add a baseline where the model does not incorporate the Oracle model but just uses this heuristic-based "good" or "bad" token during training?

  2. The improvements on ViT+GPT2 are very minor (Table 2). I wonder whether the CLIPCap model would be a better backbone, as it learns an adapter to transform image features into the language-model space, which could be combined with the required prompts or the human-feedback scores.

  3. There is relevant related work missing, along with baselines: Matcha, ACL'23 (https://arxiv.org/pdf/2212.09662.pdf) and Deplot, ACL'23 (https://arxiv.org/pdf/2212.10505.pdf). Although neither of these works focuses on caption generation, they are very relevant to the ChartQA task and can be used as backbones fine-tuned for figure caption generation.

Minor

  1. Make it clear in the main paper that the BLEU metric is BLEU-4.

  2. The information about what the human scores look like comes quite late in the paper, after Sections 3.3/3.4. A very brief overview of these scores in the introduction would be helpful to the reader.

Review (Rating: 5)

This paper proposes FigCaps-HF, a novel RLHF framework that leverages domain expert feedback to generate high-quality figure captions tailored to reader preferences. Empirical results show that the generated captions perform better than BLIP fine-tuning. The authors also release the benchmark and human feedback data.

Strengths

  1. The proposed pipeline holds promise for enhancing image captions, with compelling results demonstrated on smaller datasets.

  2. The comparison to the fine-tuned BLIP model highlights the superior performance of the RLHF method, as evident in both human evaluations and standard metrics.

  3. The release of human feedback datasets is a valuable contribution that can benefit a broad range of research communities. The authors also provide a detailed human data collection interface.

Weaknesses

  1. The experiments conducted in this study are limited in scale. The choice of a relatively weak baseline and potential data distribution shifts raise concerns about the fairness of the comparisons.

  2. The main technical novelty seems to be the human feedback prediction model. However, there is limited discussion of why such an explicit prediction model can help RLHF. A fair comparison would show how the proposed method performs better than an end-to-end RLHF system.

Questions

The authors mention: "As can be seen from the results, our model is able to achieve good results on the validation set. This highlights that our human-feedback prediction model demonstrates out-of-sample generalization and proves the statistical significance of our model." How different is the data distribution? How would you measure the generalization ability?

Comment

We thank the reviewer for acknowledging our proposed benchmark and RLHF framework. We address the reviewer's concerns below:

  • Clarification on out-of-sample generalization: For the human-feedback prediction model, we do not claim that it can generalize to out-of-distribution data. Instead, we claim that it shows out-of-sample generalization. We evaluated the out-of-sample generalizability of our human-feedback prediction model in the following experiments:

    1. We compared the statistics of the trained feedback-prediction model's predictions on the training set with those of the human-annotated data. We show the results below. From the table, we can see that the model predictions are similar to the human-annotated data.
    | Source    | # Fig-Caption Pairs | Human Feedback | Median | Mean | Std  | Min   | Q1   | Q3 |
    |-----------|---------------------|----------------|--------|------|------|-------|------|----|
    | Actual    | 438                 | Helpfulness    | 3      | 3.01 | 1.19 | 1     | 2    | –  |
    |           |                     | Takeaway       | 2      | 2.16 | 1.22 | 1     | 1    | –  |
    |           |                     | Visual         | 2      | 2.11 | 1.08 | 1     | 1    | –  |
    |           |                     | OCR            | 4      | 3.83 | 0.80 | 1     | 4    | –  |
    | Predicted | 106,834             | Helpfulness    | 2.89   | 2.89 | 1.07 | -1.27 | 2.17 | –  |
    |           |                     | Takeaway       | 1.95   | 2.06 | 1.03 | -1.06 | 1.33 | –  |
    |           |                     | Visual         | 1.91   | 2.02 | 1.01 | -1.23 | 1.31 | –  |
    |           |                     | OCR            | 3.88   | 3.84 | 0.83 | 0.018 | 3.32 | –  |
    2. We also conducted a 5-fold cross-validation experiment using our annotated samples. We constructed train/validation splits of the original 438 annotated samples by randomly sampling 219 samples as the training set and using the remainder as the validation set. We report our results (mean squared error and standard deviation over 5 runs) below (a minimal sketch of this protocol is given below):

    Helpfulness: 0.082 ± 0.12 | Visual: 0.076 ± 0.2 | Takeaway: 0.087 ± 0.17 | OCR: 0.095 ± 0.13
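
    For concreteness, the following is a minimal sketch of the repeated random-split evaluation described above. The `Ridge` regressor used as a stand-in for our human-feedback prediction model, the feature matrix, and all variable names are illustrative assumptions rather than our released implementation.

    ```python
    import numpy as np
    from sklearn.linear_model import Ridge            # stand-in for the feedback model
    from sklearn.metrics import mean_squared_error


    def evaluate_feedback_model(features, scores, n_runs=5, train_size=219, seed=0):
        """Mean/std of validation MSE for one feedback dimension (e.g., helpfulness)
        over repeated random 219/219 splits of the 438 annotated figure-caption pairs.

        `features` is an (N, d) numpy array of figure-caption features and
        `scores` an (N,) array of human ratings for that dimension."""
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(n_runs):
            perm = rng.permutation(len(scores))
            train_idx, val_idx = perm[:train_size], perm[train_size:]
            model = Ridge()                            # placeholder regressor
            model.fit(features[train_idx], scores[train_idx])
            preds = model.predict(features[val_idx])
            errors.append(mean_squared_error(scores[val_idx], preds))
        return float(np.mean(errors)), float(np.std(errors))
    ```

    The same routine is run once per feedback dimension (helpfulness, visual, takeaway, OCR).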

  • Baseline selection: Our main goal with this work is to provide a benchmark for developing and evaluating new methods that address the problem of model alignment with human preferences for vision-language tasks. To this end, we selected 10 baseline models from 3 different classes of models (OCR-only, Figure-only, and Figure-Caption) that have been shown to perform well on general image-to-text tasks. We believe that our current baselines and RLHF method give researchers the knowledge needed to develop better-performing models and to evaluate how different models perform when utilizing human-preference data. We therefore believe our baseline models are justified, and we hope that our proposed benchmark and RLHF method lead to further innovation in this area.

  • Utility of Human-Feedback prediction model: Both our offline RLHF formulation and online RLHF methods require a reward prediction model pre-trained on human-preference data. However, the key difference between the two approaches is that, once we infer the reward signal for each figure-caption pair, our proposed offline RLHF method no longer requires the reward model for policy optimization. We proposed the offline upside-down RL method for the RLHF formulation because it allows other researchers to compare their image-captioning models easily without needing to set up and train separate reward models.

    We do not claim that our proposed RLHF formulation is better or worse than online RLHF methods. However, as shown in previous work [1], offline RL methods are more stable, more scalable, and easier to use than online RL methods. We believe that for researchers focusing on developing new models for the image-to-text task, our RLHF framework offers a simpler way to evaluate their models on human-preference realignment tasks (a schematic form of the objective is given below).
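
    Schematically, the two stages described above can be summarized as follows (simplified notation, omitting implementation details):

    ```latex
    % Schematic summary with simplified notation: \pi_\theta is the image-to-text
    % policy, f_\phi the human-feedback prediction model, (x_i, c_i) a
    % figure-caption pair in the dataset \mathcal{D}.
    \begin{aligned}
    \hat{r}_i &= f_\phi(x_i, c_i)
      && \text{reward inferred once, offline, for every pair} \\
    \theta^{\star} &= \arg\max_{\theta}
      \sum_{(x_i,\, c_i,\, \hat{r}_i) \in \mathcal{D}}
      \log \pi_{\theta}\bigl(c_i \mid x_i, \hat{r}_i\bigr)
      && \text{reward-conditioned behavioral cloning}
    \end{aligned}
    ```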

General Comment: We truly appreciate the concerns raised by the reviewer, as we specifically focused on these questions when designing our benchmark. We ensured that we provide results with different baselines and thoroughly evaluated the feedback prediction model. We believe that our work is a strong step in benchmarking image-to-text models and RLHF methods, and we hope that the reviewer will positively reconsider their initial rating.

Review (Rating: 6)

This paper studies the problem of generating high-quality captions for scientific figures. Traditional methods follow a simple image captioning approach for this task. The paper makes two contributions: (1) the introduction of an evaluation benchmark, and (2) the first RLHF-based method for this problem. The dataset and code are open-sourced. Experimental results also suggest promising gains over baselines.

Strengths

  1. The paper presents an evaluation methodology and benchmark, which can be useful for future research in the field.
  2. Experimental results show clear gains over the baseline.
  3. The paper is well written and easy to follow.

Weaknesses

  1. The method used in the paper is a combination of existing techniques, i.e., RLHF. The technical innovation is hence limited.
  2. The applicability of the method is narrow since it targets the figure-captioning task only. It would be more inspiring if it could be extended to general image/video captioning.

Questions

I found it a bit confusing to refer to the two contributions of this paper together as a "framework", as I don't feel they share synergy.

Comment

We thank the reviewer for acknowledging the strengths of our work. We address the concerns below:

  • Confusion regarding framework description: We referred to the two components (the Human Feedback Prediction model and the Upside-Down RLHF method) as a framework because they are jointly used to align an image-captioning model's output with human preferences. The human feedback prediction model generates human-feedback scores, which are then used as the reward signal along with the state (the input figure image) and action (the output figure caption) to define the single-step trajectory of experience in a contextual-bandit environment [1]. The (reward, state, action) triplet is then used to update the policy (the image-to-text model) using the Upside-Down RL formulation [2]; a minimal sketch of this step is given after this list. However, to avoid confusion, we will modify the text to make it clearer how the two components share synergy.

  • Novelty of the work: We highlight the novelty of our work below:

    1. Our work provides a new benchmark for developing and evaluating new methods that address the problem of model alignment with human preferences for vision-language tasks. We believe that our benchmark will allow more in-depth evaluation of new RLHF methods, as the majority of current benchmarks addressing human-preference alignment are specific to the NLP literature.
    2. Furthermore, we propose a novel baseline in the form of a simple but efficient implementation of RLHF principles via offline upside-down RL. Our proposed RLHF method is novel in that it addresses the RLHF problem as offline reward-conditioned behavioral cloning. We show that this formulation is not only easy to use but also demonstrates strong performance on the proposed benchmark. Formulating the RLHF problem as offline RL also allows us to efficiently evaluate different image-to-text models, since we do not depend on the reward model for policy optimization once we have generated the (reward, action, state) trajectories. This should enable researchers to easily evaluate their own caption-generation models without worrying about the reward model.
    3. We want to highlight that the main goal of this work is to provide a new benchmark for evaluating different methods that incorporate human feedback for preference-aligned model output. To this end, our contributions (a new benchmark with human-preference data, an evaluation of different models, and a simple but effective RLHF method) should be positively considered.
  • Narrow applicability: We agree with the reviewer that having a benchmark based on general image/video-to-text tasks is important. We specifically focused on the figure-captioning task because one of our goals with the proposed benchmark is to allow researchers to evaluate different RLHF frameworks, for which we want high-quality human-feedback data that is not ambiguous. For image captioning with natural images, it is difficult to avoid ambiguity when collecting human-preference data, because different captions for a given natural image are hard to evaluate objectively: there can be multiple valid captions that correctly describe the same image. Hence the human-preference data can have high variance, making it difficult to use for downstream applications. Scientific figures, on the other hand, are more structured, and we can utilize domain experts to collect high-quality human-preference data. This allows us to create a benchmark that can be used to accurately compare different models and RLHF frameworks.

    Nonetheless, our current framework can easily be extended to general image-captioning tasks, since we use general-purpose image-to-text models and our proposed RLHF method can be applied to the new domain as well. However, we consider this extension out of scope for the reasons mentioned above.
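
As referenced in the framework description above, the following is a minimal, illustrative sketch of how the predicted feedback score can be turned into a single-step (reward, state, action) trajectory and a reward-conditioned caption target. The control tokens, the score threshold, and the `feedback_model` interface shown here are placeholders for exposition and do not reproduce our exact implementation.

```python
from dataclasses import dataclass
from typing import Any

GOOD_TOKEN, BAD_TOKEN = "<good>", "<bad>"   # illustrative control tokens


@dataclass
class Trajectory:
    state: Any      # input figure image
    action: str     # figure caption
    reward: float   # predicted human-feedback score


def make_training_example(figure, caption, feedback_model, threshold=3.0):
    """Build the single-step contextual-bandit trajectory and the
    reward-conditioned caption target used for behavioral cloning."""
    reward = feedback_model.predict(figure, caption)          # oracle feedback score
    control = GOOD_TOKEN if reward >= threshold else BAD_TOKEN
    target = f"{control} {caption}"                           # condition the action on the reward
    return Trajectory(state=figure, action=caption, reward=reward), target
```

At inference time, under this scheme, the decoder would be conditioned on GOOD_TOKEN to request a reader-preferred caption.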

General Comment: We again thank the reviewer for acknowledging our work. We appreciate the concern about the applicability of our method, as we share the view that a similar framework targeting general image/video captioning is important. We believe that our proposed benchmark is a strong step in this direction and will allow researchers to efficiently benchmark their methods.

We hope that we have answered the reviewer's concerns and that they will positively reconsider their rating.

[1] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

[2] Jürgen Schmidhuber. 2019. Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions.

Review (Rating: 5)

This paper observes that existing figure-to-caption models fall short on metrics such as helpfulness, explainability, and visual-descriptiveness. To this end, the authors introduce an RLHF framework for figure-to-caption generation that uses a small amount of actual human feedback to generate high-quality captions. They also propose a new benchmark of figure-caption pairs with caption quality scores for a better understanding of reader-aligned figure-caption pairs.

Strengths

  1. The proposed method is effective. As shown in Fig. 2, RLHF clearly improves the BLIP and ViT+GPT2 models.

  2. The benchmark of figure-caption pairs is helpful. The paper provides a new benchmark of figure-caption pairs for the research community, and the authors have done a great job with the release.

Weaknesses

  1. Time complexity analysis. This paper claims that using offline reward-conditioned behavioral cloning for model optimization is computationally efficient. This is not convincing; the authors should compare the time complexity of the offline RLHF and online RLHF methods.

  2. The proposed method is not novel enough. The RLHF framework for figure-to-text does not provide additional insight into the understanding of reader-aligned figure-caption pairs. It may not be enough as a technical contribution.

  3. The writing needs to be improved. There are many typos in the main text. e.g. figure-ti-caption -> figure-to-caption in Section 6.

Questions

As shown in weaknesses.

Details of Ethics Concerns

N/A

Comment

We thank the reviewer for acknowledging the effectiveness of our method and the usefulness of our proposed benchmark. We address the reviewer's concerns below:

  • Computational Efficiency: We stated that our offline reward-conditioned behavioral cloning method is computationally efficient compared to online RL methods for human-preference alignment in the following sense: once we have a trained reward model (the human-feedback prediction model) and have inferred the reward signal for each (figure, caption) tuple to define our (reward, state, action) trajectory, we no longer need the reward prediction model to optimize the policy. The model optimization step therefore depends only on the image-to-text model selected as the policy. In contrast, online RL methods require the initial supervised fine-tuned model, the policy model, and the reward model together to optimize the policy. Thus, from the perspective of runtime and compute memory requirements, our proposed offline RL method for learning from human feedback is computationally efficient.

    Furthermore, previous work [1], [2] in the context of natural language generation has highlighted that online RLHF methods are often unstable and substantially increase the complexity of the training setup, because online methods often require more computational resources per step due to real-time interactions with the environment and may require multiple gradient updates per step. The offline RLHF approach can be computationally more efficient per sample because it processes a fixed dataset without real-time interactions with the environment. We will clarify this better in the text.

  • Novelty Concerns and Evaluation of reader-alignment after RLHF training: Based on our qualitative and quantitative results, we find that our proposed RLHF method does improve the quality of the captions. To further evaluate the generated captions, we conducted a small-scale human evaluation experiment. Specifically, we randomly selected 100 figures from the test set of our proposed benchmark and generated captions using the BLIP and BLIP-RLHF models. We presented the triplet of figure, BLIP-generated caption, and BLIP-RLHF-generated caption (after randomizing the order of the two captions) to 10 human subjects. Each subject was asked to rank the two captions by which caption they think is better, specifically considering helpfulness, visual-descriptiveness, OCR alignment, and takeaway. To guide the subjects, we first explained each metric [helpfulness, visual-descriptiveness, OCR alignment, and takeaway] and presented each subject with 100 samples from our human-annotated dataset, including individual figures, ground-truth captions, and the corresponding metric scores (recorded on a 5-point Likert scale). From our study, we find that on average 85% of the time, the BLIP-RLHF-generated caption was selected as the better caption relative to the BLIP-generated caption (a small aggregation sketch is given below). From this small-scale study, we conclude that RLHF does improve the quality of the captions when compared to fine-tuning existing vision-language models for the task of figure-caption generation.

    We want to highlight that our work provides a new benchmark for developing and evaluating new methods that address the problem of model alignment with human preferences for vision-language tasks. We believe that the availability of our benchmark will allow more in-depth evaluation of new RLHF methods, as the majority of current benchmarks addressing human-preference alignment are specific to the NLP literature. To this end, our contributions of a new human-preference alignment benchmark, an evaluation of different models, and a simple but effective RLHF method should be positively considered.
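
    As mentioned above, the reported preference rate can be aggregated from the pairwise judgments as sketched below; the `rankings` layout (subject id mapped to per-figure choices) is a simplified placeholder for exposition, not our actual annotation format.

    ```python
    def rlhf_win_rate(rankings):
        """Fraction of (subject, figure) judgments preferring the BLIP-RLHF caption.
        `rankings` maps a subject id to {figure id: "rlhf" or "blip"}."""
        wins = total = 0
        for per_subject in rankings.values():
            for choice in per_subject.values():
                wins += (choice == "rlhf")
                total += 1
        return wins / total if total else 0.0
    ```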

General Comment: We thank the reviewer again for their helpful feedback and hope that we were able to address the concerns. We believe that our benchmark, coupled with multi-dimensional human-preference data covering various aspects of preference, and the proposed offline RLHF framework will provide researchers with valuable tools to develop and evaluate improved methods for both image-to-text models and alternative RLHF schemes. We thus hope that the reviewer will positively reconsider their initial rating.

[1] Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, and Sergey Levine. 2023. Offline RL for natural language generation with implicit language Q-learning.

[2] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.

AC Meta-Review

This paper was reviewed by four experts in the field and received mixed reviews. Reviewers liked the open-sourced code and data that facilitate future research. The common concern raised by reviewers is the method's generalization to more general captioning tasks.

The AC read the paper, the reviews, and the authors' responses carefully. The AC believed that the concerns around the limited scope and the overall method outweigh the paper's strengths. While the paper has merit, we regret that the paper cannot be recommended for acceptance at this time. The authors are encouraged to consider the reviewers' comments when revising and resubmitting the paper.

Why Not a Higher Score

While the proposed method is generic (training a human-feedback prediction model from limited human data to improve image captioning using RLHF), it is only applied to the narrow task of captioning scientific figures, limiting its interest to the broader ICLR audience.

Moreover, reviewer 3Ti9 raised some valid concerns but the authors didn't provide any response.

Why Not a Lower Score

N/A

Final Decision

Reject