Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Abstract
Reviews and Discussion
This work proposes an approach to align Vision-Language Models (VLMs) with users' gaze. To develop such a method, the project first creates a mock dataset of human gaze from image captions, called Voila-COCO. It then develops a model to integrate gaze data into VLMs. Additionally, a new dataset called Voila-GAZE is introduced.
Strengths
This paper proposes an interesting idea of aligning the attention of vision-language models to human attention indicated by fixation locations. It contains several components ranging from data collection to model development.
Weaknesses
- Motivation: The motivation for aligning VLMs to gaze attention is to enhance the models' interpretability and effectiveness (L8). However, if I didn't miss anything, the experiments do not evaluate interpretability. More importantly, it is not clear to me why textual descriptions are not enough in the provided application scenarios and what the benefits of incorporating gaze are.
- Clarity: I find it difficult to follow and understand what exactly has been done in the project. For example, Sec. 2 explains that this project uses BubbleView to collect gaze-like data. However, the motivation to do so is not clear. It is also not clear how good the transformation of trace data to pseudo gaze sequences is. Overall, the write-up and general presentation can be improved.
- Limited technical contribution: I understand the challenge when training with limited data; however, this also limits the design of the model. The encoding of the heatmap generated from gaze sequences seems to offer limited technical contribution.
- Evaluation: First of all, it is not clear what Otter and Kosmos-2 refer to and why they are chosen for comparison. It is also not clear what the GPT-4 ranking results mean. Similarly, why not directly use human preference to evaluate the performance of VOILA? What is the performance of the reward-scoring network? It seems to me that both evaluation methods themselves need to be validated first.
Questions
- Why are textual descriptions not enough in the provided application scenarios? For example, why not directly include "orange" in the question instead of referring to it with a gaze point?
- What are the benefits of incorporating gaze?
- Why not use an eye tracker to directly collect gaze instead of using BubbleView?
- What do Otter and Kosmos2 refer to? Why are they chosen for comparison?
- Why not just directly use human preference to evaluate the performance of VOILA? What's the performance of the reward-scoring network?
- Similarly, how would the GPT-ranking results compare to human preference?
Limitations
The authors have addressed the limitations to a very limited extent. There is no discussion related to the gaze modality and the limited amount of data for example.
We sincerely appreciate your critique of our work and will take your comments into consideration for future revisions. We would like to clarify a few points that may have been misunderstood, hoping this will enable you to re-evaluate our work more accurately.
W1 & Q1, Q2 on Motivation
We envision that gaze-facilitated visual question answering may take place on future wearable head-mounted devices, such as VR/AR headsets and smart glasses. During such an interaction, users may ask contextual questions that have been shown to be ambiguous by much human-computer interaction (HCI) research, such as using pronouns to refer to surrounding objects. Therefore, using "it" instead of the object name, which is the indirect question in our case, is very likely to happen in real-life scenarios. Furthermore, with advanced devices like the Apple Vision Pro utilizing gaze as a crucial interface for spatial computing, the integration of gaze tracking into Vision-Language Models (VLMs) presents an important yet unexplored challenge.
Another important use case of incorporating gaze is visual question answering for visually impaired people. The VizWiz-VQA-Grounding dataset, on which we also evaluate VOILA in Appendix I, contains questions collected from blind people, and the majority of these questions have at least one pronoun in them.
W2 & Q3 on the process of gaze collection
The main experiment in this paper is the gaze-incorporated VLM, which is Voila. To train Voila, one challenge lies in the training data, as it is hard to collect a large-scale gaze VQA dataset. Instead, we used a mouse trace dataset, Localized Narratives, as a substitute and constructed visual questions using GPT-4, thereby deriving a trace-augmented VQA dataset, which we refer to as VOILA-COCO in this paper. The above is the main workflow of our project. However, we need to show that a mouse trace can approximate a gaze trace, even though this approximation has already been used in much HCI research. This is why BubbleView is used in Section 2: we want to establish one basic assumption of our work before we dive into it. Concretely, the VOILA-GAZE dataset we collected already contains gaze trace data because it is collected with gaze-sensing smart glasses; we then use BubbleView to derive mouse trace annotations on the same dataset. By comparing the two trace modalities, we provide strong evidence for using VOILA-COCO as the training dataset for a gaze-incorporated VLM.
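For concreteness, here is a minimal sketch of how a trace (mouse or gaze) could be rasterized into a fixation heatmap of the kind the model consumes; the isotropic-Gaussian rendering, normalized coordinates, and names are illustrative assumptions, not the exact procedure used to build VOILA-COCO:

```python
import numpy as np

def trace_to_heatmap(points, height, width, sigma=0.05):
    """points: iterable of (x, y) trace coordinates normalized to [0, 1]."""
    ys, xs = np.mgrid[0:height, 0:width]
    ys, xs = ys / (height - 1), xs / (width - 1)
    heatmap = np.zeros((height, width))
    for x, y in points:
        # one isotropic 2D Gaussian per trace point (assumed rendering)
        heatmap += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    if heatmap.max() > 0:
        heatmap /= heatmap.max()  # normalize to [0, 1]
    return heatmap

# e.g. a short trace sweeping across the upper-left region of a 224x224 image
hm = trace_to_heatmap([(0.20, 0.30), (0.25, 0.32), (0.30, 0.35)], 224, 224)
```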
W4 & Q4, Q5 on Evaluation
We chose Otter and Kosmos-2 for several reasons:
- These two models each represent a typical VLM architecture. Currently dominant VLM architectures are either cross-attention based, where visual features are injected into each layer of the language backbone, or token based, where visual tokens are treated similarly to language tokens in an autoregressive manner. Otter and Kosmos-2 are both fully open-sourced, have received hundreds of citations, and represent these two types of VLM architecture respectively.
- Compared to other OpenFlamingo-based VLMs, Otter is tuned on a multi-modal in-context instruction-tuning dataset (MIMIC-IT). We believe that building upon Otter is the best choice for a user-centered QA setting like Voila, because Voila's use case also involves in-context querying and requires following the user's instructions.
- Compared to other VLMs with architectures similar to LLaVA, Kosmos-2 has grounding ability through bounding-box inputs, which is currently the most common strategy for incorporating location information, and thus forms a comparable baseline against our location-injection strategy.
There are two major concerns about using human preference:
- For the majority of questions in VOILA-COCO, the correctness of an answer can be judged given a reference answer and a ground-truth description, whereas human preference is necessary for tasks that cannot be judged objectively, such as writing or image generation.
- The goal of this paper is to establish a groundwork for gaze-facilitated VLMs, so reproducibility is very important. This calls for an automated evaluation pipeline rather than human preference, whose results may vary depending on multiple factors, such as region, religion, education, and gender. However, for a VLM that incorporates user intention, reflecting user preference during evaluation does matter. We believe the GPT-4 ranking metric can reflect user preference, since GPT-4 is trained with a large amount of user preference data through RLHF; furthermore, we believe the preference data used for GPT-4 is generally less biased. This has also been demonstrated in other works such as GPTScore [1] and GPTRank [2].
[1] Fu, Jinlan, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. "GPTScore: Evaluate as You Desire." arXiv:2302.04166 (2023).
[2] Liu, Yixin, Alexander R. Fabbri, Pengfei Liu, Dragomir R. Radev, and Arman Cohan. "On Learning to Summarize with Large Language Models as References." arXiv:2305.14239 (2023).
I thank the authors for providing the detailed clarification and appreciate the effort.
Regarding motivations: the current statement doesn't mention anything related to "interpretability and effectiveness", which was originally mentioned in L8; instead it focuses on using gaze in applied settings to refer to objects. What are the relevant HCI works that show language is ambiguous? Gaze is also ambiguous, as people naturally move their eyes, and identifying the right fixation that is intent-relevant is also non-trivial.
I might miss something here, how is gaze going to help visually impaired people?
Could you please comment on this? Thanks!
We would like to clarify the points raised and ensure the intended message of our work is conveyed effectively.
Regarding interpretability, it is indeed referenced once in the abstract to conceptually emphasize that a VLM well aligned with the user's gaze intention can yield responses that are more comprehensible and practical for human users. This concept is central to our research, and our contributions in this area are clearly stated in the introduction.
Concerning the role of gaze in assistive technology, we acknowledge in the introduction the comprehensive review by Zhang et al. [58], as well as recent human-computer interaction (HCI) research, such as GazeGPT [1], which examines similar trends. Our argument is not that gaze is the sole determinant of a model's response but rather that it is a significant signal that, when integrated with language, can enhance the user experience. Our empirical work demonstrates that current VLMs often struggle with ambiguous language, which is prevalent in everyday communication. By incorporating gaze data, our model aims to interpret user intent more effectively, a finding supported by both qualitative and quantitative evidence in our study.
It is also crucial to highlight that a substantial demographic of visually impaired individuals, including those with Amblyopia, myopia, and astigmatism, can provide accurate gaze signals despite their visual limitations. For these users, a VLM that can accurately align with their intentions can serve as an invaluable aid, a point that our research addresses and substantiates.
We appreciate your attention to these details and are happy to provide further information if required.
[58] R. Zhang, A. Saran, B. Liu, Y. Zhu, S. Guo, S. Niekum, D. H. Ballard, and M. M. Hayhoe. Human gaze assisted artificial intelligence: A review. IJCAI: Proceedings of the Conference, 2020:4951–4958, 2020.
[1] GazeGPT: Augmenting Human Capabilities using Gaze-contingent Contextual AI for Smart Eyewear.
In their paper, Voila-A: Aligning Vision-Language Models with User's Gaze Attention, the authors introduce a dataset and a new model architecture, Voila-A, a cognitively enhanced VLM. After motivating the research and introducing both the datasets and the model design, the authors present extensive experiments on the new open-source benchmark dataset. After evaluating their model and several baselines, they perform an ablation study and conclude the paper.
Strengths
New open-source dataset in a low-resource modality (gaze, trace). New open-source benchmark constructed from the dataset mentioned above. New open-source VLM (including training procedure, etc.), based on OpenFlamingo, that integrates both gaze and trace. Detailed description of the model and each layer, as well as experimental details. Mostly nicely written. Extensive ablation study.
Weaknesses
Missing comparison to previous VQA/VLM models, e.g. Sood et al. [43]; only one baseline model (Otter-Base). Unclear how hyperparameters were tuned. Presentation needs some work, e.g. lines 223, 249, 276: replace \cite{ with ~\cite{ for better readability.
Questions
What does Voila-A stand for? Why didn't you compare to more baseline models?
Limitations
Yes
We sincerely thank you for your support of our work and for your attentive reading and thoughtful suggestions.
Baseline models
Note that in addition to Otter, we also have Kosmos-2 as a baseline, as shown in Figures 5 and 6. Regarding reference [43], this is a paper published in 2020, and in their experiment description the authors state that all the models they used were published before 2019, when VLMs had not yet emerged; therefore comparing our Voila to their models may be unfair and likely to have an obvious outcome. We chose Otter and Kosmos-2 for several reasons: (1) These two models each represent a typical VLM architecture. Currently dominant VLM architectures are either cross-attention based, where visual features are injected into each layer of the language backbone, or token based, where visual tokens are treated similarly to language tokens in an autoregressive manner. Otter and Kosmos-2 are both fully open-sourced, have received hundreds of citations, and represent these two types of VLM architecture respectively. (2) Compared to other OpenFlamingo-based VLMs, Otter is tuned on a multi-modal in-context instruction-tuning dataset (MIMIC-IT). We believe that building upon Otter is the best choice for a user-centered QA setting like Voila, because Voila's use case also involves in-context querying and requires following the user's instructions. (3) Compared to other VLMs with architectures similar to LLaVA, Kosmos-2 has grounding ability through bounding-box inputs, which is currently the most common strategy for incorporating location information, and thus forms a comparable baseline against our location-injection strategy.
Presentation Issue
- Hyperparameter settings can be found in Appendix F.
- We will proofread the paper again and fix formatting issues.
This paper proposes Voila-A, an instruction-tuned VLM that integrates "gaze" information. The instruction data is constructed using tracing data from "Localized Narratives" and GPT-4 assistance, then converted to gaze-like data using the BubbleView method. The overall approach is evaluated on newly collected datasets: Voila-COCO with 1900 questions (in the test set) and Voila-GAZE with 200 questions. A GPT-4-aided ranking of Q&A is employed to compare Voila-A against Otter (image input only) and Kosmos-2 (which also takes bbox inputs), where Voila-A is shown to outperform the baselines on both the synthetic and real-world evaluation datasets.
Strengths
The authors motivate their case for a gaze-guided instruction-tuned VLM well. Understandably, collecting sufficient real-world gaze data can be challenging, so the authors propose to convert trace data from the Localized Narratives dataset to pseudo-scanpaths using BubbleView - this is a creative and realistic solution to their problem. In addition, the authors collect real-world data using the Pupil Labs Invisible eye tracker from 21 participants and show that their proposed Voila-A outperforms contextual (Otter) and bbox-guided (Kosmos-2) approaches as assessed by GPT-4.
Weaknesses
The proposed method, in terms of the Voila Perceiver Resampler and Block (VPR and VPB), generally makes sense. However, the decision on how to define the K, Q, and V is insufficiently motivated. Ideally, it would be good to see at least a simple ablation study to justify the design decisions that depart from the original Flamingo architecture.
The main results are easy to understand, especially when viewed with the Appendix in mind. However, it is a purely GPT-based evaluation, and while it is a good effort, one wonders whether there could be a more objective and quantitatively distillable way of measuring performance. For example, one could evaluate referring tasks or use a user survey to evaluate the methods against each other. A user study could be particularly helpful, as it would simultaneously evaluate the quality of the VOILA-GAZE dataset (which only consists of 200 questions).
Questions
- What is “1k gaze” (mentioned on page 3)?
- It is mentioned in Sec. 3.4 that “the g in each VPB are initialized as 0”. How do these gating values look post-training? Does the value increase in later layers, or earlier layers? Why not initialize to random or non-zero-constant values?
- Would Voila-A perform well in referring expression comprehension tasks such as RefCOCO, in comparison to Kosmos-2?
Limitations
The authors briefly discuss the limitations of their work in Appendix A.
Dear reviewer 3dvo, we sincerely thank you for your support of our work and appreciate your thoughtful suggestions that have helped us improve.
W1 Regarding the Perceiver's choices for Key (K), Query (Q), and Value (V), we provide a clearer explanation here. The primary function of this component is to condense visual information into compact latent representations, which can subsequently be used for cross-attention with the language model. In this setup, the Q is derived from the latent representations, while the K and V originate from the visual inputs. However, computing K and V from the latents as well allows for an efficient self-attention mechanism within the latents themselves. In Section 4.3.2 and Figure 16, we compare more design choices beyond the Flamingo structure.
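For illustration, a rough single-head sketch of this attention pattern: queries come from the latents, keys/values come from the concatenation of visual tokens and latents, and projected gaze-heatmap features enter as a zero-initialized, gated additive bias on the visual-token keys. The names, shapes, and tanh gating below are assumptions for illustration, not the exact implementation:

```python
import torch
import torch.nn as nn

class VoilaPerceiverAttentionSketch(nn.Module):
    """Single-head sketch: Q from latents; K/V from [visual; latents];
    gated gaze bias added only to the keys of the visual tokens."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.gaze_proj = nn.Linear(dim, dim, bias=False)  # encodes gaze heatmap features
        self.gate = nn.Parameter(torch.zeros(dim))        # zero-init: gaze branch is a no-op at start
        self.scale = dim ** -0.5

    def forward(self, latents, visual, gaze):
        # latents: (B, L, D); visual: (B, N, D); gaze: (B, N, D) per-patch heatmap features
        q = self.to_q(latents)
        kv_in = torch.cat([visual, latents], dim=1)       # K/V over visual tokens AND latents
        k, v = self.to_kv(kv_in).chunk(2, dim=-1)
        bias = torch.tanh(self.gate) * self.gaze_proj(gaze)   # tanh gating is an assumption
        pad = torch.zeros_like(k[:, visual.shape[1]:])
        k = k + torch.cat([bias, pad], dim=1)             # bias only the visual-token keys
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return latents + attn @ v                         # residual, as in Perceiver/Flamingo
```

Because K and V are computed over the concatenation, self-attention among the latents comes for free in the same attention step, which is the efficiency argument above.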
W2 Thank you for your kind suggestion. Our primary goal is to establish a reproducible automated benchmark, and as such, most of our evaluations are conducted without supplementary user surveys. To guarantee that the question-answer pairs are amenable to automatic evaluation, each reference answer has undergone meticulous review by our dedicated annotator and a rigorous double-checking process during the selection of the final test cases. We acknowledge the potential benefits of a user study on the Voila-GAZE results and will consider incorporating such a study to further validate the dependability of our automated metrics.
Q1 Note that Section 2's motivation is to demonstrate the similarity between gaze traces and mouse traces so that training with VOILA-COCO makes sense. Therefore, we collected some mouse trace data on the VOILA-GAZE dataset to compare with its gaze traces. For this comparison, the gaze trace contains 1k samples of gaze points.
Q2 We observe that the value of 'g' is larger in the middle layers, whereas it is relatively smaller in both the later and the earlier layers. We initialize 'g' to zero to guarantee that the output matches that of the original model at the start of training.
Q3 RefCOCO's referring expression comprehension task requires models to predict a bounding box for a given referring phrase; however, our VOILA model does not possess the ability to generate bounding boxes, as it is not trained on such data. Instead, we were able to test VOILA's overall referring expression comprehension ability using the VizWiz-VQA-Grounding dataset (as shown in Appendix I). The VizWiz dataset is specifically collected for answering visual questions for blind people, and the majority of VizWiz-VQA-Grounding questions contain pronouns. Therefore, to accurately answer a VizWiz question, the model must understand what the pronoun refers to in the image.
Thank you very much for your explanations and clarifications. It helps me understand the paper better.
While I like the work very much, the purely synthetic nature of the evaluation is somewhat concerning. I see that in your response to Reviewer qe53, you mention the GPTRank paper which was recently accepted at NAACL 2024. In that paper, the authors show that LLM-based evaluation and human-based evaluation can have differences. Though we are looking at different tasks (summarization vs VQA), I wonder if it is essential to include some human evaluation in a work such as Voila-A.
I understand that the GPT-4 ranking method was introduced to allow your evaluations to be reproduced.
However, would you say that in your work, sufficient evaluations were done to demonstrate the necessity of integrating gaze information into VLMs?
Thank you for your insightful comments. We acknowledge the importance of a comprehensive user evaluation and appreciate your concern in this regard.
Due to the constraints of time, we were able to conduct a preliminary user survey involving five participants. Each participant was asked to evaluate 50 queries using both Voila and Otter. The outcomes of this survey indicated that Voila had a winning rate of 52%, a tie rate of 36%, and a loss rate of 12%. These results are in close agreement with the GPT-rank findings presented in our main results, suggesting a consistent performance trend.
We recognize the value of a more extensive user study and are committed to conducting a thorough evaluation on the entire Voila-Gaze test set. We intend to include this expanded analysis in the camera-ready version of our paper, which we believe will provide a more robust validation of our findings.
The paper aims to improve the integration of VLMs in real-world applications such as AR/VR by making the interaction seamless and user-centric. This is achieved by incorporating users’ gaze information into the VLM for a more natural conversation and a frictionless experience. First, the LN-COCO dataset is proposed as a benchmark to train models for this task, after experimentally validating that mouse traces can be used as a proxy for gaze behavior. Furthermore, the authors collect and annotate VOILA-GAZE with real gaze data to be used as another test set. Second, the authors introduce the VOILA architecture which takes an image, a gaze heatmap, and a related question as input, and provides an answer as output. This model is compared against two popular VLMs on the two benchmarks using two newly introduced performance metrics. Ablation studies are also conducted to validate model design choices and training schedules.
Strengths
- The paper is really well-written, easy to read and sufficiently illustrated. The motivation is thoughtfully outlined, and the contribution is clearly stated and properly contextualized relative to prior works.
- The paper addresses an important and original research topic that is severely underexplored despite being very relevant and holding significance for certain applications.
- The introduced benchmarks VOILA-COCO and VOILA-GAZE are novel, and will be useful to the broader community working on this topic.
- Overall, the proposed Voila model seems to outperform other baselines (Otter and Kosmos-2) on both datasets according to the chosen metrics.
Weaknesses
- The biggest problem I have with the paper concerns the main experiments. The authors compare Voila to Otter and Kosmos-2, however important information about the evaluation is missing or unclear, which makes the interpretation of the results difficult:
- It is not clear whether the baseline models are trained on VOILA-COCO or simply evaluated.
- It is not explained how the authors deal with direct vs indirect questions during training and evaluation. In figures 5 and 6, are they counted as two separate instances? Or is only one of them selected? In this case, which one?
- Voila takes the gaze information as input, but it is not clear whether Otter and Kosmos-2 are fed that information as well. The comparison would be unfair in the latter case, especially since we have coreference queries with pronouns like “it” and no location information to disambiguate the intent. In L393-394, it is mentioned that Kosmos-2 can take bounding boxes as input, but does this mean it is used this way during evaluation to consider gaze input? Also, even for Otter which is not specifically designed to include location information, it would be possible to inject it in the text prompt (similar to the design in Figure 16 top-right and bottom-right)
- Some quantitative results are suspiciously surprising. In Table 2, the winning rate, which I assume is based on the GPT-4 Ranking metric, is only 1% better for Otter with direct queries (51%) compared to indirect ones. How is this possible if Otter doesn't have access to gaze information to disambiguate the question? I would expect many answers to be completely incorrect for any question that is not focused on an obviously salient part of the image. Also, all variants of Voila (including in-context prompt) underperform the base Otter on indirect queries according to the same metric, even though Voila explicitly uses gaze. The numbers for the reward score, on the other hand, are more reasonable.
- There is limited novelty in the architecture itself, since all the components are exactly like the Flamingo model. The contribution of the paper is in the introduction of the gated gaze heatmap tokens as an additive soft bias to the keys K computed from the image tokens within the Perceiver module. However, as far as I’m concerned, the value of the paper lies more in investigating and establishing the groundwork for an important research direction by proposing benchmarks, protocols and baselines for more works to follow.
Questions
- L25-27: needs a reference or experiment to back it up. How is the limited alignment with human gaze detrimental for VLMs?
- The resolution of the word clouds in Figures 10, 11, and 12 should be increased
- Why is the (key of the) gated gaze feature added to (key of the) image features instead of multiplication?
- Why is the VPR using an attention between latent embeddings, and a concatenation of latent embeddings and image tokens (+gaze tokens) instead of the more natural cross-attention from the latent embeddings to the image+ gaze tokens?
- Is the learnable gating parameter a scalar or vector?
- L274: What is meant by “the input and output embeddings of the language encoder”?
- Is the model trained for 2 (L272-274) or 3 (L737) epochs?
- Related to the previous point, the color of the Voila Perceiver Resampler module in figure 4 is inconsistent with the text. The figure says that the Perceiver is fine-tuned in the first stage, but the text says otherwise (L273-274 and L379-381). Which one is it?
- Is the learnable gate part of the "gaze-related weights" that are fine-tuned in the first step?
- What are the “linear layers” used to encode the gaze heatmap? Is it a linear projection on each patch token? or an MLP?
- Can you elaborate more on the role of [fixation] token?
Limitations
The authors properly address the limitations of their work together with useful discussions, an impact section, and a reproducibility statement.
We sincerely thank you for your support of our work, for your meticulous review, and for your thoughtful suggestions that have helped us improve.
W1
- Both Kosmos-2 and Otter are finetuned on the same training set of VOILA-COCO.
- During the data generation process, we generate one direct question and one indirect question for each query content. During the training process, we randomly select one question from each question pair, i.e., direct and indirect questions are expected to be balanced during the training stage. This training question composition remains unchanged for all tables and figures in the Experiments section. During the validation process, by default, we use only indirect questions (Figures 5 and 6, Tables 3 and 4). The only exception is Table 2, which shows the ablation results on query types: rows 1/3/5 use indirect questions and rows 2/4/6 use direct questions as the validation set.
- During evaluation, Kosmos-2 takes bounding boxes as input, following the same format as in their paper and code repo. The way we preprocess bounding boxes for Kosmos-2 is consistent with VOILA's variant in Table 3 (rows 3 and 4). Specifically, we calculate the minimum bounding box that includes the gaze trace of each question; a minimal sketch of this preprocessing is shown below.
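For reference, an illustrative sketch of this preprocessing (the normalized-coordinate convention and function name are assumptions for illustration, not the exact code):

```python
def trace_to_bbox(points):
    """points: list of (x, y) gaze/trace coordinates in [0, 1]
    -> minimum axis-aligned bounding box (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

# e.g. trace_to_bbox([(0.42, 0.30), (0.47, 0.33), (0.45, 0.38)])
# -> (0.42, 0.30, 0.47, 0.38)
```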
For the Otter evaluation, we do not directly inject gaze traces or bounding-box data because Otter has not seen such data during training. As shown in Figure 16, which you mentioned, the top-right and bottom-right settings maintain the same model architecture as Otter. These settings add special trainable tokens for the gaze trace / fixation bounding box, which can be considered a location-information-injected version of Otter. As shown in Table 3, our VOILA also surpasses these Otter variants.
In Figures 5 and 6, although we compare VOILA to a version of Otter without injected location information, this is because the main goal of our paper is to (1) examine the necessity of additional signals for generating appropriate responses in everyday scenarios, and (2) evaluate whether gaze signals offer a superior representation of user intent compared to other modalities such as bounding boxes, rather than to establish a traditional benchmark for competitive model architectures based on specific metrics.
W2 The GPT-4 ranking has three outcomes: win, tie, or lose. As shown in the "Voila vs Kosmos-2" Grounding column of Figure 5, an approximately 40% winning rate already surpasses the competitor. Therefore, in Table 2, comparing rows 1/3/5, VOILA surpasses Otter on indirect questions, and comparing rows 2/4/6, VOILA surpasses Otter on direct questions. Also, comparing rows 1/2, 3/4, and 5/6, both model variants and the prompt strategy perform better on direct questions than on indirect ones. We will add the tie/loss rates to the table in a later revision.
W3 As you kindly point out, rather than chasing technical novelty, our paper aims to implement modifications and prove them to be necessary and beneficial, establishing the groundwork for gaze-incorporated VLMs. In developing VOILA, a critical consideration is adherence to the established architecture of current VLMs, making it easy to follow and generally applicable. It is essential to avoid introducing a significant number of new parameters or making extensive structural modifications. This constraint is due to the limitations of the current gaze dataset, which does not support large-scale pretraining. Additionally, we must be vigilant in preventing catastrophic forgetting during the fine-tuning process.
Q1 Thanks for pointing this out; we will update our statement accordingly: our study on Otter demonstrates that current VLMs fail on various daily use cases because of misalignment with users' intentions. Also, recent HCI research like GazeGPT [1] incorporates a single gaze point as location information for GPT-4V in the form of bounding boxes, illustrating the improved alignment and user preference after incorporating gaze. [1] GazeGPT: Augmenting Human Capabilities using Gaze-contingent Contextual AI for Smart Eyewear
Q2 We will update the figures with high-resolution illustrations.
Q3 For better numerical stability, we choose addition instead of multiplication.
Q4 This architecture, drawing inspiration from Perceiver and Flamingo, seamlessly integrates self-attention within latent features and cross-attention between latent and visual features in a unified step.
Q5 It is a vector: a trainable multi-channel parameter with shape (number of heads, dimensionality).
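For clarity, a small sketch of how such a per-head gate could be applied to the gaze key bias; the tanh squashing, names, and sizes are assumptions based on this reply and the zero initialization described in Section 3.4, and the actual code may differ:

```python
import torch
import torch.nn as nn

num_heads, head_dim = 8, 64                          # illustrative sizes
g = nn.Parameter(torch.zeros(num_heads, head_dim))   # zero-initialized per-head gate

def gate_gaze_bias(gaze_keys):
    # gaze_keys: (batch, num_heads, num_tokens, head_dim)
    # tanh squashing follows the Flamingo-style gating convention (an assumption here)
    return torch.tanh(g)[None, :, None, :] * gaze_keys
```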
Q6 These refer to the word embedding and the final projection after the last layer of the language model.
Q7, 8, 9, 10 Sorry for the mismatch; we will revise it. To clarify: in the first stage (colored pink), we train the MLPs and the gate of the gaze inputs for 1 epoch, then tune them together with the Perceiver Resampler module for 1 epoch. In the second stage (colored orange), we fine-tune the MLPs, Perceiver, cross-attention, and word embeddings together for 1 epoch. In total, that is three epochs, as stated in L737.
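In structured form, the schedule above can be summarized as follows (the parameter-group names are placeholders for illustration, not the exact module names in the code):

```python
# Illustrative summary of the two-stage, three-epoch schedule described above.
training_schedule = [
    {"stage": 1, "epochs": 1, "trainable": ["gaze_mlp", "gaze_gate"]},
    {"stage": 1, "epochs": 1, "trainable": ["gaze_mlp", "gaze_gate", "perceiver_resampler"]},
    {"stage": 2, "epochs": 1, "trainable": ["gaze_mlp", "perceiver_resampler",
                                            "cross_attention", "word_embeddings"]},
]
```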
Q11 To maintain the model's versatility in handling scenarios without trace input, the inclusion of a special token like [fixation] is crucial for ensuring the model's robustness across diverse situations.
Thank you for taking the time to write a rebuttal. I will be taking it under advisement.
That being said, could you elaborate more on W2? I'm still not sure I'm interpreting the results properly.
My understanding from your response is that Table 2 is missing the tie and loss rates, which means the win rate alone is not enough to actually determine which method is winning. For example, comparing row 3 to row 1, VOILA is achieving a win rate of 41% compared to Otter-base on coreference queries, if the tie rate is 19% or higher, this would mean that VOILA is better, otherwise, Otter-base would be better. Is my reasoning correct? If the answer is yes, then how are we supposed to interpret those results based on the win rate alone? The scenario you describe in Figure 5 where Voila has around 40% win rate against Kosmos and is winning overall may not be happening in Table 2 if the tie rate is smaller.
Thank you.
You are correct in your understanding. As replied in W2, we acknowledge that presenting only the winning rate for each table does not sufficiently substantiate our claims in the ablation studies. To address this, we will incorporate the loss rate alongside the winning rate in each table. As the following table shows, when comparing both winning and loss rates, our observations remain consistent and valid.
Table 2
| Methods | Question types | WR | LR | Reward Score |
|---|---|---|---|---|
| Otter-base | coreference query | - | - | -1.91 |
| Otter-base | direct query | 0.51 | 0.10 | 0.02 |
| Voila | coreference query | 0.41 | 0.18 | -0.79 |
| Voila | direct query | 0.62 | 0.15 | 0.14 |
| Voila | in-context prompt + coreference query | 0.46 | 0.16 | -0.02 |
| Voila | in-context prompt + direct query | 0.77 | 0.12 | 0.20 |
Table 3
| Methods | WR | LR | Reward Score |
|---|---|---|---|
| Otter-base | - | - | -1.91 |
| Gaze as discrete position tokens | 0.19 | 0.25 | -2.44 |
| Gaze bounding box as image patch | 0.36 | 0.20 | -1.26 |
| Gaze bounding box as discrete position tokens | 0.21 | 0.22 | -1.72 |
| Voila (Gaze as heatmap) | 0.41 | 0.18 | -0.79 |
Table 4
| Layers fine-tuned | WR | LR | Reward Score |
|---|---|---|---|
| Otter-base | - | - | -1.91 |
| Otter-base vision perceiver+cross attention | 0.25 | 0.24 | -1.78 |
| Voila gaze weight | 0.24 | 0.20 | -1.52 |
| Voila gaze weight+LORA | 0.23 | 0.21 | -1.02 |
| Voila gaze weight ->perciever+cross attention | 0.41 | 0.18 | -0.79 |
Most of the reviewers appreciate the proposed method and dataset for aligning gaze and the VLM model. Although there are concurrent works, as I can see from arXiv, the proposed dataset and method make significant contributions to the community. One reviewer raised multiple concerns about the paper, and the authors replied with a detailed rebuttal. I suggest accepting this paper.