DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images
Abstract
We propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images.
Reviews and Discussion
This paper concerns the modelling of human overt visual attention (gaze position) when free-viewing images. The paper identifies two problems in the existing literature: first, that models usually target fixations and scanpaths rather than raw eye trace data, and second, that models ignore individual differences. The paper presents a generative model of raw human gaze trace data, using an image-conditioned diffusion model and an alignment procedure "Corresponding Positional Embedding". The resulting gaze samples are compared qualitatively and quantitatively to existing scanpath models and saliency models. The authors also conduct a series of ablation studies to quantify the contribution to model performance of various mechanisms or design decisions.
Strengths and Weaknesses
The paper is good quality and generally clearly written, except for some details in the methods that are unclear to me. I find the approach potentially significant in advancing modelling of human gaze behavior in free viewing tasks, and the diffusion approach is sufficiently novel. The results compare generally favourably to other methods.
I think the biggest weakness of the paper is that it's not clear to me what it is about modelling the gaze trace data that seems to help (see Questions section below). In addition, while the model is motivated by the need to account for individual differences in gaze behavior, I see no evidence that this model currently actually does this. The data is described in general as collected from different subjects, but then as far as I can tell the model is only conditioned on the stimulus image, not the subject. Assuming I have not misunderstood this, then the individual differences aspect must be clearly described as a possible future extension, rather than something the model currently allows.
Questions
1. Why does this work?
A main take-home message of the paper is that using the full gaze data samples rather than fixation or fixation sequences allows learning more effective representations. Why is this? The paper currently does not seem to answer this question.
I could imagine that one possibility is that there is simply more data. But this seems unlikely, if we consider that the sample numbers are relatively small compared to larger eye movement or related datasets (e.g., SALICON).
Instead, I assume the reason is because the model can learn more about the spatial and temporal characteristics of eye movements from the raw gaze traces (for example, the curved trajectories of saccades, which the model seems to produce, or the acceleration of the eye, which predicts saccade amplitude (the main sequence)). It would be nice to know the relative contribution of these aspects to the model's performance.
To do this, the authors could train on modified datasets, in which the gaze samples are modified to linearly interpolate between sequential fixation positions in space, time or both. This would selectively remove the saccade curvature and speed information, respectively. Adding models trained on this "ablated" data would offer an explanation for what information is specifically missing from the previous training sets (that only include sequences of fixations), which would potentially inform how experimentalists collect the most "useful" eye movement data.
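For concreteness, a minimal sketch of how such "ablated" training trajectories could be constructed from detected fixations; the fixation tuple format and sampling conventions are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def interpolate_trajectory(fixations, n_samples, remove_timing=False):
    """Rebuild a raw-trajectory-like signal from fixations via linear interpolation.

    fixations: list of (x, y, onset, offset) tuples, onset/offset in sample indices.
    remove_timing: if True, also discard real saccade durations by re-timing
                   fixations uniformly, so only fixation order and position remain.
    """
    if remove_timing:
        bounds = np.linspace(0, n_samples, len(fixations) + 1).astype(int)
        fixations = [(x, y, bounds[i], bounds[i + 1])
                     for i, (x, y, _, _) in enumerate(fixations)]
    xs = np.zeros(n_samples)
    ys = np.zeros(n_samples)
    for i, (x, y, on, off) in enumerate(fixations):
        xs[on:off], ys[on:off] = x, y                       # hold position during the fixation
        if i + 1 < len(fixations):
            nx, ny, next_on, _ = fixations[i + 1]
            gap = next_on - off
            if gap > 0:                                     # straight-line "saccade" to the next fixation
                t = np.linspace(0.0, 1.0, gap)
                xs[off:next_on] = x + t * (nx - x)
                ys[off:next_on] = y + t * (ny - y)
        else:
            xs[off:], ys[off:] = x, y                       # hold the last fixation until the end
    return np.stack([xs, ys], axis=-1)                      # (n_samples, 2) ablated gaze trace
```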
2. Choice of 10 fixations per scanpath for other models
line 244: DeepGaze III and IOR-ROI are set to 10 fixations per scanpath. How do the results depend on this setting? If you classify DiffEye's trajectories into fixations, how many fixations are produced on the same stimuli?
3. Predictions too concentrated
Some of DiffEye's predictions seem too concentrated (for example: Figure 3, first two images; DiffEye samples contain no fixations on the rear of the wagon or the second person). Which of the considered metrics are most sensitive to the overall spread of predictions, and are these qualitative observations consistent with differences in these metrics?
Limitations
As above, clarify whether the paper does in fact include individual-level modelling / conditioning, or whether that is only a possible extension.
Final Justification
Given only responses to my comments, I would keep my score as is (4). However, I find the authors' responses to the other reviewers to be quite convincing, especially the inclusion of gaze statistics and the new sequence scores in response to 6vm6. (Note however that I suspect there might be something off with the distribution summaries for the direction and angle, which are circular variables). Therefore, I raise my score to 5.
Formatting Concerns
None.
We sincerely thank the reviewer for their detailed assessment and insightful questions. We are glad they recognize the potential significance and novelty of our diffusion-based approach for modeling human gaze. Below, we address the weaknesses and questions raised in the review.
I think the biggest weakness of the paper is that it's not clear to me what it is about modelling the gaze trace data that seems to help (see Questions section below). In addition, while the model is motivated by the need to account for individual differences in gaze behavior, I see no evidence that this model currently actually does this. The data is described in general as collected from different subjects, but then as far as I can tell the model is only conditioned on the stimulus image, not the subject. Assuming I have not misunderstood this, then the individual differences aspect must be clearly described as a possible future extension, rather than something the model currently allows.
The reviewer correctly points out that our model is conditioned on the stimulus image and not on a specific subject ID. We apologize for any ambiguity in our initial framing. Our primary motivation was not to model specific individuals, but rather to capture the inherent variability observed across different human subjects when viewing the same image, and we will clarify this in the final version of the paper.
Traditional deterministic models predict a single "average" scanpath, which fails to represent the stochastic nature of scanpaths. Our generative approach, by training on raw trajectories from multiple subjects, learns a distribution of plausible gaze behaviors for a given image. The goal is to generate samples that reflect the diversity seen in the human population, not to replicate a particular person's gaze pattern. As the reviewer suggests, conditioning on subject-specific information is an interesting and important direction for future research, building upon the foundation we establish here for modeling population-level gaze dynamics.
Question 1: Why does modeling raw gaze traces work better?
This is an excellent question. The reviewer rightly intuits that the benefit goes beyond simply having "more data." Our hypothesis, supported by our results, is that raw eye-tracking trajectories contain a wealth of spatio-temporal information that is discarded when the input data is reduced to scanpaths alone. For instance, in the MIT1003 dataset, the average raw trajectory has about 724 timesteps, while the average scanpath has only 8.4. This lost information includes:
- Saccadic Dynamics: The precise curvature, velocity, and acceleration profiles of saccades are lost in scanpath representations. Our model can learn these subtle but important characteristics from the continuous trajectory data.
- Suitability for Diffusion Models: Diffusion models excel at learning complex distributions over continuous, high-dimensional data. Raw gaze trajectories, which are essentially smooth time-series data, are a natural fit for this modeling paradigm. In contrast, scanpaths are discrete, lower-dimensional sequences, and modeling them directly with diffusion is less straightforward.
The reviewer's suggestion to train on linearly interpolated trajectories is a very insightful way to perform an ablation study. Such an experiment would allow us to precisely disentangle the contributions of saccade curvature and timing. It is a fantastic idea for a follow-up study. Our paper's primary contribution is to demonstrate that using a diffusion framework on the full, raw trajectory data is not only feasible but also outperforms methods that rely only on scanpaths, with the caveat that our experimental findings are limited to datasets containing the raw eye tracking data (a limitation we will make clear in the final paper).
Question 2: Choice of 10 fixations for other models.
We thank the reviewer for this important question. The decision to set the number of fixations to 10 for models such as DeepGaze III and IOR-ROI was not an arbitrary choice, but rather a constraint imposed by the models themselves. These specific baselines are designed to generate scanpaths of a fixed, predetermined length and cannot produce variable-length outputs.
While we acknowledge that evaluation metrics can be sensitive to scanpath length, our empirical tests showed that the performance of these baselines did not change significantly when this parameter was varied. We therefore adopted a length of 10 for consistency across these models.
This constraint highlights a key advantage of our method. DiffEye generates a continuous, fixed-duration trajectory (720 timesteps), from which fixations are extracted via post-processing. Consequently, the number of fixations is an emergent property of the generated trajectory, allowing it to vary naturally for each sample, even for the same image. This behavior more closely mirrors that of human observers, whose scanpath lengths are not predetermined.
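For reference, fixation extraction from a continuous trajectory can be done with a simple velocity-threshold detector. The sketch below is illustrative only; the threshold, sampling rate, and minimum duration are placeholder values, and the paper's exact post-processing may differ.

```python
import numpy as np

def extract_fixations(traj, velocity_thresh=50.0, fs=240.0, min_duration=0.05):
    """Detect fixations in a (T, 2) gaze trajectory (pixels) via a velocity threshold."""
    speed = np.linalg.norm(np.diff(traj, axis=0), axis=1) * fs    # pixels per second
    is_fix = np.concatenate([[True], speed < velocity_thresh])    # slow samples belong to fixations
    fixations, start = [], None
    for t, fixating in enumerate(is_fix):
        if fixating and start is None:
            start = t                                             # fixation begins
        elif not fixating and start is not None:
            if (t - start) / fs >= min_duration:                  # keep only sufficiently long fixations
                x, y = traj[start:t].mean(axis=0)
                fixations.append((x, y, (t - start) / fs))
            start = None
    if start is not None and (len(traj) - start) / fs >= min_duration:
        x, y = traj[start:].mean(axis=0)
        fixations.append((x, y, (len(traj) - start) / fs))
    return fixations                                              # list of (x, y, duration_s)
```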
Number of Fixations
| Model | mean (MIT1003) | mean (OSIE) |
|---|---|---|
| Ground Truth | 8.40 | 9.40 |
| Ours | 8.48 | 8.08 |
| HAT | 12.04 | 12.37 |
| Gazeformer | 3.98 | 4.09 |
| Chen et al. | 10.04 | 12.00 |
These results show the average number of fixations in the generated and ground-truth scanpaths for both datasets. As the results show, our model's ability to generate variable-length scanpaths allows it to better match the statistical properties of human gaze across different datasets. For the MIT1003 dataset, the average number of fixations in the ground truth is 8.40, while our model produces scanpaths with 8.48 fixations. Similarly, for the OSIE dataset, our model generates 8.08 fixations per scanpath, approximating the ground truth of 9.40. This consistent alignment with human data demonstrates a clear advantage over models that cannot produce variable-length scanpaths.
Question 3: Predictions seem too concentrated.
We thank the reviewer for this keen qualitative observation. It is true that in some instances, as shown in Figure 3, the model's generated scanpaths can be more concentrated than the ground truth distribution.
This observation touches on the known challenge of evaluating generative models, where no single metric tells the whole story. The metrics we used—Levenshtein Distance, Discrete Fréchet Distance, DTW, and TDE—are standard in the literature and capture different aspects of sequence and temporal similarity. While our model performs very well on these metrics on average, they may not perfectly capture the "overall spread" or diversity of fixations across all salient objects.
This is precisely why we included extensive qualitative results, to provide a more holistic view of the model's performance. The reviewer's point is well-taken, and developing more robust metrics that specifically evaluate the spatial distribution and spread of generated fixations against the ground truth is a valuable avenue for future work.
We apologize for any ambiguity in our initial framing. Our primary motivation was not to model specific individuals, but rather to capture the inherent variability observed across different human subjects when viewing the same image, and we will clarify this in the final version of the paper. Traditional deterministic models predict a single "average" scanpath, which fails to represent the stochastic nature of scanpaths. Our generative approach, by training on raw trajectories from multiple subjects, learns a distribution of plausible gaze behaviors for a given image. The goal is to generate samples that reflect the diversity seen in the human population, not to replicate a particular person's gaze pattern.
Unfortunately this response makes me question the authors' appropriate understanding of this literature. First, "individual differences" has a specific meaning, which is exactly conditioning on subject-specific information. That the authors conflate this with stochastic gaze behavior is disappointing. Second, the writing in the reply implies that this model is the first to represent the stochastic nature of scanpaths. This is not true: for example, the DeepGaze family of models (e.g. [1], [2]) are explicitly probabilistic, which means that they capture stochasticity in a principled way. Please clarify this appropriately in any revision.
I think this paper would be stronger had the authors been able to include at least one linear trajectory control condition to answer the "why" question. But I understand that this might not have been feasible in the remaining time.
Given the points above, I would keep my score as is. However, I find the authors' responses to the other reviewers to be quite convincing, especially the inclusion of gaze statistics and the new sequence scores in response to 6vm6. (Note however that I suspect there might be something off with the distribution summaries for the direction and angle, which are circular variables). Therefore, I raise my score to 5.
Finally, I'd like to point out regarding questions and responses to some of the other reviewers, that in general, video-based eyetracking data will not have sufficient spatiotemporal resolution to reliably measure microsaccades. Microsaccades are typically defined as rapid eye movements of less than 30 arcmin (0.5 deg; see e.g. [3]). Typically, even the top-range research eyetrackers don't have accuracies reliably < 1 deg. So in my view, none of the published free viewing datasets contain sufficiently sampled data to allow learning of microsaccades, much less other fixational eye movement properties such as ocular drift.
[1] Kümmerer, M., Bethge, M., & Wallis, T. S. A. (2022). DeepGaze III: Modeling free-viewing human scanpaths with deep learning.
[2] Kümmerer, M., Wallis, T. S. A., Gatys, L. A., & Bethge, M. (2017, October). Understanding Low- and High-Level Contributions to Fixation Prediction.
[3] Rucci, M., & Poletti, M. (2015). Control and Functions of Fixational Eye Movements.
Thank you for your detailed feedback and for raising your score. We appreciate you pointing out the discussion regarding "individual differences" and the context of our model's stochasticity. We will carefully revise the manuscript to correct our terminology and properly situate our work with respect to prior probabilistic models like DeepGaze.
The work proposes DiffEye, a model that generates fixation trajectories conditioned on the regions inducing the fixations. These trajectories are learned from MIT1003 [18], a dataset that provides the stimuli, the images used for tracking observers' gaze, and the vectors of fixations at (x, y) locations on each image for fifteen observers.
The model proposed is based, according to the paper's claims, on DDPM [56], a diffusion model that progressively adds noise to data and trains a U-Net for the reverse process. The work introduces a corresponding positional embedding (CPE) to associate fixation trajectory tokens with image patch tokens.
Comparisons are made with state-of-the-art representatives by testing the competitor methods on the MIT1003 selected test set and the OSIE test set.
Four metrics have been chosen from [68]. Based on the chosen metrics and tests run on the competitors' algorithms and code, the results show a significant improvement in scanpath generation by the DiffEye model.
Strengths and Weaknesses
Strength:
-
The primary objective of the paper is to model eye movements with a temporal structure and learning to generate these trajectories conditioned on image semantics. This idea is of great interest, despite not being new in the field of eye movement studies.
-
The novelty of the paper lies in highlighting that the dynamics of eye trajectories can provide a better structure for interpreting saccades and what drives them, making it possible to understand priority relations in image semantics.
-
Using a diffusion-based model ensures that the trajectory dynamics are based on the Markovian relation between two successive steps. Also, it allows the introduction of a distribution over trajectories, enabling the comparison of different observers' behaviour.
-
The paper makes an effort to provide qualitative images of the trajectories of all the observers, as learned from the MIT 1003 fifteen observers. The paper discusses an ablation of the constituent components of the model, focusing on the component that connects region semantics with the locus attended by the gaze in the scan path.
-
Two metrics, namely Dynamic Time Warping and Time Delay Embedding, suggested in [68], highlight the temporal relevance of the predictions. The obtained results, compared with a representative set of competitors and under four metrics, are promising.
Weaknesses
Despite the interest in the idea, the model, and the observations made by the authors, as well as the results exhibited, the article struggles to achieve its objectives.
-
The bias reported by the authors of MIT 1003 [18], which is very significant and strongly reduces the possibility of learning general trajectories, is never mentioned.
-
In MIT 1003, we have the locations of fixations. However, one can deduce the saccades (in principle, also the microsaccades) as arc length, as they are the relevant characteristic of the temporal structure of gaze trajectories. For example, in Figure 3 of DeepGazeIII [19], a statistical analysis of saccades and fixations is made, which is missing here. Therefore, the concept of temporal structure is not demonstrated, and it is unclear how it could emerge.
-
The model is based on DDPM [56], utilising the U-Net and the Transformer's sinusoidal position embedding [58]. The denoising objective, however, lacks several aspects that are not discussed and left without rationale: how is x_t distributed? What happens at t = 0? Which function is ε_θ approximating? How is the forward variance set?
-
The section on stimulus conditioning is too vague; several lines are used to explain why Dinov2 is not exploited, and it is ultimately concluded that FeatUp is chosen. However, for example, the trajectory passed to the U-Net is not described in terms of sampling. The loss function does not appear to involve the conditioning or the CPE; therefore, it is challenging to understand how the U-Net, whose role in [56] is to learn to remove the scheduled noise, learns to associate feature patches with trajectory steps here. How is CPE included in backpropagation? Note also that the forward noise implies sampling, and how the input trajectories are sampled is particularly difficult, as it needs to avoid losing essential information. Information on this aspect would make the use of diffusion more credible.
-
In the paragraph Data Preprocessing, it is written (line 222) that trajectories are downsampled. Like image resizing, downsampling affects the significance of the input trajectories, especially given the bias at the origin and the small number of samples.
-
The reviewer appreciates the effort of providing qualitative images of the gaze trajectories. However, it would have made sense to compare single observers to understand if at least one of the fifteen predicted trajectories is similar to one of the fifteen ground-truth trajectories. Unfortunately, the images are not very explanatory.
-
Concerning the experiments, note that most of the referred works do not use the chosen metrics; the advantage of the chosen metrics against the area under the ROC curve is not explained.
Questions
-
In the implementation paragraph, it is written "We set the number of diffusion steps to Tdiff = 1000 during training and reduce it to 50 during sampling for improved efficiency." How are the diffusion steps applied to the input trajectories?
-
What is backpropagated through F^CPE and R^CPE?
-
What kind of temporal information is effectively obtained? Taking the derivative of the fixation locations reveals significant jumps, indicating saccades. How do you use this information? If the temporal aspect is considered only for diffusion, but there is no effect on the prediction, why should it be useful?
-
Have you made some statistics on the MIT1003 trajectories?
-
It is unclear whether you associate class tokens or feature maps with the trajectories. Furthermore, the resizing operations performed by convolution and interpolation are not described, and some examples would be helpful.
-
With Dinov2 and FeatureUp, a specific number of feature maps is obtained. How is this number reduced in the process? Which loss affects the reduction process?
Limitations
The authors have addressed the limitations, though they appealed mainly to the lack of eye trajectory datasets.
Final Justification
-
I have read the author's rebuttal and discussed with the authors the aspect that I considered unclear.
-
The authors have made a strong effort to reply to all my concerns. However, as I have written in my final considerations, their answers, especially on the learning aspects, were contradictory.
-
The authors have used a single small biased dataset for training and a minimal dataset for testing.
-
I understand that the other reviewers were particularly interested in the fact that the authors declared they can elicit special information on eye motion from the scanpath dataset. Also, one reviewer considered a plus that the scan paths are given as a set.
-
I appreciated the insight of the other reviewers, but I remain firm on my criticisms.
Formatting Concerns
There are no concerns about formatting.
We sincerely thank the reviewer for their feedback. Below, we address each of the weaknesses and questions.
W1
We acknowledge that all datasets contain biases, including the well-known center bias in many visual attention datasets. However, our model is designed to mitigate this specific issue. Because DiffEye is conditioned on patch-level image features, it learns to associate gaze with specific semantic content within the image, rather than simply memorizing a spatial prior like a center bias.
Our strong performance (see Table 3 in the original paper) on the entirely unseen OSIE dataset serves as direct evidence of this capability. This result demonstrates that our model can generalize beyond the specific characteristics and any potential biases of the MIT1003 training set. We will add a discussion to the limitations section.
W2 & Q3 & Q4
Our core hypothesis is that while fixations are where visual information is primarily processed, the full trajectory—which includes saccade velocity, curvature, and microsaccades—contains a richer, continuous spatio-temporal signal. Modeling this complete signal allows for a more faithful representation of the underlying dynamics of visual exploration, which might be useful for certain studies, and there is no cost for this flexibility as we can always extract the corresponding scanpath if needed.
Diffusion models are well-suited for modeling such continuous, high-dimensional data. The smooth nature of raw trajectories is a better representational match for these models than discrete, variable-length scanpaths. This allows DiffEye to learn the nuanced patterns of human eye movements, which is reflected in our strong performance on metrics that explicitly evaluate temporal sequence similarity, like DTW and TDE. We have extracted additional statistics from the eye tracking trajectories based on the reviewer’s suggestion. While we can’t include a figure in the rebuttal, please find the statistics in the tables below:
Saccade Amplitude (x10² pixels)
| Model | mean | std | 25% | median | 75% | skew |
|---|---|---|---|---|---|---|
| Ground Truth | 1.90 | 1.32 | 0.88 | 1.55 | 2.58 | 1.27 |
| Ours | 1.76 | 1.26 | 0.79 | 1.40 | 2.44 | 1.30 |
| HAT | 2.78 | 1.65 | 1.49 | 2.45 | 3.74 | 0.96 |
| Gazeformer | 0.99 | 0.82 | 0.35 | 0.77 | 1.41 | 1.36 |
| DeepGazeIII | 1.89 | 1.37 | 0.84 | 1.51 | 2.62 | 1.23 |
| ROI | 2.54 | 0.92 | 1.95 | 2.42 | 2.94 | 1.51 |
| Chen et al. | 1.49 | 1.84 | 0.26 | 0.57 | 2.34 | 1.64 |
Saccade Direction (degrees)
| Model | mean | std | 25% | median | 75% | skew |
|---|---|---|---|---|---|---|
| Ground Truth | 2.5 | 108 | -87.2 | -0.7 | 94.5 | 0.03 |
| Ours | -1.9 | 107 | -93.3 | -1.4 | 88.5 | 0.05 |
| HAT | 0.9 | 107 | -90.5 | 1.1 | 91.1 | -0.01 |
| Gazeformer | 9.0 | 101 | -66.1 | 4.5 | 92.6 | -0.08 |
| DeepGazeIII | -0.02 | 108 | -87.8 | -1.4 | 90.7 | 0.03 |
| ROI | 10.8 | 113 | -47.2 | 1.9 | 111 | -0.09 |
| Chen et al. | 13.2 | 97.3 | -59.0 | 0.0 | 90.0 | 0.06 |
Inter-Saccade Angle (degrees)
| Model | mean | std | 25% | median | 75% | skew |
|---|---|---|---|---|---|---|
| Ground Truth | 95.9 | 61.0 | 35.2 | 102 | 157 | -0.13 |
| Ours | 93.7 | 59.7 | 35.8 | 95.6 | 153 | -0.06 |
| HAT | 107 | 51.6 | 66.2 | 116 | 152 | -0.44 |
| Gazeformer | 48.0 | 42.3 | 15.6 | 33.6 | 71.9 | 1.11 |
| DeepGazeIII | 101 | 57.7 | 47.3 | 109 | 156 | -0.24 |
| ROI | 111 | 65.1 | 42.9 | 142 | 170 | -0.50 |
| Chen et al. | 119 | 54.9 | 82.9 | 135 | 169 | -0.67 |
Number of Fixations
| Model | mean (MIT1003) | mean (OSIE) |
|---|---|---|
| Ground Truth | 8.40 | 9.40 |
| Ours | 8.48 | 8.08 |
| HAT | 12.04 | 12.37 |
| Gazeformer | 3.98 | 4.09 |
| Chen et al. | 10.04 | 12.00 |
While we cannot include the statistics for the OSIE dataset here due to the character limit, nor upload a figure, we will include them in the final version. This analysis demonstrates that our approach generates scanpaths whose statistical properties closely mirror the ground truth.
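For transparency, the statistics above can be computed from fixation sequences roughly as follows. This is an illustrative sketch, not the exact analysis code; note that direction and inter-saccade angle are circular variables, so their means and quantiles should be interpreted with care.

```python
import numpy as np

def saccade_statistics(scanpath):
    """scanpath: (N, 2) array of fixation centers in pixels, in temporal order."""
    vecs = np.diff(scanpath, axis=0)                              # saccade vectors between fixations
    amplitude = np.linalg.norm(vecs, axis=1)                      # saccade amplitude (pixels)
    direction = np.degrees(np.arctan2(vecs[:, 1], vecs[:, 0]))    # saccade direction in (-180, 180]
    # Inter-saccade (turning) angle between consecutive saccades, in [0, 180] degrees.
    dots = np.sum(vecs[:-1] * vecs[1:], axis=1)
    norms = np.linalg.norm(vecs[:-1], axis=1) * np.linalg.norm(vecs[1:], axis=1) + 1e-8
    angle = np.degrees(np.arccos(np.clip(dots / norms, -1.0, 1.0)))
    return amplitude, direction, angle
```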
W3 & W4 & Q1 & Q2 & Q5 & Q6
We will significantly expand our methods section to provide the following clarifications.
-
Diffusion Process Details:
- Forward Process: For each training sample, a diffusion timestep t is uniformly sampled from {1, ..., T_diff}, where T_diff = 1000. At t = 0, we have the original, noise-free trajectory x_0. Noise is added to x_0 according to a pre-defined linear noise schedule with variance β_t increasing from β_1 to β_{T_diff}, producing the noised trajectory x_t.
- Reverse Process: The U-Net, ε_θ, is trained to predict the noise ε that was added to the original trajectory x_0 to obtain x_t. For inference, standard diffusion models (DDPMs) can be very slow, as they require reversing the process through all 1000 steps. To improve efficiency, we use a DDIM [66] scheduler. DDIM introduces a more flexible, non-Markovian diffusion process that allows the model to take larger "jumps" during the reverse denoising phase. Instead of taking 1000 small steps, DDIM enables high-quality sample generation in significantly fewer steps. This allows us to approximate the reverse process in just 50 steps at inference time, greatly speeding up generation without a major compromise in quality.
-
Stimulus Conditioning and CPE Details:
- Conditioning Mechanism: We use feature maps, not class tokens, for conditioning. FeatUp provides a high-resolution feature map for a given image. We interpolate this map to a 32 × 32 grid, resulting in 1024 patch tokens. These patch tokens serve as the key and value inputs to cross-attention layers located within each block of our U-Net. The trajectory tokens (augmented with positional information) serve as the query.
- Backpropagation and CPE: The loss function is the L2-norm between the ground-truth noise ε and the U-Net's predicted noise ε̂. Our proposed CPE module is a simple, parameter-free addition operation. Because it is just an addition, gradients flow directly through it to the layers that precede it. This end-to-end process forces the U-Net to learn the correspondence between the visual content (from the patch features) and the gaze location (from the trajectory) to successfully predict and remove the noise. The backpropagated gradients update all trainable parameters, including the U-Net weights, the cross-attention modules, and the initial linear layers that project the raw inputs into the shared embedding dimension. A minimal sketch of this training step is given below.
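To make the flow above concrete, here is a minimal PyTorch sketch of one training step. The layer sizes, the single cross-attention block standing in for the full U-Net, the omitted timestep embedding, the noise-schedule endpoints, and the assumption that patch features are already projected to the model dimension are all illustrative; only the overall structure (sample t, noise x_0 into x_t, cross-attention with CPE-stamped queries and patch keys/values, L2 noise loss) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T_DIFF, DIM = 1000, 256
betas = torch.linspace(1e-4, 0.02, T_DIFF)            # linear schedule (endpoints are illustrative)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class ConditionedDenoiser(nn.Module):
    """Single cross-attention block standing in for the U-Net (timestep embedding omitted)."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.proj_traj = nn.Linear(2, dim)             # trainable projection of (x, y) gaze points
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, 2)                   # per-timestep noise prediction

    def forward(self, x_t, patch_feats, traj_pe, patch_pe):
        q = self.proj_traj(x_t) + traj_pe              # CPE added to the trajectory queries
        kv = patch_feats + patch_pe                    # same positional "stamp" on the image patches
        h, _ = self.attn(q, kv, kv)                    # queries attend to visual content
        return self.out(h)                             # predicted noise, shape (B, T, 2)

def training_step(model, x0, patch_feats, traj_pe, patch_pe):
    B = x0.shape[0]
    t = torch.randint(0, T_DIFF, (B,))                 # uniformly sampled diffusion timestep
    a = alpha_bar[t].view(B, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps       # forward (noising) process
    eps_hat = model(x_t, patch_feats, traj_pe, patch_pe)
    return F.mse_loss(eps_hat, eps)                    # L2 noise-prediction loss
```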
W5
We will clarify this phrasing. The vast majority of trajectories in the dataset were shorter than our target length and were padded. A small fraction of sequences that were slightly longer than 720 timesteps were minimally downsampled to fit the fixed-length requirement of the model architecture. Most importantly, the core temporal structure was preserved. We will make it clear that most trajectories were truncated or padded, and only a few were lightly downsampled, with a negligible effect on the signal.
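For illustration, the preprocessing described here could look roughly like the following; the padding value and resampling scheme are assumptions, not the paper's exact procedure.

```python
import numpy as np

def to_fixed_length(traj, target_len=720):
    """traj: (T, 2) raw gaze samples -> (target_len, 2) fixed-length trajectory."""
    T = len(traj)
    if T > target_len:                                      # light, uniform downsampling
        idx = np.linspace(0, T - 1, target_len).astype(int)
        return traj[idx]
    pad = np.repeat(traj[-1:], target_len - T, axis=0)      # pad by repeating the last sample (assumption)
    return np.concatenate([traj, pad], axis=0)
```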
W6
Our intent in the main paper (e.g., Figure 3) was to illustrate that the entire set of our generated trajectories successfully captures the overall distribution of the ground-truth trajectories from all 15 observers. This demonstrates our model’s ability to learn population-level variability, not just replicate a single average path.
However, your point about showing a direct one-to-one similarity is crucial. Quantitatively, the "Best" score in Tables 2 and 3 is designed to capture exactly this: it reflects the average distance between each ground-truth scanpath and its single closest match among all generated scanpaths. To make this more intuitive, we will add a new qualitative figure to the appendix showing these one-to-one "best match" comparisons to visually demonstrate our model's ability to replicate specific human gaze patterns.
W7
The metrics were chosen to fit our primary task of generating dynamic sequences, which differs fundamentally from static saliency prediction. Metrics like the AUC are excellent for evaluating static saliency maps, but they are agnostic to temporal order, path shape, and the dynamics of how a scene is explored. They cannot distinguish between two different scanpaths that happen to cover the same salient regions in a different order.
In contrast, metrics like DTW and Levenshtein Distance are specifically designed to measure the similarity between two entire sequences.
- DTW is particularly valuable as it aligns sequences that may vary in speed, capturing similarity in the overall shape and flow of a path.
- Levenshtein Distance measures the "edit distance" between two fixation sequences, providing a robust measure of similarity in both the spatial locations and the order of fixations.
By using these sequence-based metrics, we perform a "distribution-wise" evaluation. A low average score signifies that the distribution of paths produced by our model is structurally and temporally similar to the human distribution. While we evaluate the secondary task of saliency prediction using standard metrics like AUC (provided in the supplementary material), our primary contribution is best assessed by these sequence-aware metrics.
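As a concrete reference, a compact DTW implementation over two fixation sequences is sketched below; this is illustrative, and the evaluation in the paper may rely on an existing library implementation.

```python
import numpy as np

def dtw(path_a, path_b):
    """Dynamic time warping distance between two scanpaths given as lists of (x, y) fixations."""
    A, B = np.asarray(path_a, dtype=float), np.asarray(path_b, dtype=float)
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])                  # Euclidean point-to-point cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two scanpaths covering the same locations at different "speeds" still align closely:
# dtw([(0, 0), (1, 1), (2, 2)], [(0, 0), (0, 0), (1, 1), (2, 2)]) -> 0.0
```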
We provide results on Sequence Score (SS) and Semantic Sequence Score (SemSS) on MIT1003, which further validate our choice of sequence-based evaluation, demonstrating our model's superior ability to capture the spatio-temporal dynamics of human gaze.
| Model | SS (↑) | Sem SS (↑) |
|---|---|---|
| Ours | 0.4782 | 0.6611 |
| HAT | 0.4079 | 0.5794 |
| Gazeformer | 0.3531 | 0.4522 |
| DeepGazeIII | 0.4440 | 0.6604 |
| ROI | 0.4506 | 0.6603 |
| Chen et al. | 0.4237 | 0.6397 |
Thank you for your extended answer. However:
-
W1. The authors answer "Because DiffEye is conditioned on patch-level image features, it learns to associate gaze with specific semantic content within the image, rather than simply memorizing a spatial prior like a center bias". But as written in the paper "Additionally, to improve spatial precision, we replace DINOv2 patch embeddings with those from FeatUp, a model-agnostic framework that restores fine-grained spatial details in deep features". As you say, FeatUp is model agnostic, so there is no semantic reference as with Dinov2 class tokens. Moreover, semantic content is only referenced with respect to Dinov2 (which, as said, you do not use) and in the limitations paragraph. So, the reviewer does not see how the bias is faced, and the answer does not address this point.
-
W2 & Q3 & Q4 Thank you. The results partially show that temporal information on saccades can be obtained, even from the fixations, but not how it is obtained from the model.
-
W3 & W4 & Q1 & Q2 & Q5 & Q6 Here, again, there is something not convincing because FeatUp is agnostic. In particular, the authors write, "Our proposed CPE module is a simple, parameter-free addition operation. Because it is just an addition, gradients flow directly through it to the layers that precede it." Namely, there are no parameters to be learned, so it does not learn to associate gaze with semantic content; therefore, the conditioning is not effective.
-
Finally, I'm sorry that no images can be uploaded to see the comparison between at least one GT scan-path and a predicted one. I really thank the authors for their effort, but given the answer, I cannot improve my scores.
Dear Reviewer 6vm6,
Thank you very much for your detailed follow-up and for giving us the opportunity to clarify these crucial points.
1. On Semantics, FeatUp, and Bias (W1)
We would like to note that the semantic content FeatUp upsamples comes directly from the DINOv2 patch embeddings.
Here is the pipeline:
- We start with DINOv2, which creates semantically rich feature embeddings for each 14x14 image patch. These embeddings capture what is in that patch (e.g., part of a face, a piece of text, a texture).
- The problem is that these patches are low-resolution. A single embedding covers a relatively large area.
- We use FeatUp on these DINOv2 embeddings. FeatUp's job is not to create new semantics, but to intelligently upsample the existing DINOv2 features to a higher spatial resolution. It restores the fine-grained detail, allowing us to have semantically rich features at a more precise, granular level (e.g., 4x4 pixels instead of 16x16).
- So, our model is indeed conditioned on high-resolution semantic features. The model learns to associate gaze points with specific local content (e.g., "look at the patch corresponding to the eye") regardless of where that content appears in the image. We will make this process explicit in the final paper.
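The tensor shapes involved can be sketched as follows. Loading DINOv2 through torch.hub is assumed to be available, and FeatUp's learned upsampler is replaced by plain bilinear interpolation purely to illustrate the shapes; this stand-in is not equivalent to FeatUp itself.

```python
import torch
import torch.nn.functional as F

# DINOv2 ViT-B/14 patch features for a 224x224 image: 16x16 = 256 patches, 768-dim each.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feats = dinov2.forward_features(img)
patch_tokens = feats["x_norm_patchtokens"]                    # (1, 256, 768): what is in each patch
grid = patch_tokens.transpose(1, 2).reshape(1, 768, 16, 16)   # spatial grid of semantic features

# FeatUp stand-in: raise the spatial resolution of the *same* semantic features.
hires = F.interpolate(grid, size=(32, 32), mode="bilinear", align_corners=False)
cond_tokens = hires.flatten(2).transpose(1, 2)                # (1, 1024, 768) conditioning tokens
```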
2. On How Temporal Information is Learned (W2 & Q3 & Q4)
Our model learns these dynamics because it operates on the raw, continuous spatio-temporal gaze signal, not just a sequence of discrete fixation coordinates.
The diffusion process is trained to denoise the entire trajectory frame-by-frame. A saccade in the raw data is represented as a rapid change in (x, y) coordinates over a few timesteps. By learning to reverse the noise addition process on this continuous signal, the U-Net implicitly learns the underlying distribution of valid "moves." It learns that slow, drifting movements (fixations) are often followed by very rapid, ballistic movements (saccades) of a certain velocity and direction.
The model is not explicitly told "this is a saccade," but by training on thousands of real human trajectories, it learns that these patterns are a fundamental part of the data distribution it must replicate. The statistics we showed are an emergent property of the model successfully learning this underlying continuous dynamic process. We will add a paragraph to the methods section to clarify that we model the raw signal, which is how these dynamic properties are captured.
3. On Conditioning and Learning via CPE
Thank you for pushing for this clarification. The purpose of CPE is not to learn, but to provide a sense of order, which is essential for models that process sequence data in parallel. CPE stamps each trajectory point with its unique position in the sequence.
The actual learning of the gaze-to-semantics association happens in the trainable components that use this ordered information. The process is as follows: Each trajectory point is first passed through a learnable linear projection layer to create an initial embedding. The CPE vector is then added to this embedding, producing a final, order-aware query. Inside the U-Net, our cross-attention layers take this query and compare it against the semantic image patch features, which act as keys and values.
To successfully denoise the trajectory and minimize the loss function, the model must learn the correspondence between the query (the specific, ordered trajectory point) and the image features. The learning signal, or gradient, flows back to update the weights of the only trainable parts in this pipeline: the initial projection layer and the cross-attention modules.
In summary, CPE provides the structure of sequence order, while the model's trainable layers learn to associate that structure with the image's semantic content. The importance of this structure is confirmed by our empirical results, which show a clear performance improvement when the CPE module is added (refer to the ablation study table, Table 2).
Thank you once again for your diligent review. We will ensure these clarifications, along with the one-to-one qualitative comparisons you requested, are included in the final manuscript to make our contribution clearer.
Dear Authors, I really appreciate your fervour in defending your paper.
In this discussion, I focused on a specific point, namely the strong bias in the dataset MIT1003, also highlighted by the authors of the dataset [18].
You answered that this was not a problem since the model learns to condition on the visual stimuli. However, it turns out that the model does not learn to condition. In the paper it is written "To enhance stimulus conditioning and improve the interaction between trajectories and the image via cross-attention, we propose a novel positional embedding strategy called Corresponding Positional Embedding (CPE)."
Indeed, what came out is that:
- FeatUp upsamples Dinov2 maps, and it is agnostic.
- CPE does not learn; it is essentially preprocessing, so how does it "learn to condition on visual stimuli", as you have written?
- Dinov2 provides unsupervised embeddings for classification. In the paper, it is written "to improve spatial precision, we replace DINOv2 patch embeddings with those from FeatUp". Note that Dinov2 needs fine-tuning to learn to classify the specific dataset you pass. But, as you finally write in the answer, you just upsample the Dinov2 features using FeatUp.
- U-net removes the noise, reversing the diffusion process, as usual.
- No loss is displayed.
How is the inference done?
So the CPE turns out to be just preprocessing, and I cannot see what and where the novel contribution is in learning to condition the scanpath on the visual stimuli.
You just added a preprocessing step to a typical diffusion model using a U-Net. Even in Figure 2, the inference is detached.
To conclude:
- the proposed approach cannot solve the bias in MIT1003,
- the only novelty is a preprocessing step,
- it remains unclear what the model learns and how the scanpath inference is done.
In any case thank you for the discussion.
Dear Reviewer,
Thank you for the rigorous discussion.
Standard attention mechanisms are permutation-invariant, meaning they lack an inherent sense of order. Positional embeddings are crucial as they inject this missing information, allowing the model to understand the sequence or spatial arrangement of its inputs. Our work addresses the unique challenge of aligning two different data modalities: a temporal gaze trajectory and a spatial grid of image features. The novelty of our Corresponding Positional Embedding (CPE) is that it creates a shared spatial language, or a common coordinate system, to bridge this gap. It "stamps" both a gaze point and its corresponding image patch with the same positional signature, enabling a direct alignment between them.
The learning happens within several trainable components. Initially, linear layers project the raw data into a hidden representation. This representation is then enhanced with our Corresponding Positional Embeddings (CPEs) and fed into the U-Net. The U-Net's own convolutional parameters and, most importantly, its integrated cross-attention layers are where the conditioning is learned. These cross-attention modules have their own trainable parameters for generating queries, keys, and values from the CPE-enhanced inputs. In essence, cross-attention takes these CPE-enhanced inputs, where the two modalities are now spatially aligned, and updates its own parameters to correctly model the interaction between the gaze location and visual content. As depicted in Figure 2, the model is trained by minimizing the L2 loss between the ground-truth noise added at a given diffusion timestep and the noise predicted by the U-Net. This denoising objective drives the backpropagation process, updating the parameters to learn the association between a specific gaze location and the visual content at that same location.
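One possible reading of this "shared positional stamp", sketched below purely for illustration (the paper's actual CPE construction may differ): each gaze sample is mapped to the image patch it falls in, and both the corresponding trajectory token and that patch token receive the same positional embedding before cross-attention.

```python
import torch

def sinusoidal_embedding(pos, dim):
    """Standard Transformer-style sinusoidal embedding for integer positions."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / half))
    args = pos.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

def corresponding_positional_embedding(gaze_xy, grid=32, dim=256):
    """gaze_xy: (T, 2) gaze coordinates normalized to [0, 1]; returns matching stamps."""
    col = (gaze_xy[:, 0] * grid).clamp(0, grid - 1).long()
    row = (gaze_xy[:, 1] * grid).clamp(0, grid - 1).long()
    patch_index = row * grid + col                                    # patch each gaze sample falls in
    traj_pe = sinusoidal_embedding(patch_index, dim)                  # stamp for trajectory tokens
    patch_pe = sinusoidal_embedding(torch.arange(grid * grid), dim)   # stamp for every image patch
    return traj_pe, patch_pe   # traj_pe[t] equals patch_pe[patch_index[t]]: the shared signature
```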
Regarding the visual features, we use DINOv2, a powerful, pre-trained feature extractor that provides rich semantic information without needing fine-tuning. DINOv2 operates by dividing an input image into a grid of patches and produces two outputs: a feature embedding for each individual patch and a single global classification token that summarizes the entire image. For our conditioning, we leverage the grid of rich, localized patch-level embeddings rather than the single global token, allowing the model to associate gaze with fine-grained spatial content. These DINOv2 patch features are then upsampled by FeatUp, a framework that increases their spatial resolution and precision for more accurate localization. The impact of these components is empirically validated in our ablation study. As shown in Table 2, performance degrades significantly when patch-level features are replaced with a single global token (w/o Patch-Level Features). The table also shows a drop in performance when FeatUp is removed (w/o FeatUp), confirming that its high-precision features are crucial to our model's success.
The inference process is not detached from the training architecture and directly uses the learned conditioning. Critically, CPE is also applied during inference as it is a core component of the attention mechanism that the model learned to rely on. We will update Figure 2 to make this dependency clearer. The process begins with random noise, which is iteratively denoised by the U-Net. In each step, the U-Net is conditioned on the image features via the CPE-enhanced cross-attention to generate a realistic trajectory.
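The inference loop described here can be sketched as follows (deterministic DDIM with eta = 0, using 50 of the 1000 training timesteps); the model signature matches the earlier training sketch, and the schedule endpoints are again illustrative assumptions.

```python
import torch

@torch.no_grad()
def ddim_sample(model, patch_feats, traj_pe, patch_pe, shape=(1, 720, 2), steps=50, t_diff=1000):
    betas = torch.linspace(1e-4, 0.02, t_diff)                   # illustrative linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    timesteps = torch.linspace(t_diff - 1, 0, steps).long()      # 50 of the 1000 training steps
    x = torch.randn(shape)                                       # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        eps = model(x, patch_feats, traj_pe, patch_pe)           # conditioned noise prediction
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()    # estimate of the clean trajectory
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps  # deterministic DDIM update
    return x                                                     # generated raw gaze trajectory
```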
Therefore, the model mitigates dataset bias by conditioning on local semantic content, not just spatial priors, which is evidenced by its strong generalization to the entirely unseen OSIE dataset. The novelty is the entire DiffEye framework—the first to apply a diffusion model to raw eye-tracking data conditioned on natural images—which includes both the CPE mechanism and the deep cross-attention architecture. In summary, the model learns the conditional probability distribution of human eye-gaze trajectories given a visual stimulus.
Thank you again for your valuable feedback.
I thank the authors for the discussion and their prompt replies. I understand that it is a significant effort and stress.
The discussion has clarified some aspects that further show the weaknesses of the authors' contribution. Namely:
- There is no ground to learn the conditioning.
- The temporal structure is limited to the noise removal, typical of diffusion.
- There is no way to remove or weaken the bias intrinsic in MIT1003.
- The expected probability distribution for each observer is not computed, and only the set of scan paths for all the observers provided in MIT 1003 can be given as a whole.
- The CPE is used only in preprocessing, and the association between patches and fixations is only based on the Dinov2 original pretraining, namely, on the features Dinov2 learned for classification, which cannot be modified by the training since there is no backpropagation through the CPE. Note that the authors claim that there is no need for "fine-tuning".
- The authors' answers are also often contradictory, for example, in the last answer it is written: "Critically, CPE is also applied during inference as it is a core component of the attention mechanism that the model learned to rely on." While in a previous answer they write: "The purpose of CPE is not to learn, but to provide a sense of order, which is essential for models that process sequence data in parallel. CPE stamps each trajectory point with its unique position in the sequence."
- The authors engage only with two small datasets, one of which is strongly biased, use DDPM, pretrained Dinov2, and a preprocessing step, CPE. All in all, it is hard to see an effective contribution.
I have essentially focused on the deep learning modelling rather than the cognitive aspects raised by reviewers 4EVS and A95t, which are very interesting, although we arrive at different conclusions with A95t.
Thanks again.
Dear Reviewer,
We would like to thank you for your rigorous discussion. Please find our explanations for the points you raised below.
There is no ground to learn the conditioning. The temporal structure is limited to the noise removal, typical of diffusion.
Our model learns conditioning through its trainable layers (e.g. cross attention). These layers are optimized to associate image content with specific gaze locations by minimizing the standard diffusion loss, which uses ground-truth raw trajectories from the MIT1003 dataset as the learning signal.
The temporal structure of eye movements is learned implicitly. By training the model to denoise the full, continuous spatio-temporal signal of the raw trajectories, it inherently captures the underlying dynamics of realistic human gaze, including both fixations and saccades.
There is no way to remove or weaken the bias intrinsic in MIT1003.
Our model's strong performance on the entirely unseen OSIE dataset demonstrates that it generalizes beyond the training set's specific biases.
The expected probability distribution for each observer is not computed, and only the set of scan paths for all the observers provided in MIT 1003 can be given as a whole.
The model's goal is to learn the population-level distribution of human gaze, capturing the inherent variability across all subjects. The generated set of trajectories successfully captures the overall distribution of the ground-truth trajectories from all observers.
The CPE is used only in preprocessing, and the association between patches and fixations is only based on the Dinov2 original pretraining, namely, on the features Dinov2 learned for classification, which cannot be modified by the training since there is no backpropagation through the CPE. Note that the authors claim that there is no need for "fine-tuning".
We use DINOv2 as a powerful, non-trainable feature extractor. As a foundational model trained on 1.2B unique images, it provides rich semantic features for in-the-wild images without requiring any fine-tuning. CPE's crucial role is to create a shared coordinate system that aligns these powerful image features with the gaze trajectory data. The actual learning of the association happens in the trainable cross-attention modules within our U-Net. The gradient from the loss function flows through the simple CPE addition operation to update the weights of these modules, teaching them the relationship between visual content and gaze patterns. The importance of this structure is confirmed by our ablation study, which shows that removing CPE significantly degrades performance.
The authors ' answers are also often contradictory, for example, in the last answer it is written: "Critically, CPE is also applied during inference as it is a core component of the attention mechanism that the model learned to rely on. " While in a previous answer they write: "The purpose of CPE is not to learn, but to provide a sense of order, which is essential for models that process sequence data in parallel. CPE stamps each trajectory point with its unique position in the sequence."
These two points are not contradictory but describe CPE's function from two different perspectives: its purpose during training and its role in the final architecture.
-
During training, CPE's purpose is to align image features with gaze trajectories by applying the same positional "stamp" to both. This shared coordinate system is what allows the model's trainable attention parameters to learn the association between visual content and where a person is looking.
-
During inference, because the model was trained to depend on this structure, CPE becomes a core and essential component of the final architecture, and is included for the learned attention mechanism to function correctly.
The authors engage only with two small datasets, one of which is strongly biased, use DDPM, pretrained Dinov2, and a preprocessing step, CPE. All in all, it is hard to see an effective contribution.
Our contribution is the novel DiffEye framework, the first to apply a diffusion model to generate continuous, raw eye-tracking trajectories conditioned on natural images. This framework includes the proposed CPE mechanism and UNet based diffusion modelling. Its effectiveness is demonstrated by its strong generalization results.
Thank you
The paper proposes a diffusion-based model for the novel problem of predicting raw gaze trajectories. The model was trained on a single dataset, MIT1003, as it contains the raw trajectory data. The predictions were converted to scanpaths and saliency maps for direct comparison with existing scanpath and saliency prediction models.
Strengths and Weaknesses
Strengths
S1. The hypothesis that raw trajectories better capture the richness of eye movements/gaze is novel. This is somewhat supported by the experimental results.
S2. The application of diffusion method to visual attention seems novel.
Weaknesses
W1. The claim by the paper that scanpaths are a "compression of trajectories" is flawed. Instead, fixations are critical for the brain to consolidate the information in the foveated vision, as shown by various cognitive studies. These works show that information is processed by the brain during fixations and not saccades (trajectories). As such, it is not clear what is the actual value proposition of raw trajectories prediction only (without fixations/saliency).
Rucci M., Poletti M. (2015). "Control and function of fixational eye movements". Annual Review of Vision Science. 1: 499–518
W2. The comparison of the scanpath and saliency prediction results is limited to 2 datasets. In [20], COCO-FreeView and COCO-Search18 were included. This calls into question the generalization of the proposed method, as COCO-FreeView is a large dataset that contains scanpath data.
W3. The evaluation metrics are non-standard for the visual attention community. There are standard scanpath metrics like Sequence Score (SS) and Semantic Sequence Score (SemSS), and standard saliency prediction metrics like AUC (Area Under the ROC Curve), NSS (Normalized Scanpath Saliency), and Correlation Coefficient (CC). But the paper only uses trajectory-based metrics.
W4. (minor) Paper claims that the current models are unable to capture the diversity of individual viewers. But there is no attempt to model individual differences in their approach. Hence, this point, while true, is irrelevant to their work.
Questions
The inadequate experiments in terms of datasets and use of non-standard metrics should be explained and strongly justified.
The claim that the prediction of the raw trajectories is better than scanpath prediction is not well supported by eye movement studies. The experimental results, while somewhat supportive of this claim, were not comprehensive enough in size (only 2 datasets) and metrics (trajectory-based instead of visual attention metrics) to be convincing.
Limitations
No. The authors did not adequately address the limitations as explained in the previous review sections.
[After rebuttal] The authors have sufficiently addressed my concerns regarding the positioning and motivation of the work, especially as they have now evaluated their work with the standard visual attention metrics.
Final Justification
The authors have sufficiently addressed my concerns regarding the positioning and motivation of the work, especially as they have now evaluated their work with the standard visual attention metrics.
Formatting Concerns
No concern
We sincerely thank the reviewer for their detailed and constructive feedback. We appreciate the acknowledgment of our work's novelty in its hypothesis (S1) and methodology (S2). Below, we address the identified weaknesses and questions by quoting the specific concerns and providing our clarifications.
W1: Value Proposition of Raw Trajectories
"The claim by the paper than scanpath is a 'compression of trajectories' is flawed... it is not clear what is the actual value proposition of raw trajectories prediction only (without fixations/saliency)."
We used "compressed" in the sense that scanpaths, while capturing the most critical information for cognitive processing, are a lower-dimensional summary of the full spatio-temporal eye movement data. In the final version of the paper we will use more appropriate language.
Our motivation for modeling raw trajectories is twofold. First, it provides a more fundamental and flexible representation of the gaze data produced by eye tracking. By generating the foundational raw signal, our approach allows end-users the "flexibility to define and extract fixations using any desired approach". This is beneficial because different eye tracking studies and products might use different approaches to extract the scanpath. Our model provides the underlying data, making it adaptable to any specific post-processing pipeline. From the modeling perspective, our approach avoids the need for arbitrary decisions that have plagued prior scanpath-based methods. For example, DeepGaze III and IOR-ROI require the number of fixations to be specified as a fixed input for scanpath synthesis, and this may not be straightforward to specify for a given stimulus. In contrast, DiffEye generates a continuous trajectory. Consequently, the number of fixations emerges naturally as a property of the generated data. This allows the model to produce scanpaths with an adaptive number of fixations for each image, which better reflects the natural diversity of human viewing patterns. As shown in the table below, the mean number of fixations generated by our model is closer to the original statistics.
Number of Fixations
| Model | mean (MIT1003) | mean (OSIE) |
|---|---|---|
| Ground Truth | 8.40 | 9.40 |
| Ours | 8.48 | 8.08 |
| HAT | 12.04 | 12.37 |
| Gazeformer | 3.98 | 4.09 |
| Chen et al. | 10.04 | 12.00 |
Second, raw trajectories provide a richer signal for generative models. They contain fine-grained spatio-temporal information, like microsaccades and saccadic curvature, which is lost during scanpath extraction. As noted in our paper, raw trajectories in the MIT1003 dataset average 723.7 timesteps, while scanpaths average only 8.4, indicating a "substantial reduction in spatio-temporal information". Diffusion models have proven highly effective at learning complex, high-dimensional data distributions, making them a natural and powerful choice for modeling the intricate patterns found in raw eye-tracking data. By learning from this richer data space, our model can produce more realistic and nuanced gaze behaviors, which is supported by our superior performance on trajectory-based metrics.
W2 & W3: Datasets and Evaluation Metrics
"The comparison of the scanpath and saliency prediction results are limited to 2 datasets... The calls into the question of the generalization of the proposed method..." "The evaluation metrics are non-standard for visual attention community... But the paper only uses trajectories-based metrics." We understand the reviewer's concerns regarding the scope of our evaluation. Our choices were guided by our core hypothesis and significant practical data constraints.
On Datasets: Our method's primary requirement is the availability of raw eye-movement trajectory data, not just scanpaths. The MIT1003 dataset is "the only publicly available dataset that provides raw eye-tracking data for natural images obtained during a free-viewing task". While datasets like COCO-FreeView are larger, they only contain scanpath data and thus cannot be used to train our model, which operates on the raw signal. Our purpose in this work is not to claim state-of-the-art scanpath prediction across all datasets, but to introduce and validate a novel methodology for generating continuous eye-tracking data, a task no prior work has addressed for natural images. We demonstrated strong generalization by evaluating on "the entirely unseen OSIE dataset", where our model performed competitively without any fine-tuning. We will modify the final version of the paper to clarify our claim for performance.
On Evaluation Metrics For Scanpaths: We chose trajectory-based metrics (Levenshtein, DTW, DFD) because our model is inherently generative, and these metrics excel at comparing entire sequences, capturing both spatial and temporal structure. This aligns with our goal of evaluating the model's ability to replicate the distribution of human gaze behavior. However, based on the reviewer's comment, we have additionally computed the Sequence Score (SS) and Semantic Sequence Score (SemSS) metrics. Please find the SS and SemSS results below:
MIT1003 Results
| Model | SS (↑) | SemSS (↑) |
|---|---|---|
| Ours | 0.4782 | 0.6611 |
| HAT | 0.4079 | 0.5794 |
| Gazeformer | 0.3531 | 0.4522 |
| DeepGazeIII | 0.4440 | 0.6604 |
| ROI | 0.4506 | 0.6603 |
| Chen et al. | 0.4237 | 0.6397 |
OSIE Results
| Model | SS (↑) | SemSS (↑) |
|---|---|---|
| Ours | 0.4371 | 0.5837 |
| HAT | 0.4002 | 0.5791 |
| Gazeformer | 0.2713 | 0.3602 |
| DeepGazeIII | 0.4623 | 0.6459 |
| ROI | 0.4404 | 0.6110 |
| Chen et al. | 0.4333 | 0.5711 |
Our model achieves state-of-the-art performance on the MIT1003 dataset, excelling in both Sequence Score and Semantic Sequence Score. Furthermore, the model generalizes effectively by remaining competitive on the unseen OSIE dataset, validating that the rich signal from raw trajectories produces high-quality scanpaths. This strong generalization stands in sharp contrast to models like ROI, which, having been trained on the OSIE dataset, performs well there but fails to produce competitive results on the MIT dataset. Notably, our model’s robust performance was achieved with significantly less data than competing models. While DeepGazeIII was trained on approximately 600,000 scanpaths from 11,000 images, our model used only 8,900 trajectories from 1,000 images. We attribute any minor performance differences to this vast disparity in data scale, demonstrating that by using rich trajectory information, our method can achieve competitive, and even superior, results with less than 10% of the training images.
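For reference, the trajectory-based metrics mentioned above are standard sequence-alignment measures. The sketch below shows a textbook dynamic time warping (DTW) distance between two 2D gaze sequences, assuming Euclidean point costs; it illustrates the metric, not necessarily the exact implementation behind the reported numbers.

```python
import numpy as np

def dtw_distance(a, b):
    """Textbook DTW distance between two gaze sequences a: (N, 2), b: (M, 2)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # Euclidean distance between points
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```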
On Evaluation Metrics For Saliency: We would like to clarify that we did perform a full evaluation using standard saliency metrics (AUC-Judd, AUC-Borji, NSS, SIM, CC, KL). These results are presented in Table 4 in the supplementary material.
W4: Modeling Diversity
"(minor) Paper claims that the current models are unable to capture the diversity of individual viewers. But there is no attempt to model individual differences in their approach. Hence, this point, while true, is irrelevant to their work."
We thank the reviewer for this clarifying comment.
Our point was simply that stochastic generative models are advantageous relative to deterministic synthesis approaches that produce a single, "average" scanpath for a given image, which fails to capture the natural diversity in gaze trajectories arising when different subjects attend to different regions.
Our diffusion-based approach inherently addresses the stochasticity of eye-tracking data across subjects. By learning the entire distribution of trajectories from the population data, our model can be sampled to generate a diverse set of plausible outputs for a single input image. Each generated sample is a trajectory that could have been produced by a human viewer, thereby capturing the variability observed across the population. We will revise the phrasing in the abstract and introduction to make this important distinction clear, and we will mention the characterization of individual differences as a topic for future work.
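To illustrate how this diversity arises at inference time, here is a schematic of standard DDPM ancestral sampling that draws several trajectories for the same image. The `model` interface, noise schedule, and trajectory shape are hypothetical placeholders, not our actual implementation.

```python
import torch

@torch.no_grad()
def sample_trajectories(model, image_cond, num_samples=5, T=1000, traj_len=723):
    """Draw several gaze trajectories for one image via DDPM ancestral sampling."""
    betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    samples = []
    for _ in range(num_samples):
        x = torch.randn(1, traj_len, 2)            # start from pure noise over (x, y) per timestep
        for t in reversed(range(T)):
            eps = model(x, torch.tensor([t]), image_cond)   # predicted noise (hypothetical interface)
            coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
            mean = (x - coef * eps) / torch.sqrt(alphas[t])
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[t]) * noise
        samples.append(x)
    return torch.cat(samples, dim=0)               # (num_samples, traj_len, 2) plausible trajectories
```

Because each sample starts from independent noise, repeated sampling for the same image yields a population of distinct but plausible viewing trajectories.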
We would like to thank the reviewer for their insightful review. We have aimed to address your points in our rebuttal and trust this resolves your concerns. We would be happy to provide any further clarification you may need.
The authors devise a system based on diffusion, specifically the denoising diffusion probabilistic model (DDPM), that takes an image as input and generates eye movement trajectories. The system is trained end-to-end on the raw eye-tracking data from the MIT1003 dataset. The authors introduce several new architectural and algorithmic components: image patch features are computed, and trajectories are conditioned on partial past trajectories. Importantly, the authors report that the embedding of the gaze positions within a trajectory has to be aligned with the features of the image patch to achieve the observed performance in prediction. As the system generates raw eye movement trajectories, scanpaths and saliency maps can be derived from them, and the authors show improved performance on several benchmarks.
Strengths and Weaknesses
The manuscript is clearly written, and the figures very clearly help in evaluating the system’s architecture and simulation results.
It is a good idea to train models on the eye movement trajectories instead of the scanpaths or empirical saliency maps. The empirical evaluations in this manuscript that generate scanpaths and saliency maps from the generated trajectories support this conclusion.
It is also a good idea to use a probabilistic model as one of the key observations in human eye movement sequences has been that they are highly variable.
I would be critical about the claim that the system helps “to better understand how the complex human attention system operates on images”. This is one big data fitting exercise. Scientifically, we do not understand the human attention system better. This is also not relevant here, because there will be ample interest in this system simply to be able to better predict human gaze.
Questions
Can the authors give the reader an intuitive explanation for the necessity of the CPE alignment beyond the statement: "Corresponding Positional Embedding (CPE) To further improve spatial alignment, we introduce a novel positional embedding method, Corresponding Positional Embedding (CPE)"?
Limitations
yes
Final Justification
I have now not only checked the rebuttal from the authors but also read through the answers to the other reviewers' comments. As I said, I think that this is a valuable contribution and therefore, I maintain my overall positive score. Maybe a few comments.
The authors use the term scanpath as it is used in the computer vision community which sometimes differs from the use in the cognitive science & neuroscience communities. The term is used in the CV field to distinguish the succession of fixation locations from the full and raw gaze data.
As for information not being completely shut out during saccades and at least being processed to some degree, see e.g. the work by Martin Rolfs.
Many of the scanpath metrics that are still being used are fundamentally flawed, see e.g. Matthias Kuemmerer and Matthias Bethge. State-of-the-art in human scanpath prediction.
The model indeed does not model individual observers but the distribution across observers. I find this a plus.
In my view, this manuscript contains more scientific content than other manuscripts that seem on a trajectory of being accepted at Neurips, at least in my batch of papers.
Formatting Issues
n/a
Dear Reviewer ftE1,
Thank you for your thoughtful and constructive review of our manuscript. We appreciate your valuable feedback.
We are encouraged that you recognized the core strengths of our work, including:
- The clarity of the manuscript and the effectiveness of the figures.
- The value of our approach in training directly on raw eye-movement trajectories.
- The appropriate use of a probabilistic model to account for the inherent variability in human gaze.
I would be critical about the claim that the system helps “to better understand how the complex human attention system operates on images”. This is one big data fitting exercise. Scientifically, we do not understand the human attention system better.
We agree with this statement, and will remove this sentence from the final paper and focus on positioning our work in the context of human gaze trajectory generation.
Can the authors give the reader an intuitive explanation for the necessity of the CPE alignment beyond the statement: "Corresponding Positional Embedding (CPE) To further improve spatial alignment, we introduce a novel positional embedding method, Corresponding Positional Embedding (CPE)"?
Thank you for this useful question; we provide an in-depth explanation below and will add this discussion to the final version of the paper. The core challenge in generating a realistic eye-tracking trajectory is to teach the model not just what is in an image, but precisely where the eye should look in relation to that content. Our initial attempts to condition the model using a single global image feature vector led to poor generation quality (see Table 2, w/o Patch Level Features), as the global feature lacks the localized spatial semantics necessary for effective conditioning. The model understood the image's global context but could not ground the gaze path in specific local regions. CPE solves this by providing a common spatial "address system" for both the visual input and the gaze trajectory.
- Creating the "Map": We first construct a 2D positional embedding grid, denoted E, which acts as a universal map defined at the original image's resolution (H × W). This grid assigns a unique positional vector, an "address", to every coordinate in the image space.
- Tagging the Trajectory: For each coordinate (x_t, y_t) in the eye-tracking trajectory, we retrieve its corresponding positional address from the map and add it directly to that trajectory point's embedding, so the final augmented trajectory token is z̃_t = z_t + E[x_t, y_t]. The trajectory is now a sequence of coordinates, each tagged with its absolute position on the shared map.
- Tagging the Image Content: Crucially, we use the exact same map for the image's patch-level features F. We interpolate the map to the patch resolution, yielding E', and add these positional addresses to their corresponding image patch features: F̃_{i,j} = F_{i,j} + E'_{i,j}. Now, each piece of visual content is also tagged with its absolute position on the same shared map.
"By sharing and aligning positional information across both trajectories and image patches, CPE enables effective spatial correspondence and interaction between gaze behavior and visual content". Because both the gaze location and the image content now speak the same spatial language, our model's cross-attention mechanism can effectively associate a gaze point with the visual content at that exact location.
The importance of this component is empirically demonstrated in our ablation study (Table 2). When CPE is removed (w/o CPE), the model's performance degrades significantly across all metrics, especially those sensitive to spatial alignment like Levenshtein and Fréchet distances. This confirms that CPE's explicit spatial alignment is a critical contribution that enhances the model's ability to effectively localize attention signals.
We want to thank the reviewer again for their insightful review. We hope our response addressed your concerns, and in particular, that our explanation of the Corresponding Positional Embedding (CPE) was clear. We are happy to provide any further clarification you may need.
I still don't think that this work is revolutionizing at the computational or algorithmic level, but I think that it shows that additional information is contained in the entire eye movement data, which seems like a useful result. I therefore maintain my comparatively positive evaluation.
I have now not only checked the rebuttal from the authors but also read through all the answers to the other reviewers' comments. As I said, I think that this is a valuable contribution and therefore, I maintain my overall positive score. I will neither lower nor increase my score.
We thank the reviewer for their feedback and for finding our contribution valuable.
Dear reviewers,
Thank you for your time and effort in reviewing this submission. We are in the phase of author-reviewer discussion until 6th August.
I highly encourage you to discuss with the authors for clarification of your concerns. Your active participation in the discussion will be the main guarantee for high-quality publications in our community.
We now have mixed reviews for this paper submission. I hope you will read the rebuttal and each other's comments and engage in a respectful discussion. Thank you!
Best regards, Your AC
The reviewers appreciated the novelty of the proposed idea and the strong experimental results. Overall, the ratings from the reviewers are positive. Some concerns were raised regarding certain claims in the paper, as well as the learning aspects, and the limited scope of training and testing on minimal datasets. These issues were addressed to a reasonable extent during the rebuttal. I encourage the authors to further clarify these points and strengthen the experimental validation in the final version.