PaperHub
Overall: 7.3 / 10
Spotlight · 4 reviewers (ratings: 5, 4, 5, 4; min 4, max 5, std 0.5)
Average confidence: 3.8
Novelty: 3.0 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We address the challenge of predicting where a person's visual perception will focus next within their three-dimensional environment.

Abstract

Keywords
Egocentric vision · Gaze anticipation · 2D to 3D lifting

Reviews and Discussion

Review (Rating: 5)

This paper introduces EgoSpanLift, a method that (1) lifts egocentric 2D gaze history into 3D volumetric “visual spans” via SLAM-derived keypoints, and (2) forecasts future spans using a hybrid 3D U-Net + transformer framework. To evaluate, the authors curate two large-scale benchmarks, FoVS-Aria and FoVS-EgoExo, and demonstrate substantial gains over adapted 2D gaze and 3D localization baselines across multiple metrics.
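For readers less familiar with this kind of hybrid design, the sketch below gives a minimal, hypothetical PyTorch illustration of a spatial 3D convolutional encoder followed by a causal transformer over time. The class name, channel widths, grid size, forecast horizon, and the single-scale encoder (standing in for a full 3D U-Net) are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a "3D spatial encoder + unidirectional temporal transformer"
# forecaster, loosely following the reviewer's description above. All sizes and names
# are illustrative only, not the authors' architecture.
import torch
import torch.nn as nn


class TinySpanForecaster(nn.Module):
    def __init__(self, in_ch=4, hid=32, grid=16, horizon=3):
        super().__init__()
        self.horizon, self.in_ch, self.grid = horizon, in_ch, grid
        # Spatial encoder: two strided 3D convolutions (a stand-in for a 3D U-Net encoder).
        self.enc = nn.Sequential(
            nn.Conv3d(in_ch, hid, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(hid, hid, 3, stride=2, padding=1), nn.GELU(),
        )
        feat_dim = hid * (grid // 4) ** 3
        self.proj = nn.Linear(feat_dim, 256)
        # Unidirectional (causal) transformer over the sequence of per-timestep grid embeddings.
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(256, horizon * in_ch * grid ** 3)

    def forward(self, spans):                        # spans: (B, T, C, D, H, W)
        B, T = spans.shape[:2]
        x = self.enc(spans.flatten(0, 1))            # (B*T, hid, D/4, H/4, W/4)
        x = self.proj(x.flatten(1)).view(B, T, -1)   # (B, T, 256)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # causal mask
        x = self.temporal(x, mask=mask)              # attend only to past observations
        out = self.head(x[:, -1])                    # forecast from the latest timestep
        return out.view(B, self.horizon, self.in_ch, self.grid, self.grid, self.grid)


# Toy usage: a 2-step history of 4 span levels over a 16^3 voxel grid.
model = TinySpanForecaster()
future = model(torch.zeros(1, 2, 4, 16, 16, 16))
print(future.shape)  # torch.Size([1, 3, 4, 16, 16, 16])
```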

Strengths and Weaknesses

Strengths

  • The problem formulation is interesting and practical.
  • The writing about the methodology is clear.
  • The methodology itself is solid and concise, with ample ablation studies.
  • The benchmarking results are comprehensive, and the performance gain is substantial.

Weaknesses

Questions

  1. While the author says the latency introduced by the method is "trivial", I wonder if there is any data related to this? I would think any AR/VR application would be very sensitive to latency.

  2. Some very minor formatting/grammar issues:

  • Binary cross-entropy and BCE should be associated at some point (maybe when the term first appears in line 280), so that $L_{BCE}$ in Table 1 is less confusing.
  • In Fig. 1 and Fig. 3, the "truth" in "Ground truth" is not capitalized, despite all other words being capitalized.
  • In Sec 5.1 and Sec 5.2, it should be "Forecasting", not "Forcasting".

Limitations

Yes

Final Justification

I believe this paper is a positive contribution to the community. I will maintain my original positive rating.

Formatting Issues

N/A

Author Response

We thank reviewer 7cjL for the helpful feedback. To summarize all reviewer feedback, we are encouraged that the reviewers find our problem formulation interesting and advantageous (7cjL, LuMp, 23sp) while being methodologically solid and innovative (7cjL, LuMp) with clear writing (7cjL, k89X). Additionally, they find that our comprehensive experimental results demonstrate significant performance improvements (7cjL, LuMp, 23sp) with the curation of two new large-scale benchmarks (LuMp, 23sp). Below, we address the specific questions raised in reviewer 7cjL’s comments.

(7cjL-Q1) Detailed analysis on the latency of our framework

Our framework has two primary sources of latency: (i) extracting the relevant set of points from gaze and SLAM keypoints (performed every 100ms), and (ii) performing model inference every second on a set of points spanning two seconds.

Since the point extraction can be pre-computed and stored at 10 fps for continuous use, we only need to consider the computation time for processing the final observation when calculating inference latency. We measured this in a more resource-constrained environment than our training setup, using 8 CPU cores and a GPU with 12 GB of VRAM (e.g., a Titan X). In the table below, all operations except model inference run on the CPU, with the following execution times:

| Stage | Operation | Time |
| --- | --- | --- |
| Gaze and keypoint processing | Point preprocessing | 4.541 ± 1.999 ms |
| | 3D visual span localization | 1.811 ± 0.824 ms |
| From keypoints to model output | Voxelization | 45.406 ± 26.223 ms |
| | Model inference | 19.483 ± 8.234 ms |
| Total | Average inference latency | 71.241 ms |

The first stage, which handles outlier removal, axis-aligned bounding box cropping for keypoints, and selection of points within a certain degree of eccentricity from the gaze, can be processed within 10ms. In the second stage, we identified that the primary bottleneck lies in voxelization rather than in the model itself. This occurs because the large number of keypoints from the previous stage must be voxelized, whereas the model operates efficiently once it receives the 3D voxelized representation.
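To make the voxelization bottleneck concrete, the following is a minimal NumPy sketch (our own illustration, not the paper's code) of binning a large set of 3D keypoints into a fixed occupancy grid; the grid size, bounds, and point count are made-up values, and the cost scales with the number of keypoints rather than with the forecasting model.

```python
# Minimal, illustrative voxelization of 3D keypoints into a binary occupancy grid.
# Grid size, bounds, and point counts are made up for illustration only.
import numpy as np

def voxelize(points, grid_size=64, bounds=(-2.0, 2.0)):
    """points: (N, 3) array of 3D keypoints in metres -> (G, G, G) occupancy grid."""
    lo, hi = bounds
    idx = np.floor((points - lo) / (hi - lo) * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)             # keep out-of-range points on the border
    grid = np.zeros((grid_size,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True     # one write per point: O(N) in keypoints
    return grid

points = np.random.uniform(-2.0, 2.0, size=(200_000, 3))  # semidense clouds can be large
occupancy = voxelize(points)
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```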

Consequently, the average inference latency is 71.241ms, yielding a real-time factor of 0.036 when processing 2-second input for multi-second forecasting. This confirms our claim that the framework supports fast processing. However, actual AR/VR environments typically operate with even fewer computational resources, and thus additional optimization techniques such as model quantization or more efficient voxelization could be considered for further performance improvements.
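As a quick sanity check, the reported real-time factor follows directly from the measured latency, assuming it is defined as processing time divided by the duration of the observed input window:

$$\mathrm{RTF} = \frac{\text{average inference latency}}{\text{input duration}} = \frac{71.241\,\text{ms}}{2000\,\text{ms}} \approx 0.036$$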

(7cjL-Q2) Suggestions on writing

Thank you for your suggestions on the writing. We will fix these formatting and grammar issues in the revision.

Comment

I appreciate the authors' effort in addressing my concerns. I will maintain my positive rating.

Comment

Thank you for your follow-up comment and for maintaining your positive evaluation. We are truly grateful for your time and effort in helping us improve the manuscript.

Review (Rating: 4)

This paper tackles a novel problem of egocentric 3D visual span forecasting. It introduces EgoSpanLift, a cone-based lifting mechanism that transforms 2D gaze data into 3D multi-level volumetric representations, and proposes a temporal forecasting model to predict future human visual attention in 3D space. The authors also curate two datasets for benchmarking this task. The proposed method demonstrates superior performance compared to existing baselines and shows promising potential for downstream applications in egocentric vision.

Strengths and Weaknesses

Strengths:

  1. The paper introduces an interesting and timely new task that can significantly advance egocentric vision, especially in interaction-oriented settings.
  2. The proposed 3D gaze representation and forecasting framework are innovative and clearly motivated by cognitive theories and real-world applications.

Weaknesses:

  1. The evaluation on FoVS-EgoExo (Table 3) is somewhat limited. Unlike FoVS-Aria (Table 1), it lacks comprehensive results for extreme baselines such as Global Prior, 2D methods lifted via EgoSpanLift, and 3D localization models like OccFormer or VoxFormer. A fuller set of comparisons would strengthen the conclusions.
  2. Several recent egocentric and interaction-guided attention prediction works are missing in the Related Work section. For example: "Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition" (CVPR'24).

Questions

  1. The paper mentions real-time or latency-sensitive applications. Is the current model suitable for deployment on mobile or AR/VR hardware platforms?
  2. Could the authors elaborate more on the practical applications of 3D visual span forecasting in the real world? What new possibilities does this 3D representation unlock that 2D gaze prediction cannot?
  3. Why did you choose IoU and F1 as the primary evaluation metrics? Could commonly used 2D saliency metrics like KLdiv or CC also be applicable or provide complementary insights?
  4. It might be interesting to consider subject-level information in future work, as different individuals can exhibit diverse gaze patterns. Incorporating such factors could potentially improve generalization across users.
  5. In the broader impact section, the authors mention that this technology may help individuals with cognitive or perceptual impairments gain non-invasive access to desired objects or information. I would be curious to learn more about this point. Could the authors clarify which specific groups of users they are referring to, and how such support might be practically implemented?

Please refer to the above weaknesses and questions. If addressed appropriately, I would be open to raising my score.

Limitations

Yes.

Final Justification

Thank you for your detailed response, which has addressed most of my concerns. The computation time may still be insufficient to meet the demands of real-world applications, but this exceeds the scope of contribution for a single paper. I suggest that the authors (1) include a discussion on the primary evaluation metrics in the final version, as this is crucial for presenting a new paradigm, and (2) supplement the paper with more related works on egocentric and interaction-guided attention prediction, as the current version is lacking in this regard. I am willing to maintain my positive initial score.

Formatting Issues

None.

Author Response

We thank reviewer LuMp for the constructive comments. To summarize all reviewer feedback, we are encouraged that the reviewers find our problem formulation interesting and advantageous (7cjL, LuMp, 23sp) while being methodologically solid and innovative (7cjL, LuMp) with clear writing (7cjL, k89X). Additionally, they find that our comprehensive experimental results demonstrate significant performance improvements (7cjL, LuMp, 23sp) with the curation of two new large-scale benchmarks (LuMp, 23sp). Below, we address the specific questions raised in reviewer LuMp's comments.

(LuMp-Q1) Additional experiments on FoVS-EgoExo

Due to space constraints, Table 3 previously reported only partial entries. We further report two additional ablation experiments, three 3D localization baselines, and one 2D gaze anticipation baseline:

| Method | O-IoU | O-F1 | P-IoU | P-F1 | C-IoU | C-F1 | F-IoU | F-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CSTS + EgoSpanLift + [60] | - | - | 0.4978 | 0.6398 | 0.2867 | 0.4010 | 0.1556 | 0.2107 |
| OccFormer | 0.1287 | 0.2280 | 0.0920 | 0.1685 | 0.0251 | 0.0490 | 0.0084 | 0.0167 |
| VoxFormer | 0.1896 | 0.3188 | 0.1475 | 0.2571 | 0.0620 | 0.1168 | 0.0179 | 0.0350 |
| EgoChoir | 0.3287 | 0.4948 | 0.2851 | 0.4437 | 0.1976 | 0.3300 | 0.1266 | 0.2247 |
| Ours w/ BCE loss | 0.5112 | 0.6621 | 0.4905 | 0.6338 | 0.3722 | 0.4867 | 0.2870 | 0.3578 |
| Ours w/o global emb. | 0.4998 | 0.6542 | 0.4892 | 0.6381 | 0.3954 | 0.5294 | 0.3475 | 0.4500 |
| Ours | 0.5317 | 0.6827 | 0.5255 | 0.6694 | 0.4462 | 0.5730 | 0.3977 | 0.4932 |

Overall, the results show a performance distribution similar to that reported on FoVS-Aria, except that the 2D baseline often outperforms EgoChoir, presumably due to the increased 3D scene complexity compared to FoVS-Aria. Due to time constraints, additional rows will be incorporated into the draft and this response once they become available.

(LuMp-Q2) Detailed analysis on the latency of our framework

Our framework has two primary sources of latency: (i) extracting the relevant set of points from gaze and SLAM keypoints (performed every 100ms), and (ii) performing model inference every second on a set of points spanning two seconds.

Since the point extraction can be pre-computed and stored at 10 fps for continuous use, we only need to consider the computation time for processing the final observation when calculating inference latency. We measured this in a more resource-constrained environment than our training setup, using 8 CPU cores and a GPU with 12 GB of VRAM (e.g., a Titan X). In the table below, all operations except model inference run on the CPU, with the following execution times:

| Stage | Operation | Time |
| --- | --- | --- |
| Gaze and keypoint processing | Point preprocessing | 4.541 ± 1.999 ms |
| | 3D visual span localization | 1.811 ± 0.824 ms |
| From keypoints to model output | Voxelization | 45.406 ± 26.223 ms |
| | Model inference | 19.483 ± 8.234 ms |
| Total | Average inference latency | 71.241 ms |

The first stage, which handles outlier removal, axis-aligned bounding box cropping for keypoints, and selection of points within a certain degree of eccentricity from the gaze, can be processed within 10ms. In the second stage, we identified that the primary bottleneck lies in voxelization rather than in the model itself. This occurs because the large number of keypoints from the previous stage must be voxelized, whereas the model operates efficiently once it receives the 3D voxelized representation.

Consequently, the average inference latency is 71.241ms, yielding a real-time factor of 0.036 when processing 2-second input for multi-second forecasting. This confirms our claim that the framework supports fast processing. However, actual AR/VR environments typically operate with even fewer computational resources, and thus additional optimization techniques such as model quantization or more efficient voxelization could be considered for further performance improvements.

(LuMp-Q3) Practical applications of 3D visual span forecasting

As our title "Gaze Beyond the Frame" implies, our 3D representation offers significant competitive advantages over 2D methods in scenarios involving perception and interaction beyond the current visual frame. While 2D approaches are constrained to forecasting on existing visual frames, our method can predict attention patterns in previously unseen spatial regions. Foveated rendering in VR/MR systems exemplifies this advantage: when users turn their head to attend to different contexts, our 3D spatial awareness maintains consistent predictions across head movements and viewpoint changes, whereas 2D methods struggle with such dynamic scenarios.

Moreover, our approach potentially enables various forms of proactive environmental assistance by integrating ambient intelligence. For instance, systems can adjust lighting or display configurations in spaces where users are predicted to look, facilitating faster information perception. Additionally, by considering users' established focusing patterns, ambient assistants can proactively remind users of information they might otherwise overlook, creating more intuitive human-environment interactions.

Temporally persistent 3D representations offer another key advantage. While 2D-based forecasting becomes less reliable as self-motion increases, our framework utilizes 3D spatial representation that is relatively robust to self-motion. This opens up the potential for extending our framework for forecasting on a longer horizon, which could enable wearable devices to learn and anticipate user behaviors in their daily 3D environments rather than being limited to momentary 2D observations.

Lastly, our experimental results demonstrate that 3D visual span predictions, when back-projected to 2D image planes, achieve performance comparable to 2D-specific models without requiring 2D-specific training. This suggests that our 3D approach not only unlocks new capabilities but also maintains the effectiveness of traditional 2D methods while providing the additional spatial understanding necessary for future egocentric applications.
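As an illustration of the back-projection step mentioned above, the sketch below projects occupied voxel centres onto a 2D image plane with a pinhole camera model. The intrinsics, pose, and voxel layout are hypothetical placeholders; the authors' actual 2D evaluation protocol may differ.

```python
# Illustrative back-projection of occupied voxel centres onto a 2D image plane using a
# pinhole camera model. Intrinsics, pose, and voxel layout are hypothetical.
import numpy as np

def project_voxels(voxel_centers, K, R, t, image_hw=(480, 640)):
    """voxel_centers: (N, 3) world coordinates -> boolean 2D mask of the projected span."""
    cam = R @ voxel_centers.T + t[:, None]          # world -> camera coordinates, (3, N)
    in_front = cam[2] > 1e-6                        # keep points in front of the camera
    uv = K @ cam[:, in_front]
    uv = (uv[:2] / uv[2]).round().astype(int)       # perspective divide -> pixel coordinates
    H, W = image_hw
    mask = np.zeros((H, W), dtype=bool)
    valid = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    mask[uv[1, valid], uv[0, valid]] = True
    return mask

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])   # made-up intrinsics
R, t = np.eye(3), np.zeros(3)                                  # identity pose for the example
centers = np.random.uniform([-1, -1, 1], [1, 1, 3], size=(1000, 3))
print(project_voxels(centers, K, R, t).sum(), "pixels covered")
```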

(LuMp-Q4) Rationale behind our primary evaluation metrics

We primarily build upon the recent 2D egocentric gaze estimation and anticipation literature [17,33,34], where F1, Precision, and Recall are well-established metric choices. We further complement these metrics by incorporating measures that reflect the nature of 3D forecasting: IoU for measuring overlaps between volumetric regions and metric distance (Table 2) to compensate for narrow regions from the foveal span.
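For concreteness, here is a minimal sketch (our own illustration, not the authors' evaluation code) of how IoU and F1 can be computed between predicted and ground-truth binary voxel grids:

```python
# Illustrative voxel-level IoU / F1 between binary 3D grids; not the authors' evaluation code.
import numpy as np

def voxel_iou_f1(pred, gt):
    """pred, gt: boolean arrays of identical shape (D, H, W)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / max(tp + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return iou, f1

pred = np.zeros((32, 32, 32), dtype=bool); pred[8:20, 8:20, 8:20] = True
gt = np.zeros((32, 32, 32), dtype=bool);   gt[10:22, 10:22, 10:22] = True
print(voxel_iou_f1(pred, gt))  # partial overlap -> IoU ~0.41, F1 ~0.58
```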

To provide complementary analysis on saliency metrics in our benchmarks, we report CC and AUC over the logit distribution of 3D grids using three variants of our framework. Note that AUC is reported instead of KLD due to the zero-value sensitivity of KLD [TPAMI’19] observed in our evaluation:

| Method | O-IoU | O-F1 | O-CC | O-AUC | P-IoU | P-F1 | P-CC | P-AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single-task | 0.5831 | 0.7237 | 0.7253 | 0.8610 | 0.4721 | 0.6154 | 0.6295 | 0.7913 |
| w/o Global | 0.5602 | 0.7042 | 0.7113 | 0.8470 | 0.4647 | 0.6128 | 0.6308 | 0.8265 |
| Ours | 0.5838 | 0.7247 | 0.7293 | 0.8582 | 0.4886 | 0.6350 | 0.6492 | 0.8402 |

| Method | C-IoU | C-F1 | C-CC | C-AUC | F-IoU | F-F1 | F-CC | F-AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single-task | 0.3351 | 0.4485 | 0.4679 | 0.6736 | 0.2494 | 0.3193 | 0.3398 | 0.5240 |
| w/o Global | 0.3242 | 0.4476 | 0.4767 | 0.7651 | 0.2624 | 0.3487 | 0.3755 | 0.6132 |
| Ours | 0.3513 | 0.4762 | 0.5020 | 0.7714 | 0.2836 | 0.3709 | 0.3990 | 0.6242 |

The results demonstrate patterns that are generally consistent with previously reported metrics. We will include this additional analysis in the Appendix.

(LuMp-Q5) Elaboration on users with cognitive and perceptual impairments

Several user groups with cognitive and perceptual impairments could potentially benefit from this technology. These include individuals with visual perceptual deficits who may miss objects in their peripheral vision, those with age-related visual degeneration who experience slower or less accurate visual targeting compared to typical users, and people with mild cognitive impairment who could benefit from contextual reminders about their behavior sequences and task progression.

The implementation of assistive support for these users would vary significantly depending on the underlying device or platform. For instance, in display-equipped systems, highlighted overlays could be used to draw attention to forecasted visual spans, helping users identify areas of upcoming interaction. In environments with voice assistants or ambient intelligence systems, direct or indirect cues (e.g., haptic or voice assistant notification) could be provided to users regarding perceptually overlooked contexts, thereby facilitating both perception and interaction. Such proactive assistance has the potential to reduce cognitive load while maintaining user autonomy by anticipating user needs based on their natural gaze patterns and environmental context.

(LuMp-Q6) Suggestions on future work and relevant papers

Thank you for your suggestion on the related work and future direction. Enhancing subject-aware personalization of 3D forecasting could be an interesting extension by incorporating methods like test-time adaptation. We will discuss them in our final draft.

Reference

[TPAMI’19] Bylinskii et al. (2019). What Do Different Evaluation Metrics Tell Us about Saliency Models?

Comment

LuMp, could you please take a look at the author response above and indicate whether it addresses any remaining concerns you have, e.g., the limited evaluation?

Comment

Thank you for your detailed response, which has addressed most of my concerns. The computation time may still be insufficient to meet the demands of real-world applications, but this exceeds the scope of contribution for a single paper. I suggest that the authors (1) include a discussion on the primary evaluation metrics in the final version, as this is crucial for presenting a new paradigm, and (2) supplement the paper with more related works on egocentric and interaction-guided attention prediction, as the current version is lacking in this regard.

Comment

Thank you for your additional feedback, and we are delighted to hear that our responses effectively addressed your concerns. While our experimental results in LuMp-Q2 demonstrate the capability for real-time processing (10+ inferences per second), tackling the challenges associated with real-world deployment could be an interesting avenue for future extension. We have incorporated the clarification regarding our choice of metrics and expanded the related work section with additional comprehensive coverage. We will ensure that these improvements are well-reflected in the final draft.

Review (Rating: 5)

This paper presents a new way to predict where a person will look next, shifting the problem from a 2D image to a 3D environment. The authors argue that this is more robust/fundamental because people naturally direct their attention to objects and locations in 3D space. Their solution has two main parts:

  • The first, EgoSpanLift, takes the video from a user's perspective and uses SLAM to build a 3D map of the surroundings in real-time. Then, using the user's gaze direction, it performs a simple geometric calculation to determine which 3D points fall within different levels of their visual span (ranging from a tiny foveal focus to a much wider peripheral one). The result is a set of 3D grids that represent where the user's attention is focused at any given moment.
  • The second is a forecasting network designed to predict what these 3D attention grids will look like in the near future. It takes a history of the grids created by EgoSpanLift and feeds them into a model that uses a 3D U-Net to understand the spatial patterns of attention and a transformer to understand how that attention moves over time. The final output is a predicted 3D volumetric region where the user's visual perception will focus next. To prove their method works, the authors created two new large-scale benchmarks from existing egocentric video datasets and showed that their approach outperforms a wide range of other methods.

Strengths and Weaknesses

Strengths:

  • IMO the paper frames the problem of visual attention/gaze estimation in a sensible fashion. By constructing a state of the 3D environment (via SLAM) and modeling the user's focus here, this formulation provides many advantages (e.g., disentanglement from the user's own motion, 3D is more fundamental than 2D).
  • The authors went through the process of curating two large-scale benchmarks, FoVS-Aria and FoVS-EgoExo, specifically for this new 3D forecasting problem.
  • They show that their framework significantly outperforms a wide range of other approaches, including models adapted from both 2D gaze prediction and other 3D localization tasks. The improvement is especially large when predicting the user's precise/foveal gaze.

Weaknesses:

  • The proposed system relies on getting a good 3D map from SLAM. The authors harness "3D semidense keypoints" from SLAM, but I'm unclear how important it is / what happens when the SLAM algorithm struggles (e.g. during fast motion) such that a good map is not produced. If the foundational map is inaccurate, the entire gaze forecasting pipeline will be thrown off, limiting its reliability in the real world.
  • It might be too computationally heavy for real-world use right now, and it would be nice if the authors can shed some light on this aspect. The forecasting network uses a 3D U-Net and a unidirectional transformer on a sequence of 3D volumetric grids. This kind of architecture is powerful but also known to be computationally hungry. For latency-sensitive applications like AR/VR running on a mobile headset, this could be too slow.

Questions

Please address my comments under Weaknesses.

Limitations

N/A

Final Justification

I have read the rebuttal carefully and decided to update my review -- same rating (5/accept) but with a higher confidence (4).

Formatting Issues

None.

Author Response

We thank reviewer 23sp for the constructive feedback. To summarize all reviewer feedback, we are encouraged that the reviewers find our problem formulation interesting and advantageous (7cjL, LuMp, 23sp) while being methodologically solid and innovative (7cjL, LuMp) with clear writing (7cjL, k89X). Additionally, they find that our comprehensive experimental results demonstrate significant performance improvements (7cjL, LuMp, 23sp) with the curation of two new large-scale benchmarks (LuMp, 23sp). Below, we address the specific questions raised in reviewer 23sp's comments.

(23sp-Q1) What happens when SLAM struggles?

Since our approach utilizes semidense keypoints derived from SLAM, their quality could affect performance. However, this applies to any framework that relies on SLAM-derived keypoints as input. Still, it is feasible to utilize accurate pre-mapped information as a fallback option, since our framework would typically be deployed in familiar everyday environments, e.g., homes and offices.

To conduct a sensitivity analysis examining how the performance of our framework changes when SLAM struggles, we explore several types of sensory corruption across mild to severe scenarios. First, multi-sensor devices may experience temporary frame drops or time drift due to hardware issues, representing temporal corruption. Second, spatial degradation in egocentric localization may occur, which we apply separately to translation and rotation components. Finally, we consider corruption that can affect the set of keypoints by adding Gaussian noise to individual points or dropping certain points entirely. Performance results measured on the FoVS-Aria test split for each corruption type are presented in the following table:

| Method | O-IoU | O-F1 | P-IoU | P-F1 | C-IoU | C-F1 | F-IoU | F-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (EgoChoir) | 0.4959 | 0.6579 | 0.4302 | 0.5581 | 0.2612 | 0.3608 | 0.1987 | 0.2311 |
| Ours | 0.5838 | 0.7247 | 0.4886 | 0.6350 | 0.3513 | 0.4762 | 0.2836 | 0.3709 |
| Temporal (5%) | 0.5592 | 0.6956 | 0.4666 | 0.6071 | 0.3343 | 0.4543 | 0.2697 | 0.3530 |
| Temporal (10%) | 0.5347 | 0.6674 | 0.4457 | 0.5810 | 0.3194 | 0.4344 | 0.2571 | 0.3371 |
| Translation (5cm) | 0.5814 | 0.7226 | 0.4871 | 0.6332 | 0.3399 | 0.4661 | 0.2611 | 0.3498 |
| Translation (10cm) | 0.5727 | 0.7158 | 0.4780 | 0.6258 | 0.3129 | 0.4407 | 0.2174 | 0.3022 |
| Rotation (2.5°) | 0.5818 | 0.7227 | 0.4882 | 0.6338 | 0.3461 | 0.4717 | 0.2681 | 0.3563 |
| Rotation (5°) | 0.5775 | 0.7193 | 0.4836 | 0.6298 | 0.3328 | 0.4575 | 0.2490 | 0.3357 |
| Gaussian (2.5cm) | 0.5759 | 0.7190 | 0.4839 | 0.6313 | 0.3432 | 0.4697 | 0.2758 | 0.3661 |
| Gaussian (5cm) | 0.5292 | 0.6831 | 0.4503 | 0.6037 | 0.3133 | 0.4436 | 0.2364 | 0.3336 |
| Point drop (10%) | 0.5823 | 0.7231 | 0.4890 | 0.6346 | 0.3500 | 0.4751 | 0.2823 | 0.3691 |
| Point drop (20%) | 0.5799 | 0.7212 | 0.4871 | 0.6330 | 0.3498 | 0.4741 | 0.2806 | 0.3676 |

Under mild corruption, the impact on performance is negligible. However, applying higher-intensity corruption results in performance degradation of several percentage points. Among the different spans, those requiring wider coverage (Orientation and Peripheral) show smaller performance drops. In contrast, spans demanding higher precision (Central and Foveal) exhibit larger degradation. Despite these challenges, our framework maintains superior performance compared to EgoChoir even under severe corruption scenarios. This suggests that our framework can perform reliable forecasting despite some degree of SLAM-induced imprecision. In practice, real-world scenarios involve multiple types of corruption occurring in complex and somewhat unpredictable combinations. Therefore, training our framework to be robust against such corruption would represent an interesting extension for future work.
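For intuition, the following is a minimal sketch of the kinds of keypoint perturbations described above (translation and rotation offsets, per-point Gaussian noise, and point drop). The helper names, magnitudes, and composition are illustrative assumptions; the exact corruption protocol behind the table may differ.

```python
# Illustrative corruptions of SLAM-derived keypoints, mirroring the categories above.
# Magnitudes and composition are examples only, not the authors' exact protocol.
import numpy as np

rng = np.random.default_rng(0)

def translate(points, offset_m=0.05):
    return points + rng.normal(scale=offset_m, size=3)            # global translation drift

def rotate(points, max_deg=2.5):
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])             # small yaw error
    return points @ Rz.T

def gaussian_noise(points, sigma_m=0.025):
    return points + rng.normal(scale=sigma_m, size=points.shape)  # per-point jitter

def drop_points(points, frac=0.1):
    keep = rng.random(len(points)) > frac                         # randomly drop a fraction
    return points[keep]

pts = rng.uniform(-2, 2, size=(10_000, 3))
corrupted = drop_points(gaussian_noise(rotate(translate(pts))))
print(pts.shape, "->", corrupted.shape)
```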

(23sp-Q2) Detailed analysis on the latency of our framework

Our framework has two primary sources of latency: (i) extracting the relevant set of points from gaze and SLAM keypoints (performed every 100ms), and (ii) performing model inference every second on a set of points spanning two seconds.

Since the point extraction can be pre-computed and stored at 10 fps for continuous use, we only need to consider the computation time for processing the final observation when calculating inference latency. We measured this in a more resource-constrained environment than our training setup, using 8 CPU cores and a GPU with 12 GB of VRAM (e.g., a Titan X). In the table below, all operations except model inference run on the CPU, with the following execution times:

| Stage | Operation | Time |
| --- | --- | --- |
| Gaze and keypoint processing | Point preprocessing | 4.541 ± 1.999 ms |
| | 3D visual span localization | 1.811 ± 0.824 ms |
| From keypoints to model output | Voxelization | 45.406 ± 26.223 ms |
| | Model inference | 19.483 ± 8.234 ms |
| Total | Average inference latency | 71.241 ms |

The first stage, which handles outlier removal, axis-aligned bounding box cropping for keypoints, and selection of points within a certain degree of eccentricity from the gaze, can be processed within 10ms. In the second stage, we identified that the primary bottleneck lies in voxelization rather than in the model itself. This occurs because the large number of keypoints from the previous stage must be voxelized, whereas the model operates efficiently once it receives the 3D voxelized representation.

Consequently, the average inference latency is 71.241ms, yielding a real-time factor of 0.036 when processing 2-second input for multi-second forecasting. This confirms our claim that the framework supports fast processing. However, actual AR/VR environments typically operate with even fewer computational resources, and thus additional optimization techniques such as model quantization or more efficient voxelization could be considered for further performance improvements.

Comment

Dear Reviewer 23sp,

We sincerely appreciate your insightful feedback on our manuscript. With the reviewer-author discussion phase nearing its conclusion, we wanted to follow up to see if our response has satisfactorily resolved the issues you raised. Should you have any additional observations or queries, we would be happy to address them.

Comment

23sp, could you please take a look at the author response above and indicate whether it addresses any remaining concerns you have, e.g., the requirement of a good 3D SLAM map?

Review (Rating: 4)

The authors propose an approach to forecast "visual spans" from egocentric recordings. The method predicts in 3D by lifting from 2D. The method is evaluated both in the 3D scenario and for 2D egocentric gaze anticipation.

Strengths and Weaknesses

Strengths & Weaknesses:

The problem is relevant and the paper quite readable. The technical description appears to be valid, and is supported by adequate figures. My main criticisms are concerning the conceptual motivation/background of the work and an inadequate representation of previous works. I am also questioning whether "egocentric span prediction" as formulated by the authors has relevance on top of the task of 3D gaze anticipation.

Concepts from visual attention research are used in an inconsistent and confusing way. Sometimes the authors talk about forecasting gaze, sometimes about predicting future visual focus, and, most prominently (and in the title of the method), they talk about forecasting future visual spans.

Concerning the terminology of "visual span": Visual span is mainly used in the context of reading and denotes the number of letters to the left and right of the fixation point that can be recognised correctly. See Frey & Bosse (2018) for a clarification on the terminology. The visual span depends on different aspects of the stimulus, e.g. character orientation or crowding. The crucial point here is that the visual span is not a cone around the gaze point that is independent of the stimulus. In contrast, the authors in the present paper do not capture visual span, as they simply define a stimulus-independent stack of cones around the gaze point. They also do not evaluate visual span prediction. To evaluate visual span, they would need to check whether their method is able to predict which objects around the gaze point can be recognised by the user.

I am not convinced that defining these regions around the gaze point gives us any new insights in the evaluation. Alternatively, the authors could simply evaluate the accuracy of 3D gaze prediction. My view is supported by Table 1 in the paper: the ordering of methods is highly consistent between Orientation metrics and Peripheral/Central/Foveal metrics. The same applies for Table 3. An alternative to the way the authors pose the problem is to simply anticipate 3D gaze, evaluate the quality of this gaze, and provide peripheral/central/foveal areas as a post-processing step.

In line 32, the authors state that "However, attempts to forecast human visual perception itself remain less explored." I am not sure about this statement, as there are already quite a few works that try to do this (and are not referenced by the authors), e.g. see Steil et al. (2018), Hu et al. (2021), Rolff et al. (2022). A similar point concerns lines 40-42: "While previous research has shown impressive results in predicting egocentric future gaze fixations on 2D image frames [16, 17], forecasting gaze for dynamic scenarios in 2D remains unclear."

The related work section is also misleading in these respects. For example, in the paragraph on "Egocentric Gaze Prediction", authors refer to works such as gaze following ([24, 25]) which is not an egocentric setting, and also not about anticipation. Relevant gaze anticipation works are not discussed (e.g. Steil et al., 2018; Hu et al., 2021; Rolff et al., 2022).

References:

  • Frey, A., & Bosse, M. L. (2018). Perceptual span, visual span, and visual attention span: Three potential ways to quantify limits on visual processing during reading. Visual Cognition, 26(6), 412-429.
  • Steil, J., Müller, P., Sugano, Y., & Bulling, A. (2018, September). Forecasting user attention during everyday mobile interactions using device-integrated and wearable sensors. In Proceedings of the 20th international conference on human-computer interaction with mobile devices and services (pp. 1-13).
  • Hu, Z., Bulling, A., Li, S., & Wang, G. (2021). Fixationnet: Forecasting eye fixations in task-oriented virtual environments. IEEE Transactions on Visualization and Computer Graphics, 27(5), 2681-2690.
  • Rolff, T., Harms, H. M., Steinicke, F., & Frintrop, S. (2022, September). Gazetransformer: Gaze forecasting for virtual reality using transformer networks. In DAGM German Conference on Pattern Recognition (pp. 577-593). Cham: Springer International Publishing.

Questions

To re-consider my score, I would like to see a convincing clarification of the concept of "visual span" (and how it relates to and motivates the authors' work), and a convincing argument why the problem cannot simply be addressed by doing 3D gaze anticipation + post-processing.

Limitations

Limitations should be discussed in the main paper.

Final Justification

The authors' promise to clarify the concept of visual span and the additional evaluations lead me to increase my score. I am still not totally convinced about where the difference to the postprocessing approach really lies. It would be great to convey a better intuition on this. I recommend not to use the name "visual span", as this terminology is too messy in the context of the authors' work and will further confuse readers.

Formatting Issues

no major issues

Author Response

We thank reviewer k89X for the detailed comments. To summarize all reviewer feedback, we are encouraged that the reviewers find our problem formulation interesting and advantageous (7cjL, LuMp, 23sp) while being methodologically solid and innovative (7cjL, LuMp) with clear writing (7cjL, k89X). Additionally, they find that our comprehensive experimental results demonstrate significant performance improvements (7cjL, LuMp, 23sp) with the curation of two new large-scale benchmarks (LuMp, 23sp). Below, we address the specific questions raised in reviewer k89X’s comments.

(k89X-Q1) Conceptual motivation of our work

As noted in L39-40, we acknowledge that the term 'visual span' originates largely from text and symbolic reading research in cognitive psychology. While we draw inspiration from this foundation, our work addresses daily and casual behaviors and interactions from an egocentric perspective, without aiming to provide precise computational modeling of the original terminology. While reading-based visual span depends on specific textual stimuli, forecasting visual perception in egocentric daily activities involves diverse, dynamic environments where stimulus-independent spatial regions provide a more practical and generalizable framework. We will clarify the terminology discrepancies in our final draft. Please note that the other three reviewers have recognized the merit of this conceptual formulation, with reviewer 7cjL highlighting our 'practical problem formulation', reviewer LuMp noting that we are 'clearly motivated by cognitive theories and real-world applications', and reviewer 23sp stating that we 'frame the problem in a sensible fashion that provides many advantages'.

Our cone-based multi-level interpretation serves as a plausible and practical representation for modeling eccentricity and periphery, which is reflected in our substantial performance gains over prior art. We note that gaze, visual focus, and visual span are known to be interconnected [VisCog’13, TPAMI’18, NatHumBehav’19], and similar concepts and wordings are commonly used in related computer vision work, including previous research [16, 17, 32, 33, 34, ETRA’10, ICCV’21]. Importantly, our core methodological and empirical contributions, which have been appreciated by the other three reviewers, remain valuable regardless of naming choices. To address any remaining terminological concerns, we are open to adopting alternative terms.
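To give an intuition for the cone-based, stimulus-independent spans, here is a toy sketch (our own illustration) that assigns keypoints to nested cones by their angular eccentricity from the gaze ray. The eccentricity thresholds and span names are placeholders, not the values used in the paper.

```python
# Toy illustration of multi-level "visual span" cones around a 3D gaze ray: each keypoint is
# tested against cones of increasing angular eccentricity. Thresholds below are placeholders.
import numpy as np

LEVELS = {"foveal": 2.0, "central": 10.0, "peripheral": 30.0, "orientation": 60.0}  # degrees

def assign_spans(points, eye_pos, gaze_dir):
    """points: (N, 3); eye_pos: (3,); gaze_dir: (3,) unit vector -> dict of boolean masks."""
    rays = points - eye_pos
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    ecc = np.degrees(np.arccos(np.clip(rays @ gaze_dir, -1.0, 1.0)))  # angular eccentricity
    return {name: ecc <= deg for name, deg in LEVELS.items()}         # nested (cumulative) cones

pts = np.random.uniform(-1, 1, size=(5000, 3)) + np.array([0.0, 0.0, 2.0])
masks = assign_spans(pts, eye_pos=np.zeros(3), gaze_dir=np.array([0.0, 0.0, 1.0]))
print({name: int(mask.sum()) for name, mask in masks.items()})
```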

(k89X-Q2) Rationale behind our multi-level representation

While the monotonic results in Table 1 across different spans may appear redundant, this impression arises simply from the broad prediction target regions. In reality, multi-level interpretation carries significant implications due to uncertainties in self-motion forecasting and geometric correspondence between 2D and 3D spaces. To demonstrate this importance, we conducted an ablation study applying post-processing to 3D gaze forecasting using a model trained with the multi-level components removed from our framework:

| Method | P-IoU | P-F1 | C-IoU | C-F1 | F-IoU | F-F1 | 2D-F1 | 2D-Pr | 2D-Re |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D gaze + postprocessing | 0.4567 | 0.6076 | 0.2342 | 0.3423 | 0.1388 | 0.1948 | 0.515 | 0.497 | 0.535 |
| 3D gaze + postprocessing | 0.4723 | 0.6214 | 0.2666 | 0.3791 | 0.2494 | 0.3193 | 0.505 | 0.432 | 0.608 |
| Ours (Single-task) | 0.4721 | 0.6154 | 0.3351 | 0.4485 | 0.2494 | 0.3193 | 0.505 | 0.432 | 0.608 |
| Ours | 0.4886 | 0.6350 | 0.3513 | 0.4762 | 0.2836 | 0.3709 | 0.515 | 0.440 | 0.619 |

Predicting only gaze and applying post-processing is clearly suboptimal. Compared to jointly predicting all multi-level spans, the foveal forecast shows an F1 decrease of approximately 5.2 points, stressing the importance of our multi-level representation and multi-task objectives. Furthermore, the central forecast exhibits an F1 decrease of approximately 9.7 points, which is substantially larger than the 2.8-point F1 decrease observed in our model trained to predict only the central span. This demonstrates that predictive performance across different spans involves complex geometric and learning dynamics that extend beyond simple monotonic relationships. We will include this analysis in our final draft to clarify the benefit of our formulation.

(k89X-Q3) Suggestions on relevant papers

Thank you for your suggestion regarding the related work from mobile and VR research. We will cite these works [MobileHCI'18, TVCG'21, GCPR'22] and revise our discussion accordingly. Please note that we reference works [24, 25] in our related work section before narrowing our scope to egocentricity, as these are some of the seminal works for gaze comprehension in computer vision to the best of our knowledge.

Reference

[ETRA’10] Blignaut (2010). Visual Span and Other Parameters for the Generation of Heatmaps.

[VisCog’13] Nuthmann (2013). On the Visual Span during Object Search in Real-world Scenes.

[TPAMI’18] Masse et al. (2018). Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction.

[NatHumBehav’19] van Ede et al. (2019). Human Gaze Tracks Attentional Focusing in Memorized Visual Space.

[ICCV’21] Jiang and Ithapu (2021). Egocentric Pose Estimation from Human Vision Span.

Comment

The authors' promise to clarify the concept of visual span and the additional evaluations lead me to increase my score. I am still not totally convinced about where the difference to the postprocessing approach really lies. It would be great to convey a better intuition on this. I recommend not to use the name "visual span", as this terminology is too messy in the context of the authors' work and will further confuse readers.

Comment

Thank you for taking the time to review our rebuttal. We are very pleased that our clarification of our conceptual motivation and the proposed framework was helpful in addressing your concerns.

Beyond the substantial experimental gains (+9.7 C-F1 in k89X-Q2), two key conceptual differences between our framework and the postprocessing variant are (i) learning from multi-level representation and (ii) mitigating uncertainties in self-motion anticipation through end-to-end prediction. Leveraging the interconnection between gaze and periphery [15] allows us to capture cues about future gaze from previous periphery and forecast future periphery in light of previous gaze, which can be observed in several qualitative examples.

Moreover, given the nature of the viewing frustum, extending 3D gaze predictions to broader spans necessitates the forecasting of egocentric 6DoF pose trajectories with a separate postprocessing stage, which propagates uncertainties in self-motion forecasting and geometric correspondence between 2D and 3D spaces, e.g., (23sp-Q1). In contrast, our end-to-end framework does not assume a specific forecasted trajectory and predicts plausible 3D spans within the scene while implicitly learning to mitigate such uncertainties.

We will incorporate your suggestions in our final draft.

Final Decision

This paper received 2 borderline accept scores and 2 accept scores as final ratings. Initially, reviewers liked the paper for the curation of two datasets, its strong motivation, the interesting and timely nature of the proposed new task, and the clear writing. They also had some criticisms, namely concerns about the concept of a visual span and whether the problem could be solved in other ways, whether the model is too heavy for real-time applications, the underlying need for a strong SLAM map, and some missing related works. During the rebuttal stage, many of the reviewers' questions were answered with new results, additional descriptions, and analysis. Accordingly, all reviewers gave a positive final rating to the paper. The AC agrees with the scores of the reviewers, sees no need to overturn the decision, and recommends acceptance of the paper.

The AC reminds the authors that they should update the camera-ready version with the suggestions from the reviewers and the promises that they made during the rebuttal stage.