8.2 / 10

Poster · 4 reviewers

Lowest 4 · Highest 6 · Standard deviation 0.7

Confidence 4.3

Originality 3.0

Quality 3.3

Clarity 3.3

Significance 2.8

NeurIPS 2025

UniTraj: Learning a Universal Trajectory Foundation Model from Billion-Scale Worldwide Traces

Yuanshao Zhu,James Jianqiao Yu,Xiangyu Zhao,Xun Zhou,Liang Han,Xuetao Wei,Yuxuan Liang

Submitted: 2025-05-04 · Updated: 2025-10-29

TL;DR

We built the first worldwide trajectory dataset and trained a universal trajectory foundation model.

Abstract

Keywords

Spatio-Temporal Data Mining, Foundation Model, Urban Computing

Reviews and Discussion

Review

Rating: 6 · Confidence: 5 · 2025-06-22

This paper introduces a universal trajectory foundation model called UniTraj, which aims to address the limitations commonly found in existing trajectory modeling methods, such as task specificity, region dependency, and data sensitivity. To this end, the authors introduced three key innovations: first, they constructed a large-scale global trajectory dataset called WorldTrace, which contains trajectory samples from 70 countries. Second, they developed novel adaptive trajectory resampling and self-supervised trajectory masking pre-training strategies, enabling the model to learn robustly from heterogeneous data with different sampling rates and quality. Finally, they designed a flexible encoder-decoder model architecture to adapt to various trajectory tasks and effectively capture complex motion patterns. Extensive experiments demonstrate that UniTraj significantly outperforms existing methods across multiple tasks, both in zero-shot and fine-tuning settings, proving its exceptional scalability, adaptability, and generalization capabilities.

Strengths and Weaknesses

Strengths:

The research objectives of this paper hold promising insights. The author identifies the preparations required and challenges faced in building a foundation model, and the paradigm may inspire subsequent related research.
The authors constructed the WorldTrace dataset, which is unprecedented in trajectory analysis in terms of its scale and geographical diversity. This dataset provides a solid training foundation for the model in this paper, and also alleviates the problems of data scarcity and geographical bias faced by the community.
The pre-training strategies (ATR and STM) proposed in this paper are designed based on the specific characteristics of trajectory data and can effectively address the common issue of inconsistent data quality in this kind of problem.
The experimental section of this paper is comprehensive and well-founded. The authors evaluated the model on six different real-world datasets, including WorldTrace, thoroughly validating its cross-regional adaptability and supporting the claim that it is a universal foundation model.
This paper provides detailed supplementary materials, including data acquisition, processing, and pre-training strategy details, which effectively aid readers' understanding.

Weaknesses:

Although the proposed WorldTrace dataset covers 70 countries, the distribution of the data is not representative of some parts of Africa and Asia. This bias may affect the generalizability of the model in underdeveloped regions or regions with sparse data.
To achieve region independence, UniTraj was designed to use only the coordinates and timestamp information of trajectories, intentionally ignoring rich contextual information such as road networks and points of interest (POIs). This is a core design choice, but it may also result in a performance bottleneck.
The paper mentions that the model uses an asymmetric encoder-decoder structure and provides the number of parameters. However, there is a lack of detailed reasons for choosing this particular asymmetric design over other configurations.

Questions

In self-supervised pretraining, the authors set specific application ratios for four different masking strategies. Could the authors briefly explain how this ratio combination (e.g., 70% random masking) was determined? Is model performance sensitive to these ratios?
The UniTraj uses RoPE to capture the sequential relationships between trajectory points. Considering that there are other position encoding schemes in Transformer models, why was RoPE chosen, and what are its specific advantages?
From the performance table of the trajectory recovery task (Table 2), it can be seen that compared with high-quality datasets, the model has significantly higher errors on low-quality datasets (such as GeoLife and Grab-Posisi). Can the authors provide some qualitative analysis of the model's failure cases based on the existing experimental results?

Limitations

Yes.

Final Justification

This work proposes a new dataset and foundation model for trajectory data analysis. Considering the long-term scarcity of relevant datasets for research, I believe this work can bring a new perspective to the field. Therefore, I recommend accepting this paper.

Formatting Concerns

N/A

Author Response

2025-07-30

W1

We appreciate the reviewer’s thoughtful observation regarding potential geographic imbalance in the WorldTrace dataset. We agree that ensuring global representativeness is crucial for universal modeling. As acknowledged in Appendix E.1 (Limitations), some regions are less represented in WorldTrace due to public data availability and the current state of community-contributed GPS traces.

Despite these imbalances, UniTraj still demonstrates remarkable generalization across a variety of regions, as evidenced by:

As shown in Table 2 and Table 3, UniTraj performs strongly even on data from regions outside the dominant geographic areas of WorldTrace contributors, such as Grab-Posisi (from Southeast Asia).
In zero-shot settings, UniTraj shows notable robustness on these datasets, demonstrating its cross-regional generalization capacity, even with partial underrepresentation.
The self-supervised pretraining strategies (e.g., ATR and STM) were designed explicitly to mitigate region-specific overfitting by forcing the model to learn from varying sampling rates, motion patterns, and missing data.

W2

We thank the reviewer for highlighting this critical design trade-off. Indeed, our decision to omit contextual features such as road networks and POIs was a deliberate and principled choice to align with our core objective: developing a globally deployable and region-independent trajectory foundation model. Our motivation is as follows:

Many prior models rely on region-specific information (e.g., local road topologies, POIs, and semantics), which hinders transferability across cities or countries due to significant infrastructure, semantics, and data availability differences.
Our goal was to train a universal backbone that generalizes across diverse geographies and supports plug-and-play adaptation without requiring aligned contextual data from every target region. As noted in Lines 240–245 of the main paper, this decision helps UniTraj avoid hard-coded dependencies and enables zero-shot generalization to unseen areas.

Despite this minimalist design, UniTraj achieves superior performance in zero-shot and fine-tuning settings across diverse datasets and tasks (see Tables 2 & 3, Figure 2). This demonstrates that the model effectively captures complex movement patterns and spatio-temporal dependencies using only core trajectory signals. It also generalizes robustly to unseen regions, which would not be feasible if region-dependent features were required.

W3

Thank you for your constructive feedback and for highlighting the need to clarify our reasoning behind the asymmetric encoder-decoder structure in UniTraj.

Motivation: Our objective was to balance model expressiveness with computational efficiency during pre-training on large-scale trajectory data. The encoder is tasked with learning rich representations from observed trajectory segments, and thus we allocate more layers and capacity to it. The decoder is used only during pre-training to reconstruct the masked trajectory points. During downstream tasks, the decoder is discarded, and only the encoder is used. This design enables efficient fine-tuning and inference.
Empirical Support: We conducted model ablation and parameter sensitivity experiments (Appendix D.4, Figure 3c), which show that increasing the encoder depth improves representation quality up to a saturation point, while deeper decoders provide diminishing returns during reconstruction. This design reduces overfitting risk and improves generalization.

Q1

Thank you for your insightful question regarding the masking design. The masking ratio combination was determined empirically through a grid search on the validation set using the WorldTrace and Chengdu datasets. We observed that prioritizing random masking (50%) provided the strongest generalization and robustness for local and global dependencies, while the remaining ratios introduced targeted challenges relevant to specific downstream tasks.

Results in Section 5.4 (see Figure 3(d)) show that model performance is moderately sensitive to the masking ratio. Using less than 50% random masking or omitting certain masking types degraded performance, while the chosen combination achieved the best balance between reconstruction accuracy and transferability.

Q2

Thank you for highlighting the importance of discussing RoPE.

We selected RoPE over other positional encoding schemes for the following specific reasons:

Relative Position Awareness: Unlike sinusoidal or absolute positional embeddings, RoPE encodes relative distances between points via rotation in embedding space, which is critical for trajectory data where relative motion (e.g., turns, acceleration) matters more than absolute position indices.
Continuous and Scalable: RoPE naturally supports continuous and extrapolatable temporal sequences, making it better suited for modeling variable-length trajectories without discrete position limits or index overflow.
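
The relative-position property described above can be checked numerically. The following is a minimal sketch of the standard RoPE formulation applied to a single query/key pair in plain Python. It is not UniTraj's implementation: the helper names (`rope_rotate`, `dot`), the base of 10000, and the toy vectors are assumptions made for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to an even-length vector.

    Each consecutive pair (vec[i], vec[i+1]) is rotated by the angle
    position * base**(-i/d), the usual RoPE frequency schedule.
    `position` may be an integer index or a continuous value.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even embedding dimension"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# The attention score depends only on the offset, not on the absolute index.
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is only rotated, vector norms are unchanged, and no learned table of position indices is needed, which is the sense in which the scheme places no hard limit on sequence length.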

Q3

We are very grateful to the reviewers for their meticulous reading and valuable insights. The higher errors on low-quality datasets such as GeoLife and Grab-Posisi are primarily due to their irregular and sparse sampling intervals, higher levels of GPS noise, and greater diversity in travel modes. Qualitative analysis of failure cases reveals two main patterns:

Infrequent or uneven sampling often leads to large spatial gaps between points, making it challenging for the model to accurately reconstruct complex maneuvers (e.g., sharp turns or mode switches) that occur between observations.
Poor GPS quality and inconsistent data can introduce outliers or implausible points, which may mislead the model during both pretraining and recovery, resulting in larger reconstruction errors.

In addition, from our ablation studies (Table 4), we observed that removing dynamic resampling or key point masking causes sharp degradation on these datasets, indicating that temporal heterogeneity and spatial-level geometry are major failure modes. In some failure cases, the model over-smooths short, abrupt turns or fails to reconstruct loops or detours, especially when large sections are masked.
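
The first failure pattern (large spatial gaps caused by sparse sampling) can be made concrete with a small diagnostic. The sketch below is illustrative only and is not part of UniTraj: the function names, the 500 m threshold, and the sample coordinates are assumptions chosen for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def large_gaps(points, max_gap_m=500.0):
    """Return every index i where the hop points[i] -> points[i+1] exceeds max_gap_m."""
    return [
        i
        for i in range(len(points) - 1)
        if haversine_m(*points[i], *points[i + 1]) > max_gap_m
    ]

# Hypothetical (lon, lat) trace with one long unobserved stretch in the middle.
traj = [(116.3000, 39.9000), (116.3010, 39.9000), (116.3200, 39.9000), (116.3210, 39.9000)]
gaps = large_gaps(traj)
print(gaps)
```

Segments flagged this way are where a recovery model has the least evidence about turns or detours taken between observations.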

2025-08-04

Thank you for the authors' detailed response. I am very satisfied with the authors' clarification, especially the thorough analysis of real-world scenarios in Q3. I suggest the authors integrate the necessary content into the revised version. Based on the current version and the authors' response, I will raise my score accordingly.

2025-08-05

We are very happy to address your concerns. We will ensure that the necessary discussion is included in the revised version. Thank you again for your insightful comments.

Review

Rating: 5 · Confidence: 4 · 2025-06-27

This paper introduces UniTraj, a general-purpose trajectory foundation model designed to overcome the limitations caused by task dependency, regional specificity, and inconsistent data quality. The authors first construct a large-scale dataset called WorldTrace, which includes over 2.45 million trajectories and 8.8 billion GPS points collected from 70 countries. This dataset focuses on motorized movement and is carefully processed using 1Hz sampling normalization and map matching to ensure quality and consistency.

To train the model, the authors propose a two-part pre-training strategy. The first part is Adaptive Trajectory Resampling (ATR), which resamples trajectory points based on their length and ensures more uniform time intervals. The second part is Self-supervised Trajectory Masking (STM), which combines several masking strategies (random, block-based, keypoint-based, and endpoint masking) to improve robustness and generalization. These strategies allow the model to learn rich and transferable representations.
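
To make the idea of length-dependent resampling concrete, here is a minimal sketch. The exact ATR rule is not given in this thread, so the schedule in `keep_ratio` (its thresholds and minimum ratio) is entirely hypothetical, and the even index spacing in `resample` is a simplification rather than the authors' method.

```python
def keep_ratio(num_points, short=100, long=1000, min_ratio=0.2):
    """Hypothetical schedule: keep every point of a short trajectory,
    then shrink the kept fraction linearly down to min_ratio for long ones."""
    if num_points <= short:
        return 1.0
    if num_points >= long:
        return min_ratio
    frac = (num_points - short) / (long - short)
    return 1.0 - frac * (1.0 - min_ratio)

def resample(points):
    """Subsample a trajectory at evenly spaced indices, always keeping both endpoints."""
    n = len(points)
    if n <= 2:
        return list(points)
    target = max(2, round(n * keep_ratio(n)))
    # A set removes any index that rounding produces twice.
    idx = sorted({round(i * (n - 1) / (target - 1)) for i in range(target)})
    return [points[i] for i in idx]

short_traj = list(range(50))    # stand-ins for GPS points
long_traj = list(range(1000))

print(len(resample(short_traj)), len(resample(long_traj)))
```

Under these assumed settings a short trajectory passes through unchanged while a long one is thinned, so sequences of very different lengths end up closer in size before masking is applied.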

The architecture of UniTraj is based on a flexible Transformer framework. It uses RoPE and masked modeling techniques during pre-training. The encoder is designed to be versatile and can be used both for zero-shot inference and fine-tuning on specific downstream tasks. As a result, UniTraj demonstrates strong performance across various trajectory-related tasks, including reconstruction, prediction, classification, and generation. It consistently outperforms existing models, such as TrajFM and TrajBERT, particularly in zero-shot transfer settings across various geographic regions.

Strengths and Weaknesses

Strengths

The concept of a general trajectory foundation model is convincingly supported by scale and experiments.
The WorldTrace dataset is valuable for the community due to its size, diversity, and quality.
The model shows strong zero-shot performance in reconstruction, classification, and generation tasks.
Experiments are carefully designed, including ablations and analysis of geographic bias.

Weaknesses

The model architecture uses many known techniques (RoPE, masked modeling, encoder-decoder), so structural novelty is limited.
The design ignores contextual features like POIs or road networks and focuses on motorized data, which may reduce performance in some settings.
The need for high-end hardware (e.g., A100 GPUs) may limit accessibility and reproducibility.

Questions

The paper is generally well-structured and well-written. The following question is what I wanted to clarify:

For zero-shot results in Table 2, the authors mentioned “In the zero-shot setting, UniTraj achieves remarkable results, confirming it effectively captures transferable spatio-temporal patterns without requiring additional fine-tuning. The performance difference becomes particularly instructive when analyzing low-quality datasets like GeoLife and Grab-Posisi, with their highly irregular sampling intervals and multiple travel modes.”
- It is interesting to note the quality of the data, and this conjecture holds. For example, GeoLife can be cleaned up using its mode labels, and resampling may help bring the data into the expected format.

Limitations

Trajectory data often raise privacy issues, but OSM data and their motorized traces seem to be public and well-granulated. The authors mentioned some issues in E.2, but in my opinion it would be better to include them in the main text.

Final Justification

The paper is generally well-written.

Although some components seem not to be novel from existing research literature, I respect the FM's study that has a broader technical impact in various fields; after discussing my concerns with the authors, I'll keep my score to push this paper to be accepted for technical discussions.

Formatting Concerns

No major problems.

Author Response

2025-07-30

W1

Thank you for your valuable comment regarding the architectural novelty of our work. UniTraj indeed leverages established components such as RoPE, masked modeling, and the encoder-decoder framework, but our contribution goes significantly beyond their direct application. We have customized and extended these components to address the unique challenges of trajectory data. Specifically:

Adaptive Trajectory Sampling: We introduce the adaptive sampling mechanism to dynamically handle the varying density and irregular intervals typical in trajectory data. This allows the model to better represent dense urban traces and sparse intercity movements, enhancing its generalization and robustness.
Trajectory-Aware Masked Modeling: Unlike conventional masked modeling in NLP or vision, we designed four dedicated masking strategies specifically for spatio-temporal sequences with irregular sampling, frequent missing points, and noise, ensuring effective learning from real-world GPS traces.
Domain-Specific Encoding: We adapted RoPE to encode both spatial and temporal relationships jointly, rather than treating positions as a single sequence, thereby capturing the continuous and multi-dimensional nature of trajectory data.
Unified and Multi-Granularity Pretraining: UniTraj brings together several trajectory-specific pretext tasks (e.g., next-location prediction, trajectory completion) under a single framework, enabling the model to learn diverse mobility patterns across different spatial and temporal scales.

In summary, we construct UniTraj based on proven deep learning modules, primarily driven by extensive experimental validation and customized modifications based on trajectory data. The unified design represents a potential direction for trajectory data modeling.

W2

Thank you for highlighting the limitation regarding contextual features and the focus on motorized trajectory data. We appreciate this important perspective and would like to clarify our design choices:

Generality: Our primary objective was to develop a region-independent and modality-agnostic trajectory foundation model that can learn from raw GPS traces with minimal reliance on external data sources. While incorporating such contextual features can improve performance in localized settings, they limit transferability and are not consistently available across regions, particularly in developing countries or less-mapped areas.
Model Robustness to Diverse Settings: Our self-supervised masking strategies and adaptive resampling help the model handle sparse, irregular, and low-speed trajectories effectively. Therefore, despite the focus on motorized data in WorldTrace, UniTraj demonstrates strong performance on multimodal datasets, such as GeoLife and Grab-Posisi, which include walking, biking, and mixed transport modes.
Contextual Feature Integration: We agree that incorporating contextual information such as POIs and road networks can further enhance model performance. Although our current work does not exploit these features, UniTraj is modular by design and readily extended to integrate such signals as additional inputs or embedding channels in future work.

W3

Thank you for pointing out concerns about hardware requirements and reproducibility. We would like to clarify that our use of A100 GPUs was primarily to accelerate training and evaluation on large-scale datasets, and is not a necessity for running UniTraj. Inference and fine-tuning can be performed efficiently on widely available consumer hardware, such as an NVIDIA 2080Ti. Specifically, on a 2080Ti, UniTraj requires 321.68 M FLOPs, with an inference time of 0.131 seconds and memory usage of 4,241 MB for 1,000 samples.

Most importantly, UniTraj is a lightweight architecture (2.38M parameters) that is significantly smaller and more efficient than prior baselines (e.g., TrajBERT and TrajFM), allowing smooth operation on non-server-grade GPUs and even for edge-level deployment with further compression. Below is our comparison of three representative models.

Model	Params (M)	Inference Time (s/1k) ↓	Memory (MB) ↓	FLOPs (M) ↓
UniTraj	2.38	0.131	4241	321.68
TrajBERT	3.30	0.142	5785	644.57
TrajFM	3.28	0.139	6685	644.51

Q1

Thank you for your insightful observation regarding the impact of data quality and preprocessing. We agree that if sufficient prior information, such as mode labels or regularized sampling intervals, were available, data cleaning and resampling could further improve model performance.

This aligns directly with the motivation behind our ATR strategy, which is designed to handle irregular, noisy, and multimodal sampling patterns without requiring external priors. By randomly constructing a large number of possible trajectory behaviors, we aim to evaluate UniTraj's robustness and transferability under various dynamic constraints.

In our manuscript, we provide a preliminary corollary to support this perspective. Specifically, if we can convert the existing trajectory dataset to a distribution that better matches the target city, the intrinsic uncertainty (entropy) of the modeling task is reduced, making the learning problem better posed and easier for any model, like UniTraj.

Limitation

Thank you for raising the important issue of privacy in trajectory data. We fully agree that privacy is a fundamental concern in mobility data research. Our experiments utilize public and well-processed datasets such as OSM-based traces, which are less likely to contain personally identifiable information, but we also recognize that privacy risks may still be relevant.

As you pointed out, we discussed this issue in Section E of the appendix. In response to your suggestion, we will move and expand this discussion into the main text of the revised manuscript. Apart from general privacy risks related to trajectory data, we will also emphasize the handling of public and anonymous datasets in this study, as well as how privacy-sensitive information is removed.

Thank you again for your suggestion, which will help us make the manuscript more comprehensive and responsible.

Comment: Thanks

2025-08-02

Thank you for your comments. As you know, I posted a positive rating in the first review. Still, I appreciate the comments from the authors that clarify the raised comments (W1-W3, questions, limitations).

Comments on W1 and W3 are well-documented; I understand your replies. For Q1 and limitations, in addition to ethical discussions in the thread, I can understand the stated comments. These are OK for this rebuttal phase.

A minor question is for W2 (i.e., model robustness). The authors mentioned as follows.

Therefore, despite the focus on motorized data in WorldTrace, UniTraj demonstrates strong performance on multimodal datasets, such as GeoLife and Grab-Posisi, which include walking, biking, and mixed transport modes.

I understand some of the quoted data (e.g., GeoLife). To my knowledge, they include some mixed (or other) transport mode data, but they are not the majority. I'd recommend giving details of the data (e.g., ratio of transportation modes) to clarify the model's robustness if you can.

2025-08-03

We sincerely thank the reviewers for their appreciation and insightful comments, and we are pleased that our previous clarifications have alleviated your concerns.

Regarding the proportion of travel modes in datasets with mixed transportation, we provide the following details:

Dataset	Transport mode shares
GeoLife	Walk: 31.37%, Bike: 16.46%, Car: 21.04%, Bus: 31.13%
Grab-Posisi	Car: 50.08%, Motorcycle: 49.92%

We would like to express our gratitude to the reviewers for their valuable feedback and helpful suggestions.

2025-08-04

I appreciate your reply.

The posted table is interesting. We believe including a new explanation for this aspect in the main text shows the utility of UniTraj for multi-modality of transportation data.

2025-08-05

Thank you again for your thoughtful response and suggestions. We will ensure that these discussions are incorporated into the updated version.

Review

Rating: 5 · Confidence: 5 · 2025-06-29

This paper introduces UniTraj, a universal trajectory foundation model designed to overcome the limitations of existing trajectory modeling approaches—namely, task specificity, regional dependency, and sensitivity to data quality. The authors present three major contributions: (1) the construction of WorldTrace, a large-scale, diverse, and high-quality trajectory dataset sourced from over 70 countries and containing 2.45 million trajectories and 8.8 billion GPS points; (2) a set of novel pre-training strategies, including Adaptive Trajectory Resampling and Self-supervised Trajectory Masking, which enable robust learning from heterogeneous, sparsely sampled trajectory data; and (3) a flexible encoder-decoder architecture using spatial-temporal embeddings and rotary positional encodings for general-purpose trajectory modeling tasks. Extensive experiments across recovery, prediction, classification, and generation tasks demonstrate UniTraj's superior performance in both zero-shot and fine-tuning scenarios. Ablation studies validate the effectiveness of each pre-training component, and the model shows strong generalization across datasets with varying quality and geographic contexts. The work is a substantial and timely contribution to spatiotemporal AI, bringing foundation model thinking to the mobility and trajectory modeling domain. The ideas are novel, well-motivated, and empirically validated, making it a strong candidate for acceptance at NeurIPS.

Strengths and Weaknesses

Strengths

(1) The paper tackles a critical and timely challenge in trajectory modeling: building a universal foundation model that generalizes across tasks, geographies, and data qualities. The authors clearly identify the fragmentation in current approaches and the missed opportunity to bring foundation model principles (successful in NLP and vision) to the mobility domain.
(2) A key contribution is the curation of the WorldTrace dataset, which provides an unprecedented scale (2.45M trajectories, 70 countries) and diversity for trajectory modeling.
(3) The proposed pre-training strategies are carefully designed for the trajectory domain’s unique characteristics, including irregular sampling and incomplete data. The dynamic resampling technique handles varying trajectory lengths, while the masking strategies simulate missing points in both local and global contexts.
(4) UniTraj achieves consistent state-of-the-art results across a wide range of tasks and datasets.

Weaknesses

(1) The paper does not provide any analysis of computational efficiency or inference latency, which is crucial for deployment.
(2) The paper lacks discussion of ethical implications or potential privacy risks related to GPS trace use and trajectory prediction.

Questions

(1) What are the model’s training and inference times on large-scale datasets? Could the authors provide FLOP or runtime comparisons against baselines?
(2) How is user privacy preserved? What safeguards exist beyond removing identifiers, especially when releasing samples or models?

Limitations

Yes

Final Justification

After reading the authors' rebuttal and the discussion, I have decided to keep my score unchanged. The authors have addressed my concerns about computational efficiency and data privacy during model training. Overall, I think this is a good paper with clear contributions toward a universal trajectory foundation model, which has broad application scenarios. I recommend accepting the paper.

Formatting Concerns

N/A

Author Response

2025-07-30

W1

Thank you for your comment regarding computational efficiency and inference latency. While the main submission prioritized modeling and generalization, we now provide evidence that UniTraj offers superior computational efficiency over representative trajectory models:

Model	Params (M)	Inference Time (s/1k) ↓	Memory (MB) ↓	FLOPs (M) ↓
UniTraj	2.38	0.131	4241	321.68
TrajBERT	3.30	0.142	5785	644.57
TrajFM	3.28	0.139	6685	644.51

The above results were obtained on a server with an Nvidia GeForce RTX 2080 Ti GPU and an Intel(R) Xeon(R) Silver CPU. We can draw the following insights:

Parameter Efficiency: UniTraj requires significantly fewer parameters than both TrajBERT and TrajFM, reducing memory and storage demands during deployment.
Inference Speed: All three models achieved very fast inference, at roughly 0.13 to 0.14 seconds per 1,000 samples.
Memory Usage: UniTraj consumes less memory than both TrajBERT and TrajFM under identical batch and sequence settings.
Lower Computational Cost: UniTraj has the lowest FLOPs, indicating better computational efficiency.

These results demonstrate that UniTraj achieves superior computational efficiency and faster inference with a much smaller model footprint, making it better suited for real-world deployment scenarios, especially those with hardware or latency constraints.
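
To put the comparison table in relative terms, the short script below computes UniTraj's percentage reduction against each baseline for every reported metric. It uses only the numbers quoted in the table above; the dictionary keys and the `reduction_pct` helper are names introduced here for illustration.

```python
# Values copied from the authors' comparison table (RTX 2080 Ti).
models = {
    "UniTraj": {"params_m": 2.38, "time_s_per_1k": 0.131, "memory_mb": 4241, "flops_m": 321.68},
    "TrajBERT": {"params_m": 3.30, "time_s_per_1k": 0.142, "memory_mb": 5785, "flops_m": 644.57},
    "TrajFM": {"params_m": 3.28, "time_s_per_1k": 0.139, "memory_mb": 6685, "flops_m": 644.51},
}

def reduction_pct(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (1.0 - ours / baseline)

reference = models["UniTraj"]
reductions = {
    name: {metric: round(reduction_pct(reference[metric], value), 1) for metric, value in stats.items()}
    for name, stats in models.items()
    if name != "UniTraj"
}

for name, stats in reductions.items():
    print(name, stats)
```

On these figures the FLOPs gap is about half and the memory gap is roughly a quarter to a third, while the inference-time gap is only a few percent, which is consistent with the authors' remark that all three models are similarly fast.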

W2

We appreciate the reviewer’s important observation regarding ethical and privacy considerations. While the main paper focused on technical contributions due to space constraints, we also carefully considered the ethical and privacy aspects of trajectory data usage in Appendix A.4. Below are key clarifications:

Source and Anonymity: All GPS trajectories used in our work were collected from OpenStreetMap’s public GPS trace platform, which operates under the Open Data Commons Open Database License (ODbL). These traces are user-contributed and anonymized, containing no personal identifiers, and are limited to public movement paths (e.g., roads, highways).
No Individual-Level Modeling: UniTraj does not infer or use any user identity or metadata. It models aggregate spatio-temporal patterns, not individual behavior, making re-identification infeasible.
Model Application Scenarios: The model is trained and evaluated exclusively on transportation-scale movements, not on private or personal location histories. Its intended applications are in domains such as urban mobility analysis, logistics, and transportation planning, not surveillance or individualized tracking.

We will include a dedicated description in the revised manuscript to explicitly discuss these ethical considerations and privacy safeguards. Thank you again for bringing this critical aspect to our attention.

Q1

Thank you for highlighting the importance of computational efficiency at scale. We have benchmarked UniTraj and recent baselines (TrajBERT, TrajFM) on large-scale datasets, reporting both FLOPs and runtime metrics. The experiment was set up on a computing server with an Nvidia GeForce RTX 2080 Ti GPU and Intel(R) Xeon(R) Silver CPU.

Model	Time per 1k Samples (s)	Memory (MB)	FLOPs (M)
UniTraj	0.131	4,241	321.68
TrajBERT	0.142	5,785	644.57
TrajFM	0.139	6,685	644.51

Specifically, we can see that the three models have very similar inference times and can quickly complete the processing of thousands of samples. In addition, UniTraj requires substantially fewer FLOPs than both baselines, reflecting its computational efficiency. Therefore, UniTraj offers superior efficiency and scalability for both training and inference, making it well-suited for deployment on large-scale trajectory data.

Q2

Thank you for raising the crucial issue of user privacy and identifiers. Below are the safeguards that we incorporated or that are inherent to the dataset:

Data Source and Licensing: All trajectory data used in UniTraj comes from OpenStreetMap (OSM) GPS traces, which are public, anonymized, and shared under the ODbL. The original data contains no user identifiers, device IDs, or metadata that can be used to track individuals.
Preprocessing for De-Identification: During dataset construction (see Appendix A), we applied spatial-temporal normalization and filtering, including the filtering of short trips, loops, and localized trajectories. In addition, we used map-matching to align raw points to road networks, further abstracting fine-grained individual movement.
Released Samples: The publicly shared dataset includes only a processed, curated subset of trajectories.
Model-Level Safeguards: UniTraj learns from aggregated movement patterns, not individual users, and cannot reconstruct or trace back individual identities or raw trajectories. Moreover, we do not publish fine-tuned models on private or proprietary datasets, avoiding potential overfitting to any local patterns.

Review

Rating: 4 · Confidence: 3 · 2025-07-03

This paper presents UniTraj, a unified foundation model designed for learning from spatiotemporal trajectories across diverse geographic and application domains. The model is based on a Transformer encoder-decoder architecture and is pretrained using self-supervised learning on a newly constructed large-scale dataset called WorldTrace, which contains 2.45 million trajectories from over 70 countries.

To improve generalization across trajectory types and regions, the authors introduce two domain-specific pretraining techniques:

Adaptive Trajectory Resampling (ATR), which adjusts sampling density based on trajectory length to preserve structure while normalizing sequence size.
Self-supervised Trajectory Masking (STM), which applies a variety of masking strategies (random, block, key-point, and last-N) to simulate missing or partial observations during training.
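
As a rough illustration of the masking strategies named above, the following Python sketch builds boolean masks for three of them (random, block, and last-N). The function names, the mask ratio, and the toy trajectory are assumptions made for this example and are not the authors' implementation. Key-point masking is left out because the summary does not say how key points are selected.

```python
import random

def random_mask(n, ratio, rng):
    """Mask round(n * ratio) positions chosen uniformly at random."""
    k = round(n * ratio)
    masked = set(rng.sample(range(n), k))
    return [i in masked for i in range(n)]

def block_mask(n, ratio, rng):
    """Mask one contiguous block of round(n * ratio) positions."""
    k = round(n * ratio)
    start = rng.randint(0, n - k)  # inclusive bounds, so the block always fits
    return [start <= i < start + k for i in range(n)]

def last_n_mask(n, last_n):
    """Mask the final last_n positions (a prediction-style objective)."""
    return [i >= n - last_n for i in range(n)]

def apply_mask(points, mask, placeholder=None):
    """Replace masked points with a placeholder the model must reconstruct."""
    return [placeholder if m else p for p, m in zip(points, mask)]

# Toy trajectory of (longitude, latitude, seconds) tuples; values are made up.
rng = random.Random(0)
trajectory = [(116.30 + 0.001 * i, 39.98, 10 * i) for i in range(10)]
for name, mask in [
    ("random", random_mask(len(trajectory), 0.3, rng)),
    ("block", block_mask(len(trajectory), 0.3, rng)),
    ("last-N", last_n_mask(len(trajectory), 3)),
]:
    hidden = [i for i, m in enumerate(mask) if m]
    print(name, "hides positions", hidden)
```

Random masking resembles sparse dropout of GPS fixes, block masking resembles a signal outage, and last-N masking turns reconstruction into future prediction.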

UniTraj is designed to support multiple downstream trajectory tasks—including trajectory recovery, future prediction, semantic classification, and trajectory generation—using lightweight task-specific adapter modules. The model achieves strong results across a range of datasets and tasks, including in zero-shot and cross-region generalization settings, demonstrating its potential as a general-purpose trajectory learning framework.

Strengths and Weaknesses

Strengths:

Comprehensive Evaluation: The paper evaluates UniTraj on four diverse tasks—trajectory recovery, future prediction, classification, and generation—across a variety of real-world datasets from different countries. This supports the claim of broad generalizability.
Effective Pretraining Techniques: The proposed self-supervised strategies—Adaptive Trajectory Resampling (ATR) and Self-supervised Trajectory Masking (STM)—are thoughtfully designed for trajectory data and show consistent benefits in ablations.
Robust Performance: UniTraj consistently outperforms existing baselines, including zero-shot transfer across cities, demonstrating strong cross-domain generalization.
Novel Application of Pretraining Strategies: While inspired by techniques from vision and NLP, ATR and STM are well-adapted to the trajectory domain and have not been widely applied at this scale.

Weaknesses:

Potential Data Bias: WorldTrace is constructed from OpenStreetMap GPS traces, which are crowd-sourced and known to be biased toward car-based travel in urban regions, particularly in North America and Western Europe. This could limit the model’s generalization to underrepresented regions or modes of transport (e.g., walking, cycling, transit).
No Multi-Agent Modeling: Inter-agent interactions (e.g., traffic, crowds) are not modeled, which limits applicability to domains where joint behavior is essential.
Methodologically Incremental: ATR and STM are useful adaptations, but not new techniques. The architecture is based on a standard Transformer encoder-decoder.

Questions

Why did you choose a Transformer encoder-decoder architecture rather than an encoder-only (like BERT) or encoder-decoder with causal masking? Would an encoder-only model suffice for non-generative tasks?
Are you open-sourcing the dataset?
How does performance degrade on low-sample regions?
What’s the inference latency and memory footprint of UniTraj?
How sensitive is the model to trajectory length at inference?
Provide per-continent error to diagnose regional bias.

Limitations

Yes

Final Justification

The rebuttal addressed many of my original questions, especially around efficiency, geographic generalization, and length sensitivity. These responses increased my confidence in the applicability of UniTraj.

Formatting Issues

No major formatting issues

Author Response

2025-07-30

W1

Thank you for raising the important point regarding potential data bias. We agree that crowd-sourced data can exhibit geographic and modality bias. To address this concern, we provide two complementary pieces of evidence that demonstrate UniTraj’s generalization ability beyond those biases:

We explicitly evaluated UniTraj on GeoLife and Grab-Posisi, which include mixed transportation modes beyond those in the WorldTrace dataset, including walking, cycling, and public transit. As shown in Table 2 of the main results section, UniTraj achieves competitive performance on these datasets. This indicates that the model generalizes to non-motorized movement patterns and to less-developed regions, even without being trained on those exact distributions.
To further investigate the potential geographic bias, we analyzed UniTraj’s performance across different continents in the WorldTrace dataset. The results are summarized below:

	North America	Asia	Europe	Africa	Oceania	South America
MAE	10.58	9.42	10.03	10.61	10.12	9.98
RMSE	13.84	12.38	13.29	13.46	13.17	12.89
Percentage	67.97	24.68	4.40	0.20	1.52	1.23

These results show that no region suffers from severe degradation. In fact, performance in Asia and South America is comparable to or better than that in North America and Europe. This suggests that UniTraj does not overfit to overrepresented regions, and that its self-supervised, region-agnostic design (e.g., no POIs, no road networks) allows for stable cross-continental generalization.

W2

We appreciate the reviewer highlighting this limitation. We agree that explicitly modeling inter-agent interactions could improve the performance of the trajectory model. Our reasons for not modeling them in the current design are as follows:

Our current design is motivated by the need to establish a strong, generalizable, and region-agnostic framework. We hope it can be usable across domains and cities where inter-agent annotations are unavailable or infeasible to collect at scale.
Modeling multi-agent interactions typically requires high-resolution temporal alignment, shared agent IDs, and consistent scene-level context, which are often missing in large-scale, real-world GPS datasets.
Despite modeling trajectories individually, UniTraj can achieve remarkable performance. This suggests that our self-supervised trajectory learning framework captures meaningful dynamics even without explicit interaction modeling.
UniTraj is designed with a modular architecture: while it focuses on per-agent trajectory encoding, inter-agent relations could be integrated in downstream pipelines. This allows for plug-and-play compatibility with interaction-aware components without retraining the backbone.

W3

We thank the reviewer for this observation and valuable comment. We agree that the core components of UniTraj are built upon established foundations. However, we would like to clarify the methodological novelty of UniTraj, as outlined below:

Domain-Specific Innovations: We designed a unique pre-training strategy based on the unique properties of trajectories. ATR is a novel combination of multi-scale and interval-consistent resampling, tailored to address the heterogeneity of real-world GPS trajectories, including variable sampling intervals, missing points, and inconsistent granularity. We adapted RoPE to encode both spatial and temporal relationships jointly, rather than treating positions as a single sequence, thereby capturing the continuous and multi-dimensional nature of trajectory data.
Unified Pretraining: We also introduce four complementary masking strategies (STM), which are uniquely adapted to trajectory semantics. This is not a direct reuse of BERT-style masking but a task-aware design. UniTraj brings together several trajectory-specific pretext tasks (e.g., next-location prediction, trajectory completion) under a single framework, enabling the model to learn diverse mobility patterns across different spatial and temporal scales.
Empirical Impact: Although the architecture itself is incremental, our approach substantially advances trajectory modeling, as evidenced by superior performance on diverse and challenging benchmarks.

We aim to show that, with the right pretraining strategy and data design, even a standard encoder-decoder architecture can achieve powerful generalization and reusability across multiple domains.

Q1

Thank you for this insightful question regarding our architectural choices.

Why Not Encoder-Only (e.g., BERT)?

We selected an encoder-decoder architecture to better support a broad range of trajectory modeling tasks within a single, unified framework. The encoder-decoder structure, with its “destruction and reconstruction” paradigm, allows the model to more effectively learn complex spatio-temporal dependencies in trajectory data. This design is also more flexible and plug-and-play, making it adaptable to diverse downstream tasks. Additionally, the decoder can act as a projector, further enhancing the model’s versatility for various applications.

Why Not Causal Masking?

For many trajectory-related tasks, especially classification or masked point recovery, utilizing both past and future context is highly beneficial. Standard causal masking restricts the model to accessing only past information, which limits its ability to recover missing or irregularly sampled trajectory points. By adopting a flexible masking strategy rather than strict causal masking, our model can seamlessly handle both prediction and imputation tasks, thus improving its overall utility and efficiency.
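
To make the contrast concrete, here is a small Python sketch of the two attention patterns discussed above. It is a generic illustration of causal versus bidirectional attention masks, not code from UniTraj, and the helper names are assumptions.

```python
def causal_mask(n):
    """allowed[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible_context(allowed, i):
    """Indices that position i can draw information from."""
    return [j for j, ok in enumerate(allowed[i]) if ok]

# Suppose the point at index 2 of a 5-point trajectory is missing.
n, missing = 5, 2
print("causal:", visible_context(causal_mask(n), missing))
print("bidirectional:", visible_context(bidirectional_mask(n), missing))
```

Under the causal mask, the missing point at index 2 can only use indices 0 to 2, so the observed points at indices 3 and 4 cannot inform its recovery. The bidirectional mask exposes all five positions, which is what imputation and classification benefit from.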

Q2

Thank you for your interest in the dataset. We will open-source the WorldTrace dataset and ensure that its release fully complies with all relevant copyright and privacy regulations. Data sharing will follow appropriate protocols to protect individual privacy and respect data usage agreements.

Q3, Q6

We thank the reviewer for this important question. To evaluate performance across regions with different sample densities, we conducted a continent-level analysis using trajectories from each region. The results are summarized below:

	North America	Asia	Europe	Africa	Oceania	South America
MAE	10.58	9.42	10.03	10.61	10.12	9.98
RMSE	13.84	12.38	13.29	13.46	13.17	12.89
Percentage	67.97	24.68	4.40	0.20	1.52	1.23

From the table, we can observe that despite large differences in data volume, UniTraj maintains stable error metrics across all regions, with no drastic degradation in low-sample settings. These results demonstrate that UniTraj maintains robust generalization capabilities, even in regions with limited training data. We attribute this to the model’s region-agnostic design and pretraining objectives that effectively support generalization to underrepresented geographies, even when region-specific data is sparse.
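
The table can be checked with a few lines of Python. The sketch below copies the reported MAE and Percentage rows and computes the spread across continents and the share-weighted mean MAE. It assumes that the Percentage row gives each continent's share of trajectories, which the response does not state explicitly.

```python
# Values copied from the per-continent table above.
mae = {
    "North America": 10.58, "Asia": 9.42, "Europe": 10.03,
    "Africa": 10.61, "Oceania": 10.12, "South America": 9.98,
}
# Assumption: "Percentage" is each continent's share of trajectories, in percent.
share = {
    "North America": 67.97, "Asia": 24.68, "Europe": 4.40,
    "Africa": 0.20, "Oceania": 1.52, "South America": 1.23,
}

# Gap between the worst and best continent.
spread = max(mae.values()) - min(mae.values())

# Mean MAE weighted by each continent's share.
weighted_mae = sum(mae[r] * share[r] for r in mae) / sum(share.values())

print(f"MAE spread across continents: {spread:.2f}")
print(f"Share-weighted mean MAE: {weighted_mae:.2f}")
```

Under that assumption, the worst and best continents differ by 1.19 in MAE, and the weighted mean comes out near 10.26, close to the overall MAE of 10.25 reported in the trajectory-length analysis.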

Q4

We thank the reviewer for raising this practical concern. We compared inference latency, memory footprint, and FLOPs for UniTraj and two representative baselines (TrajBERT and TrajFM). The experiment was run on a computing server with an NVIDIA GeForce RTX 2080 Ti GPU and an Intel(R) Xeon(R) Silver CPU. The results are as follows:

	Time / 1k samples	Memory (MB)	FLOPs (M)
UniTraj	0.131	4241	321.68
TrajBERT	0.142	5785	644.57
TrajFM	0.139	6685	644.51

From the above analysis, we can see that the three models have very similar inference times and can quickly process thousands of samples. However, UniTraj requires approximately half the FLOPs of TrajBERT and TrajFM, and its memory usage is also substantially lower. These results demonstrate that UniTraj is more efficient in terms of memory and computation while also being slightly faster at inference, making it well-suited for large-scale and real-time trajectory applications.
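
The relative savings can be read off the table with a short calculation. This Python sketch uses only the numbers reported above; the helper name `relative_to` is an assumption for this example.

```python
# (time per 1k samples, memory in MB, FLOPs in millions), copied from the table.
results = {
    "UniTraj": (0.131, 4241, 321.68),
    "TrajBERT": (0.142, 5785, 644.57),
    "TrajFM": (0.139, 6685, 644.51),
}

def relative_to(baseline, model="UniTraj"):
    """Ratio of the model's cost to the baseline's cost, per metric."""
    names = ("time", "memory", "flops")
    return {n: m / b for n, m, b in zip(names, results[model], results[baseline])}

for baseline in ("TrajBERT", "TrajFM"):
    ratios = relative_to(baseline)
    summary = ", ".join(f"{k} {v:.0%}" for k, v in ratios.items())
    print(f"UniTraj vs {baseline}: {summary}")
```

By these ratios, UniTraj uses about 50% of the FLOPs of either baseline, about 73% of TrajBERT's memory and 63% of TrajFM's, and roughly 92% to 94% of their inference time.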

Q5

We thank the reviewer for this insightful question. To assess sensitivity to trajectory length, we grouped test samples by their length and computed the MAE and RMSE for each group:

	All	<60	60–120	120–180	>180
MAE	10.25	9.91	9.39	9.89	10.60
RMSE	13.43	13.05	12.53	13.04	13.79

From the table, we can observe that UniTraj demonstrates stable performance across all trajectory length ranges, with the lowest errors on medium-length trajectories (60–180), most clearly in the 60–120 range. Errors are slightly higher for very short (<60) and very long (>180) sequences. For very short trajectories, the limited amount of information makes it challenging for the model to fully capture spatio-temporal dependencies. For very long trajectories, the increased sequence length introduces more variability and noise, making the modeling task more difficult. It is worth noting that for long sequences (>180), the MAE increases by only about 3.4% compared to the overall average. This indicates that UniTraj maintains robust generalization even when handling trajectories with longer durations and higher complexity. In summary, UniTraj achieves consistently strong and stable performance across a wide range of trajectory lengths.
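
A grouped evaluation of this kind can be sketched as follows. The bin boundaries follow the table header, but how boundary values such as exactly 60 or 180 are assigned is an assumption here, as are the function names and the toy error values.

```python
import math

# Half-open bins [lo, hi); boundary handling is an assumption, not from the paper.
BINS = [("<60", 0, 60), ("60-120", 60, 120), ("120-180", 120, 180), (">180", 180, math.inf)]

def bin_label(length):
    """Return the label of the length bin that contains `length`."""
    for label, lo, hi in BINS:
        if lo <= length < hi:
            return label

def errors_by_length(samples):
    """samples: iterable of (trajectory_length, list of per-point errors).

    Returns {bin label: (MAE, RMSE)} pooled over all points in that bin.
    """
    grouped = {}
    for length, errors in samples:
        grouped.setdefault(bin_label(length), []).extend(errors)
    return {
        label: (
            sum(abs(e) for e in errs) / len(errs),
            math.sqrt(sum(e * e for e in errs) / len(errs)),
        )
        for label, errs in grouped.items()
    }

def relative_change(value, reference):
    """Fractional change of `value` with respect to `reference`."""
    return (value - reference) / reference

# Toy per-point errors (made up) for two trajectories of different lengths.
print(errors_by_length([(50, [3.0, -4.0]), (200, [6.0, 8.0])]))
# Reported MAE for >180 versus the overall MAE, both taken from the table.
print(f"{relative_change(10.60, 10.25):.1%}")
```

The last line reproduces the relative MAE increase for the longest trajectories from the two reported values.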

2025-08-04

That said, I still believe there is no architectural novelty in the proposed model. UniTraj uses a standard Transformer encoder-decoder architecture without introducing new components or mechanisms. The strength of the paper lies in its pretraining strategies and dataset scale, not in model design.

Comment - Kind reminder for author-reviewer discussion

2025-08-05

Dear Authors and Reviewers,

Thank you for submitting and reviewing papers for the conference. This is a kind reminder that the deadline for the author-reviewer discussion is approaching. Please participate in the discussion to clarify the paper's statements or any remaining concerns.

Thanks!

Final Decision: Accept (poster)

2025-09-17

The manuscript introduces UniTraj, a novel trajectory foundation model addressing task specificity, regional dependency, and data quality issues. It presents the WorldTrace dataset, innovative pretraining strategies, and a Transformer-based architecture with RoPE, achieving strong zero-shot and cross-regional performance.

Reviewers recognize the dataset’s scale, pretraining innovations, and comprehensive evaluation but note limited architectural novelty, geographic bias in WorldTrace, and insufficient ethical discussion (e.g., re-identification, surveillance risks). The rebuttal addresses efficiency, bias, and privacy concerns, committing to an ethics section.

I recommend acceptance. However, the authors should revise the ethics-related section addressing re-identification, surveillance, consent, and bias. The paper would make substantial contributions to spatiotemporal AI if ethical concerns are resolved.