4.2

/10

withdrawn5 位审稿人

最低3最高5标准差1.0

4.0

置信度

正确性2.4

贡献度1.8

表达2.8

ICLR 2025

UrbanDiT: A Foundation Model for Open-World Urban Spatio-Temporal Learning

Yuan Yuan,Chonghua Han,Jingtao Ding,Guozhen Zhang,Depeng Jin,Yong Li

OpenReview PDF

提交: 2024-09-15更新: 2025-01-15

TL;DR

We introduce a foundation model for open-world urban spatio-temporal learning, which integrates diverse data types and address multiple tasks.

摘要

关键词

Foundation modelUrban Spatio-Temporal LearningDiffusion TransformerPrompt Learning

评审与讨论

审稿意见

评分: 3置信度: 42024-10-21

This paper explores an emerging area of building spatio-temporal foundation models. The authors propose UrbanDiT, a general pre-trained model for various urban spatio-temporal applications. Specifically, UrbanDiT first transforms grid- and graph-based urban data into a unified sequential format. After that, UrbanDiT adopts diffusion transformer to capture spatio-temporal dependencies and encode rich urban knowledge. Additionally, different masking strategies and task-specific prompts are utilized to guide the learning process. Experimental results demonstrate superior results of UrbanDiT against state-of-the-art baselines.

优点

This paper studies general urban spatio-temporal learning, which is interesting and has wide-ranging potential applications.
The logical structure of the paper is well-organized and easy to follow.

缺点

The motivation of this paper is not well-justified. The choice of diffusion transformers seems arbitrary and does not convincingly address why they are the most suitable model, compared with STGNNs or standard transformers, for urban spatio-temporal learning tasks.
The technical novelty is limited compared with previous literature. The idea of sequential unification strategy seems to come from MOIRAI [1], and the design of masking and prompt are very similar to UniST [2], which also focuses on urban spatio-temporal learning. Besides, the authors directly employ diffusion transformer without specific modification for spatio-temporal data.
The experimental setup is unclear. The authors provide an overview of the datasets used, but it is not explicitly stated which datasets were used for pretraining and which were used solely for evaluation. This lack of clarity raises concerns about the fairness of the few-shot and zero-shot comparisons. If all datasets were included in the pretraining phase, the zero-shot and few-shot results may not provide an accurate reflection of the model's generalization capabilities.
UrbanDiT presents promising results on the zero-shot and few-shot capabilities, but these evaluations are limited to two datasets, TaxiBJ and PopSH, which are both grid-based flow datasets. To strengthen the generalization claims, it would be beneficial to include evaluations on more heterogeneous datasets, e.g., PEMS-BAY, PEMS-04, PEMS-07, and PEMS-08.
In zero-shot and few-shot experiments, the authors did not include comparisons against some of the latest large pre-trained models, such as TTM [3] and OpenCity [4]. Both of them offer open-source codes and pre-trained model weights, demonstrating promising results in zero-shot and few-shot scenarios.
The authors flatten the input data into a one-dimensional sequential format. For datasets with a large number of nodes, such as SpeedSH with over 20,000 nodes (i.e., 20,000 tokens), this approach could lead to significant computational challenges. Given the quadratic complexity of the transformer architecture with respect to the input sequence length, processing such large inputs could result in severe efficiency bottlenecks.
The authors mention that the datasets used in the experiments are publicly available. However, upon reviewing the provided link, I was unable to locate the datasets.
As an ICLR submission, the paper would benefit from stronger theoretical foundations. For instance, the authors claim that the prompts learned from the pretraining data can generalize well to unseen datasets. It would be interesting to see a theoretical explanation of why and how these learned prompts are able to generalize effectively to new, unseen urban environments.

[1] Woo, Gerald, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. "Unified Training of Universal Time Series Forecasting Transformers." In Forty-first International Conference on Machine Learning.

[2] Yuan, Yuan, Jingtao Ding, Jie Feng, Depeng Jin, and Yong Li. "Unist: a prompt-empowered universal model for urban spatio-temporal prediction." In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4095-4106. 2024.

[3] Vijay, E., Arindam Jati, Pankaj Dayama, Sumanta Mukherjee, Nam Nguyen, Wesley Gifford, Chandra Reddy, and Jayant Kalagnanam. "Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series." arXiv (2024).

[4] Li, Zhonghang, Long Xia, Lei Shi, Yong Xu, Dawei Yin, and Chao Huang. "OpenCity: Open Spatio-Temporal Foundation Models for Traffic Prediction." arXiv preprint arXiv:2408.10269 (2024).

问题

Could the authors provide more detailed insights into the motivation behind using diffusion transformers for urban spatio-temporal tasks? What are the underlying principles that make diffusion transformers particularly well-suited for modeling urban spatio-temporal data?
Could the authors explain the key difference between spatio-temporal learning and multivariate time series learning in the context of this paper? Specifically, how does UrbanDiT address the unique challenges of spatio-temporal data that differ from standard multivariate time series prediction?
In Table 4, a straightforward baseline for spatial extrapolation could be to use the nearest observed spatial point's value as the prediction for unobserved points. Could the authors provide the results for this simple baseline to give a clearer understanding of how UrbanDiT compares to basic spatial extrapolation methods?

伦理问题详情

审稿意见

评分: 3置信度: 42024-11-03

This paper presents UrbanDiT, a foundational model leveraging diffusion transformers for urban spatio-temporal prediction problems. UrbanDiT integrates grid-based and graph-based spatio-temporal data and supports various tasks, such as bi-directional prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation, using a unified prompt learning framework like UniST [1], i.e., using the input data to retrieve memory pools with diverse spatio-temporal patterns. Experiments on various spatial-temporal problems, including taxi demand, cellular network traffic, crowd flows, transportation traffic, and dynamic population, verify the universality of UrbanDiT.

优点

UrbanDiT is able to handle both grid-based and graph-based data and diverse urban spatio-temporal tasks using a unified framework.
UrbanDiT shows promising zero-shot ability on PopSH dataset.

缺点

The key idea that using prompt learning to achieve versatility on diverse datasets is almost the same as UniST [1]. Both works generate prompts by leveraging the input data to retrieve memory pools containing diverse spatio-temporal patterns. It makes this work’s technical innovation very limited.
The contribution is overclaimed. As this work is similar to UniST, which presents a unified model to address various urban spatio-temporal prediction problems, it is not the first work to explore a foundation model for urban spatio-temporal learning.
The zero-shot experiments are insufficient that only results of PopSH and TaxiBJ are provided. How about the zero-shot performance on other datasets and tasks? Additionally, as UrbanDiT’s zero-shot performance is not always better than other models with training data (refer to Figure 9), how about the zero-shot comparison with other baseline models?
The paper only provides ablation results on TaxiBJ. It’s expected to see the results on other datasets, which would be beneficial to conclude more compelling insights.

[1] Unist: a prompt-empowered universal model for urban spatio-temporal prediction. KDD, 2024.

问题

Please refer to the Weaknesses.

审稿意见

评分: 5置信度: 42024-11-04

This work presents an urban spatio-temporal foundation model that can be applied to various data types and tasks. The experimental results demonstrate the effectiveness of the proposed method.

优点

(1) The proposed model can be used for predictions across different types of spatio-temporal data.

(2) The proposed method can handle multiple spatio-temporal tasks, including spatio-temporal extrapolation, interpolation, and prediction.

(3) UrbanDiT demonstrates zero-shot generalization capabilities.

缺点

(1) The technical approach and problems addressed in this paper are very similar to the published paper UniST[1]. Both use masked recovery as the pre-training task, employ spatio-temporal Transformer as the basic architecture, and enhance model generalization across different data and tasks through spatio-temporal prompt learning. The differences mainly lie in technical details, such as using Diffusion Transformer instead of Masked Autoencoder as the basic architecture. This results in insufficient contributions of the paper.

(2) While the paper mentions similar large models like GPD, UniST, UrbanGPT, and CityGPT, the comparative experiments lack most of these spatio-temporal models. Including these comparisons would make the results more convincing.

(3) Technical and experimental details are incomplete and require more detailed explanations. For example: How are the prototype prompts in the Memory Pool obtained? What are the data sources during pre-training? Is the 6:2:2 split only for downstream tasks? Does UrbanDiT require retraining on target data/tasks during the downstream phase, such as in the experiments in Section 4.2?

(4) Scalability needs further exploration. The paper only discusses data scalability but lacks parameter scalability experiments, which is crucial for foundation models.

问题

Please refer to the Weaknesses.

审稿意见

评分: 5置信度: 52024-11-04

The paper proposes UrbanDIT, a foundational spatiotemporal learning model based on diffusion transformers that can handle multiple data types, data sources, and task scenarios. By converting both grid data and graph data into sequential forms, the authors have unified these two representations of urban data. To adapt the model to various data types and tasks, the authors have generated task-specific and data-specific prompts within a prompt learning framework, enabling broad support for diverse data and tasks. Combining the powerful generative capabilities and adaptability of diffusion transformers with flexibility across data types and tasks, UrbanDIT demonstrates notable performance and generalization capacity. Extensive experiments on traffic datasets from four cities validate the model’s performance and the effectiveness of each component.

优点

The paper innovatively integrates diverse tasks, including interpolation, extrapolation, prediction, and imputation, using a mask and task-specified prompt framework, demonstrating the homogeneity of these tasks and the feasibility of a unified approach. This broadens the application scope of urban models.
UrbanDIT proposes a unified processing method for different types of traffic data, confirming the effectiveness of serializing grid and graph data.
The paper conducts comparative experiments using a wide range of baselines and extensive datasets, fully validating the model’s performance. Additionally, ablation studies confirm the effectiveness of each model component.
Powerful zero sample performance: UrbanDiT performs well in zero sample and few sample learning, with strong generalization ability.

缺点

The paper lacks information on the GPU used and the efficiency of the model. There are concerns about whether training across all datasets and tasks based on DiT could result in significantly high computational complexity.
Given the authors' data processing approach, relying solely on the reshape function cannot restore the data to its original shape. Therefore, an output layer capable of handling various data formats is essential. However, it is puzzling that neither the paper nor the model diagram addresses the design of the output layer.

问题

Could you provide a more detailed explanation of how the graph data is processed into a sequence, ideally including changes in data shape throughout the process? According to the appendix, grid data constructs patches along spatial dimensions, whereas graph data does not appear to, which may hinder the unification of these two data formats.
For all tasks, is the model trained to predict all missing values based on the non-missing values? If so, might comparisons with CSDI be unfair, as CSDI does not have access to the ground truth of missing values during training?
What is the computational complexity and resource consumption of your model under different tasks and data scales? In practical applications, especially for large-scale urban data, are there any plans or suggestions to optimize computational efficiency?

审稿意见

评分: 5置信度: 32024-11-05

This paper introduces UrbanDIT, another FM for urban environments. When compared with previous urban FMs, UrbanDIT can handle both grid and graph data, leverage diverse data sources, is task flexible, and has zero-shot capabilities. In terms of pipeline, UrbanDIT uses diffusion models, memory module, prompt learning, and various masking strategies. The main contribution is UrbanDIT as a novel benchmark for FM in urban domain.

优点

originality: good / incremental. Given the popularity of multi-modal LLM (MM-LLM) and the task-flexibility and zero-shot capabilities that comes with them, it is to be expected that FM in other domains such as urban environment should have these capabilities as well. The contribution of this paper is in the execution.
quality: good. The selection of data, task, experiments, and analysis is sufficiently extensive from the perspective of an computer science paper that presents a new benchmark for FM. However, the extent of the utility of the resulting model for the stake holder in the urban environment domain has not been made clear.
clarity: good. It is easy to follow an understand all of the main ideas. However, some details and choices are not made clear.
significance: Excellent. Smart city is a rapidly growing field with a great potential, given the impact of cities on the planet from the perspective of energy, pollution, and sheer number of people living in urbanized environment. FM for urban environment has a great potential to revolutionaize this field, similar to how MM-LLM became mainstream in the domain of text document, images, and codes.

缺点

W1: The lack discussion on limitation and future works to briefly elaborate the gap, or at least the next step, between the proposed novel FM as a benchmark to a product that provide utility to stakeholders in the urban environment domain.

W2: Table 1 listed 4 other existing models that is most similar to UrbanDiT. However, none of these 4 models appear in the tables in the result sections, and UniST is the only one listed in subsection C1. There might be valid reasons to not include them, but it is not provided.

W3: Since unified prompt learning is a key design, more analysis than ablation is required. It would be good to have more analysis to have some insight into the interpretability of the unified prompt learning module. For example, are there any semantic meaning behind the entries in the memory pool that we can discover?

W4: There are many possible zero-shot setup, but this paper only explored one, that is the novel dataset setup. Another straight forward one is the novel task. Even the current setup, testing on PopSH, is not a truly zero-shot, since model has learned something about the population dataset (from PopBJ), and the city Shanghai (from SpeedSH).

W5: There is a lack of analysis to show the bottleneck of the current setup. To truly "inspire future research in the rapidly evolving field of foundation models" in the urban domain, it would be good if the authors can share the lessons learned from running this extensive experiments regarding the bottle neck of the domain, i.e. is it data, or compute, or lack of standardized benchmark (data + setup) or something else?

问题

Could you elaborate on the limitations of UrbanDiT in its current form? Specifically, what challenges do you foresee in transitioning from a foundational model to a practical application for urban stakeholders?
What were the criteria for selecting the four existing models listed in Table 1, and why were they not included in the results comparison?
Can you provide more insights into the unified prompt learning framework; such as, what semantic meanings, if any, can be derived from the entries in the memory pool?
What is the performance for zero-shot to completely new task, city, and feature?
What are the lessons learned from these extensive experiments, besides the architecture choice, that might be useful for future research?

撤稿通知

2025-01-15

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.