Thank you very much for your detailed and constructive comments. Please find our responses to each of your concerns below.

W1: The novelty and contributions of our model

Cross-city flow generation is an emerging area with limited existing efforts. This is because flow generation itself is a challenging problem—it requires geographic features as input to map static spatial structures to dynamic flows. Moreover, the cross-city setting further requires the model to handle domain shifts in both static geographic features and dynamic flow patterns across different cities, even in a completely zero-shot manner.

Therefore, this area remains in its early stages and is still underexplored. Existing works primarily focus on specific types of data, such as origin-destination (OD) data [1] or cellular signal data [2], which often come with type-specific priors that simplify the task.

In contrast, we target the generation of general flow types, which is underexplored, and address two critical challenges: “severe domain shift” and “insufficient conditions”.

To address domain shift, our work is among the early efforts to introduce optimal transport theory into flow tasks, through the proposed GFA module. Compared with prior alignment approaches, GFA adaptively selects the most relevant source regions for each target region, enabling finer-grained alignment.
To address insufficient conditions, we adapted RAG for flow generation and developed the RCA module, which integrates dynamic flow information from source cities under zero-shot settings. Most prior works relied solely on static geographic features. RCA allows our model to adapt to more complex flow patterns, resulting in more accurate flow generation.

[1] A Large-scale Benchmark Dataset for Commuting Origin-destination Matrix Generation

[2] Deep Transfer Learning for City-scale Cellular Traffic Generation through Urban Knowledge Graph

W2 & Limitation 2: Detailed analysis of the retrieval augmented module (RCA)

For a given target region and timestamp, RCA retrieves the most similar historical flow segments from source cities based on two features: geographic similarity and temporal proximity.
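
The retrieval step can be sketched as a scoring problem over stored source-city segments. The snippet below is a simplified illustration, not the RCA implementation: the cosine similarity, the circular hour-of-day proximity, the equal weights, and the dictionary keys `geo`, `hour`, and `flow` are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two geographic feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def temporal_proximity(hour_a, hour_b, period=24):
    """1.0 for the same hour of day, 0.0 for hours half a period apart."""
    diff = abs(hour_a - hour_b) % period
    circular = min(diff, period - diff)
    return 1.0 - circular / (period / 2)

def retrieve(query_geo, query_hour, memory, k=2, w_geo=0.5, w_time=0.5):
    """Return the k stored flow segments with the highest combined score."""
    def score(entry):
        return (w_geo * cosine(query_geo, entry["geo"])
                + w_time * temporal_proximity(query_hour, entry["hour"]))
    ranked = sorted(memory, key=score, reverse=True)
    return [entry["flow"] for entry in ranked[:k]]

# Hypothetical source-city memory: one entry per (region, start hour).
memory = [
    {"geo": [0.9, 0.1], "hour": 8, "flow": [120, 140, 160]},
    {"geo": [0.9, 0.1], "hour": 20, "flow": [60, 50, 40]},
    {"geo": [0.1, 0.9], "hour": 8, "flow": [30, 35, 33]},
]
retrieved = retrieve([1.0, 0.0], 9, memory, k=1)
```

Setting `w_time=0` or `w_geo=0` in this sketch corresponds loosely to the "w/o time" and "w/o geo" ablation variants.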

We conducted ablations on each feature to assess their individual contributions. The results are shown in the table below. Specifically, we averaged the flow segments retrieved by RCA and measured their similarity to the ground-truth flow in the target city using Pearson, Spearman, and Kendall metrics.

Target City	Method		Inflow			Outflow
		pearson	spearman	kendall	pearson	spearman	kendall
CHI	RCA w/o time	0.330	0.314	0.215	0.330	0.316	0.217
	RCA w/o geo	0.606	0.692	0.507	0.599	0.680	0.495
	RCA	0.830	0.840	0.652	0.826	0.847	0.660
DC	RCA w/o time	0.280	0.270	0.188	0.268	0.249	0.172
	RCA w/o geo	0.573	0.624	0.451	0.575	0.629	0.453
	RCA	0.792	0.832	0.644	0.767	0.813	0.622
Toronto	RCA w/o time	0.205	0.207	0.143	0.210	0.214	0.147
	RCA w/o geo	0.707	0.737	0.549	0.700	0.726	0.541
	RCA	0.770	0.787	0.603	0.763	0.788	0.606
NYC	RCA w/o time	0.484	0.476	0.329	0.480	0.458	0.318
	RCA w/o geo	0.467	0.571	0.403	0.467	0.579	0.409
	RCA	0.865	0.845	0.657	0.867	0.848	0.660

Both geographic and temporal features contribute significantly to performance, with temporal proximity usually playing the more dominant role (NYC is the exception, where the two ablated variants perform comparably). This indicates that flow patterns across regions are more strongly influenced by temporal periodicity, while also being shaped by static geographic features.
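
For reference, the three similarity measures used in this ablation can be computed as in the sketch below. This is a plain-Python illustration, not our evaluation script; in particular, the Kendall variant shown is tau-a, which is an assumption and differs from tau-b only when ties are present.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average 1-based ranks, so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

# Hypothetical averaged retrieved segment vs. ground-truth flow.
retrieved_mean = [10.0, 20.0, 30.0, 25.0, 15.0]
ground_truth = [12.0, 18.0, 33.0, 30.0, 14.0]
scores = (pearson(retrieved_mean, ground_truth),
          spearman(retrieved_mean, ground_truth),
          kendall_tau_a(retrieved_mean, ground_truth))
```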

W3 & Limitation 2:

1) Can CRAFT utilize non-grid data?

Yes. In fact, the datasets used in our paper are in graph format, where each node is a grid cell. CRAFT is not limited to grid-based data. It only requires that each basic geographic unit, regardless of its shape, provides complete information on POIs, roads, and population. In practice, such units are usually implemented as grids.

2) Can CRAFT generalize to more diverse cities?

We have added experiments on the Traffic4Cast [3] dataset, which contains vehicle flow data spanning diverse countries across Europe and Asia and exhibits greater inter-city variation. In addition, Traffic4Cast offers larger-scale and finer-grained data than the datasets used in our main paper. Specifically, we selected Moscow and Berlin to evaluate CRAFT. The results are shown in the table below.

Target City	Method		Inflow			Outflow
		CPC	NMAE	NRMSE	CPC	NMAE	NRMSE
Berlin	GMEL	0.500	0.442	0.501	0.487	0.470	0.528
	DiT	0.436	0.353	0.444	0.453	0.494	0.571
	Diffwave	0.424	0.562	0.662	0.386	0.710	0.768
	DDPM	0.388	0.333	0.395	0.389	0.342	0.414
	CRAFT	0.537	0.312	0.381	0.537	0.313	0.381
Moscow	GMEL	0.182	0.357	0.431	0.254	0.345	0.418
	DiT	0.566	0.403	0.480	0.500	0.320	0.396
	Diffwave	0.170	0.397	0.469	0.486	0.530	0.594
	DDPM	0.338	0.334	0.409	0.334	0.334	0.409
	CRAFT	0.609	0.239	0.315	0.607	0.240	0.316

The experimental results show that CRAFT not only generalizes well to the Traffic4Cast dataset but also achieves larger improvements over the baselines than on our original datasets, highlighting its strong generalization capability. We sincerely thank you for the suggestion and will include these results in the revised version.
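
For readers less familiar with the metrics in these tables, the sketch below shows one common way to compute them. CPC follows its standard definition (common part of commuters); for NMAE and NRMSE, normalizing by the range of the ground truth is an assumption made here for illustration, since other normalizers such as the mean are also in use.

```python
import math

def cpc(pred, true):
    """Common Part of Commuters: 2 * overlap / (total predicted + total true)."""
    overlap = sum(min(p, t) for p, t in zip(pred, true))
    return 2.0 * overlap / (sum(pred) + sum(true))

def nmae(pred, true):
    """Mean absolute error divided by the ground-truth range (assumed normalizer)."""
    span = max(true) - min(true)
    mae = sum(abs(p - t) for p, t in zip(pred, true)) / len(true)
    return mae / span

def nrmse(pred, true):
    """Root mean squared error divided by the ground-truth range (assumed normalizer)."""
    span = max(true) - min(true)
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    return math.sqrt(mse) / span

# Hypothetical generated vs. observed flow values.
generated = [10.0, 20.0, 30.0]
observed = [12.0, 18.0, 30.0]
metrics = (cpc(generated, observed), nmae(generated, observed), nrmse(generated, observed))
```

Higher CPC is better (1.0 means identical flows), while lower NMAE and NRMSE are better.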

[3] NeurIPS traffic4cast 2021

Q1 & Limitation 1: Enriching geographical features via network topology and intersection connectivity

Thank you for your comment. We computed each node’s space syntax features[4], including degree, closeness centrality, and betweenness centrality, to capture explicit road topology and intersection connectivity. Results are shown in the table below, where CRAFT + syntax denotes CRAFT augmented with explicit topology and intersection connections.

Target City	Method		Inflow			Outflow
		CPC	NMAE	NRMSE	CPC	NMAE	NRMSE
CHI	CRAFT	0.785	0.140	0.216	0.786	0.140	0.216
	CRAFT+syntax	0.814	0.121	0.187	0.815	0.123	0.189
DC	CRAFT	0.815	0.158	0.240	0.816	0.159	0.240
	CRAFT+syntax	0.800	0.165	0.246	0.799	0.168	0.247
Toronto	CRAFT	0.804	0.178	0.267	0.804	0.179	0.268
	CRAFT+syntax	0.827	0.159	0.239	0.827	0.159	0.240
NYC	CRAFT	0.782	0.103	0.170	0.786	0.102	0.165
	CRAFT+syntax	0.779	0.124	0.194	0.775	0.113	0.188

The results show that space syntax features yield gains on the CHI and Toronto datasets but lead to slight performance drops on DC and NYC. We attribute this to the fact that DC and NYC have the simplest and the most complex road network structures, respectively, among the four datasets.

Specifically, flow magnitude is primarily determined by population size, while flow patterns are shaped by regional functionality. When cities have similar road network complexity, incorporating syntax features improves performance. However, when complexity differs, such features may impede cross-city spatial alignment. In our original paper, to ensure generalizability, we used only the most basic spatiotemporal features shared across all cities.
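
The three space syntax features mentioned above (degree, closeness centrality, betweenness centrality) can be computed on a road graph as sketched below. This is a self-contained illustration on an undirected, unweighted adjacency list, not our feature pipeline; the toy graph `roads` and the unnormalized betweenness are assumptions.

```python
from collections import deque

def bfs_distances(graph, start):
    """Hop distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def degree(graph):
    return {n: len(nbs) for n, nbs in graph.items()}

def closeness(graph):
    """(number of reachable nodes) / (sum of shortest-path lengths)."""
    result = {}
    for n in graph:
        dist = bfs_distances(graph, n)
        total = sum(dist.values())
        result[n] = (len(dist) - 1) / total if total else 0.0
    return result

def betweenness(graph):
    """Unnormalized betweenness via Brandes' algorithm (undirected, unweighted)."""
    score = {n: 0.0 for n in graph}
    for s in graph:
        stack, preds = [], {n: [] for n in graph}
        sigma = {n: 0 for n in graph}   # number of shortest paths from s
        dist = {n: -1 for n in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in graph}
        while stack:  # accumulate dependencies from the farthest nodes back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair is visited from both endpoints, so halve the totals.
    return {n: c / 2 for n, c in score.items()}

# Toy road graph: intersection B connects three dead-end streets.
roads = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
features = {n: (degree(roads)[n], closeness(roads)[n], betweenness(roads)[n])
            for n in roads}
```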

[4] Space Syntax. Environment and Planning B: Planning and Design

Q2 & W3 & Limitation 1: Can CRAFT handle sudden changes in flow patterns, such as a holiday surge?

Because existing public flow datasets lack information on incidents or extreme events, we constructed a dataset with sudden flow pattern changes based on the four datasets used in our paper.

Specifically, we kept all weekday (Monday to Friday) data in the training set, used 80% of the weekend and holiday data as the test set, and added the remaining 20% back into training. Additionally, we included a binary (0/1) event indicator to distinguish weekdays from weekends and holidays. We evaluated the impact of this event indicator on the RCA module’s performance, measured by similarity to the ground-truth flow in the target city using the CPC, NMAE, and NRMSE metrics. Results are in the table below:

Target City	Method		Inflow			Outflow
		CPC	NMAE	NRMSE	CPC	NMAE	NRMSE
CHI	RCA w/o event indicator	0.768	0.146	0.218	0.763	0.150	0.221
	RCA	0.800	0.123	0.190	0.802	0.123	0.189
DC	RCA w/o event indicator	0.799	0.164	0.235	0.798	0.166	0.239
	RCA	0.810	0.150	0.218	0.805	0.155	0.224
Toronto	RCA w/o event indicator	0.779	0.198	0.295	0.773	0.205	0.299
	RCA	0.783	0.188	0.279	0.788	0.185	0.275
NYC	RCA w/o event indicator	0.769	0.103	0.167	0.764	0.107	0.171
	RCA	0.789	0.092	0.149	0.790	0.093	0.149

Thank you for your detailed advice. The event indicator indeed improves the model’s ability to handle abrupt changes in flow patterns, such as holiday surges, thus enhancing CRAFT’s capacity for such scenarios. We will further extend this result and include it as a case study in the revised version.
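
The split described above can be sketched as follows. This is an illustrative reconstruction, not our preprocessing code: splitting at the granularity of whole days, the random shuffle with a fixed seed, and the helper name `build_event_split` are assumptions.

```python
import random
from datetime import date, timedelta

def build_event_split(days, holidays=frozenset(), test_ratio=0.8, seed=0):
    """Weekdays go to training; weekend/holiday days are split 80/20 test/train."""
    def is_event(d):
        return d.weekday() >= 5 or d in holidays  # Saturday=5, Sunday=6

    weekday_days = [d for d in days if not is_event(d)]
    event_days = [d for d in days if is_event(d)]
    random.Random(seed).shuffle(event_days)
    n_test = round(len(event_days) * test_ratio)
    test = event_days[:n_test]
    train = weekday_days + event_days[n_test:]
    # Binary event indicator attached to every day as a conditioning input.
    indicator = {d: int(is_event(d)) for d in days}
    return sorted(train), sorted(test), indicator

# Five weeks starting on Monday 2024-01-01, with that day marked as a holiday.
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(35)]
train_days, test_days, event_flag = build_event_split(
    days, holidays={date(2024, 1, 1)})
```

With this setup the test set contains only event days, while training sees all ordinary weekdays plus a small sample of event days, so the model must rely on the indicator to recognize the shifted pattern.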

Q3: The computational and memory costs when scaling to larger datasets

To evaluate computational cost, we measured the model’s per-epoch training and validation time and its memory usage on progressively larger portions of the data (20%, 40%, 60%, 80%, and 100%), as shown in the table below.

Data portion	Train time per epoch	Valid time per epoch	Avg memory (GB)	Peak memory (GB)
0.20	7.362	1.032	0.343	0.553
0.40	15.383	2.223	0.345	0.553
0.60	24.294	3.177	0.354	0.553
0.80	31.069	4.263	0.347	0.553
1.00	39.687	5.294	0.345	0.553

Next, with the data scale fixed, we gradually increased the sequence length and repeated the measurements. Results are reported in the table below.

Seq length	Model size (bytes)	Train time per epoch	Valid time per epoch	Avg memory (GB)
24	51781096	40.999	5.685	0.343
48	51830248	41.156	5.531	0.344
72	51879400	43.729	6.053	0.344
96	51928552	44.555	5.631	0.345
120	51977704	46.712	6.429	0.345
144	52026856	48.883	6.666	0.345
168	52076008	51.896	6.351	0.346

Combining the above tables, we observe that our model's computational cost grows approximately linearly with data scale and is only mildly affected by sequence length. Memory cost is largely insensitive to both data scale and sequence length.
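
The scaling trend can be checked directly from the reported numbers with an ordinary least-squares fit. The snippet below uses only the training times from the two tables; the helper name `linear_fit` is introduced here for illustration.

```python
def linear_fit(x, y):
    """Ordinary least squares y = slope * x + intercept, plus R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy * sxy / (sxx * syy)
    return slope, intercept, r2

# Train time per epoch vs. data portion (first table).
portion = [0.20, 0.40, 0.60, 0.80, 1.00]
train_time = [7.362, 15.383, 24.294, 31.069, 39.687]
slope, intercept, r2 = linear_fit(portion, train_time)

# Train time per epoch vs. sequence length (second table).
seq_len = [24, 48, 72, 96, 120, 144, 168]
seq_train_time = [40.999, 41.156, 43.729, 44.555, 46.712, 48.883, 51.896]
growth = seq_train_time[-1] / seq_train_time[0]
```

The fit against data portion has an R^2 above 0.99, consistent with linear growth, while a sevenfold increase in sequence length raises training time per epoch by roughly 27%.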