Response to W6 and Q4: We thank the reviewer for highlighting fairness and efficiency concerns. Below we respond to each point:

(1) Training Epochs

We would like to clarify that the difference in training epochs was not intended to create an unfair comparison. In our experiments, each method was trained with the hyperparameter settings provided in its paper or official implementation. Each method was trained until convergence based on validation performance, and different models exhibit different convergence behaviors.

Our model operates under a multi-city, multi-view setting, which introduces greater data heterogeneity than the single-city setting of the baselines. It therefore requires more training steps to reach convergence. The decision to train SMARTraj^2 for 70 epochs was not arbitrary; it was guided by the stabilization of the validation loss.
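As an illustration of what "guided by validation loss stabilization" can mean in practice, the following is a minimal sketch of a patience-based stopping rule. The function name, the `patience` and `min_delta` values, and the example loss curve are hypothetical and chosen only for illustration; they are not the exact criterion or data used in our experiments.

```python
from typing import Sequence


def convergence_epoch(val_losses: Sequence[float],
                      patience: int = 5,
                      min_delta: float = 1e-3) -> int:
    """Return the 1-indexed epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far by more than `min_delta` for `patience`
    consecutive epochs. If that never happens, the last epoch is returned.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Synthetic curve (not measured data): improves for 3 epochs, then plateaus.
example_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7]
print(convergence_epoch(example_losses, patience=3, min_delta=0.01))  # 6
```

Under a rule of this kind, a model whose validation loss plateaus early stops early, while a model trained on more heterogeneous data keeps improving for more epochs, so the epoch counts differ without any method being favored.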

To ensure fairness, we re-trained key baselines (e.g., MVTraj, JGRM) for up to 70 epochs: (a) Their performance did not improve significantly beyond 30 epochs, confirming their early convergence. (b) Training SMARTraj^2 for only 30 epochs led to notable performance drops, underscoring the need for extended training under more complex settings.

Since figures are not permitted in the rebuttal, we will provide convergence curves in the final version.

(2) Efficiency Trade-off and Broader Improvements

We recognize that SMARTraj^2 introduces higher computational cost due to its multi-view fusion and cross-city modeling. However, this design enables stronger generalization across diverse urban environments, which is the primary goal of our work.

While SMARTraj^2 has a longer inference time (11.65 ms vs. 4.22 ms), it achieves consistently stronger performance across all tasks (Table 1 in Sec. 4.2 and Table 4 in Appendix B.2.1).
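To put the latency gap in concrete terms, the short calculation below uses only the two inference times quoted above. The variable names are illustrative; `baseline_ms` stands for whichever baseline the 4.22 ms figure refers to.

```python
# Inference times quoted in this response (milliseconds).
smartraj_ms = 11.65
baseline_ms = 4.22

extra_ms = smartraj_ms - baseline_ms   # absolute overhead
slowdown = smartraj_ms / baseline_ms   # relative overhead

print(f"{extra_ms:.2f} ms extra latency, {slowdown:.2f}x the baseline")
# prints "7.43 ms extra latency, 2.76x the baseline"
```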

Moreover, this advantage becomes more pronounced in low-resource or cross-city scenarios. Unlike JGRM, which must be trained from scratch for each city, SMARTraj^2 can be pretrained on multiple source cities and then efficiently fine-tuned on a new city using limited data, thanks to its transferable representations. This not only improves generalization but also significantly reduces the total cost of training and data annotation when scaling to multiple cities.

We believe this is a reasonable and deliberate trade-off in scenarios where cross-city adaptability and robust generalization are more critical than marginal inference latency, such as nationwide logistics, urban planning tools, or multi-city ride-sharing platforms.

Nonetheless, we acknowledge that inference efficiency is important in latency-sensitive applications. We will include this limitation, along with future optimization directions (e.g., model distillation or lightweight architectures), in a dedicated "Limitation and Future Work" section.

(3) Performance Attribution: GPS vs. Architecture

We agree that the GPS view offers rich information. However, our ablation study demonstrates that the performance gain does not come solely from GPS inputs:

Method	Road Label Micro-F1 ↑	Road Label Macro-F1 ↑	Travel Time MAE ↓	Travel Time RMSE ↓	Destination Road Acc@1 ↑	Destination Road Acc@5 ↑	Destination Grid Acc@1 ↑	Destination Grid Acc@5 ↑
SMARTraj^2	0.8407	0.8298	35.0689	60.9156	0.7409	0.9069	0.6675	0.8392
w/o grid	0.8233	0.8186	72.6226	105.6123	0.6604	0.8402	X	X
w/o GPS	0.7987	0.7832	73.2965	106.1142	0.5446	0.7667	0.4110	0.6351
w/o route	X	X	74.5902	106.7897	0.5924	0.8049	0.4311	0.6636
w/o gating	0.8387	0.8279	40.3469	68.6216	0.6787	0.8478	0.5244	0.7200
w/o invariant+specific	0.7415	0.7268	56.5380	71.5127	0.4770	0.7001	0.2556	0.5315
w/o gating + specific	0.7637	0.7574	54.8681	84.8157	0.6300	0.8294	0.5466	0.7636

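Removing architectural components alone also degrades performance. Without gating, Travel Time MAE rises from 35.0689 to 40.3469 and Destination Grid Acc@1 falls from 0.6675 to 0.5244. Without the invariant and specific components, Road Label Micro-F1 falls from 0.8407 to 0.7415 and Destination Grid Acc@1 falls to 0.2556, which is lower than the 0.4110 obtained without GPS. This confirms that the architectural components are critical to the improvements, and that the performance cannot be attributed to GPS input alone.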
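The size of each effect can be read off the ablation table as a relative change against the full model. The sketch below copies a subset of the table's values; the dictionary keys and the helper `relative_change` are illustrative names, not part of our codebase.

```python
# Values copied from the ablation table above (subset of rows and columns).
full = {"road_micro_f1": 0.8407, "travel_mae": 35.0689,
        "dest_road_acc1": 0.7409, "dest_grid_acc1": 0.6675}

ablations = {
    "w/o GPS": {"road_micro_f1": 0.7987, "travel_mae": 73.2965,
                "dest_road_acc1": 0.5446, "dest_grid_acc1": 0.4110},
    "w/o gating": {"road_micro_f1": 0.8387, "travel_mae": 40.3469,
                   "dest_road_acc1": 0.6787, "dest_grid_acc1": 0.5244},
    "w/o invariant+specific": {"road_micro_f1": 0.7415, "travel_mae": 56.5380,
                               "dest_road_acc1": 0.4770, "dest_grid_acc1": 0.2556},
}


def relative_change(reference: float, value: float) -> float:
    """Percentage change of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


for name, scores in ablations.items():
    changes = {metric: relative_change(full[metric], scores[metric])
               for metric in full}
    summary = ", ".join(f"{m}: {c:+.1f}%" for m, c in changes.items())
    print(f"{name}: {summary}")
```

For MAE a positive change is a degradation, while for F1 and accuracy a negative change is a degradation.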

(4) Clarifying Our Model Design Philosophy

SMARTraj^2 is not intended as a lightweight or efficiency-oriented model, but rather as a generalizable, scalable solution for multi-city trajectory modeling. Its modular architecture is specifically designed to disentangle and adapt to diverse urban structures. We believe this approach is valuable for real-world applications where transferability and scalability outweigh marginal efficiency costs. Our contribution lies in this flexible and stable design, not in longer training or additional inputs.