PaperHub
Average rating: 5.3 / 10 (Rejected; 4 reviewers; min 5, max 6, std 0.4)
Individual ratings: 5, 5, 5, 6
Confidence: 3.8 | Correctness: 2.8 | Contribution: 2.5 | Presentation: 3.0
ICLR 2025

Generalizing Dynamics Modeling Easier from Representation Perspective

OpenReview | PDF
Submitted: 2024-09-13 · Updated: 2025-02-05

Abstract

Keywords
Dynamics Modeling, Ordinary Differential Equations, Pre-trained Language Models

Reviews and Discussion

Official Review (Rating: 5)

The paper introduces a method for leveraging observations from multiple systems to create a generalized model that captures shared dynamics across these systems in a common latent space. Their method, PDEDER, builds on pre-trained language models, adapted to specific dynamic observations through tokenization and fine-tuning, which enables predictions of future dynamics. The authors evaluate their model on 18 dynamical systems, covering both long- and short-term forecasts.

Strengths

The paper is well-written and tackles a significant challenge in modeling time series obtained from multiple systems. It leverages recent advances in language models and emphasizes generalizability, which is an important quality for such models.

Weaknesses

Critical:

  1. I am concerned about the main assumptions the paper relies on. Are you assuming that different systems come from the same statistics or share fundamental dynamics? I would argue that systems from completely different worlds, time scales, and dynamical regimes should not necessarily be trained together when the overlap in their behaviors is minimal. There is an assumption the authors rely on that needs better quantification regarding how much the systems can differ in terms of dynamical regimes and time scales. Even within the same field and data modalities, recordings from different subjects often differ significantly, indicating that they should not always be learned together. Across fields, I am concerned this issue is more pronounced and must be addressed more carefully both mathematically and as a discussion with a clear list of assumptions.

  2. Regarding the previous point, in Line 044 you mention generalizability across domains. While models should indeed, broadly, be generalizable, they must also capture the unique characteristics of different domains. Therefore, I believe the trade-off between generalizability, expressivity, and interpretability needs to be more thoroughly addressed.

  3. Additionally, more papers should be discussed in the related work, including works on neural dynamics that leverage multi-session information via shared dynamical priors [1] or transformers [2] for inferring dynamics.

  4. I am also concerned about the interpretability of the model, which you barely discussed. When using deep learning for scientific purposes, we want to ensure that our model parameters and latent variables are interpretable. However, with the use of pre-trained language models and fine-tuning across observations, it seems the model lacks interpretability. How would the authors address this?

  5. It is not clear how you calculate the graphs (i.e., $\mathcal{G}$) for the real-world data. Is it known, or do you calculate it during pre-processing?

  6. It is unclear how robust the system is to hyperparameter choices. Please discuss or explore this.

  7. Additionally, it is unclear how you train/fit Eq. 3 in practice. Is it via EM or global optimizers? Please include an algorithm.

  8. Many results are presented in the supplementary materials. While I understand that the page limit sometimes necessitates this approach, the authors did not use the full page limit in this case. Why not include more results in the main text? I suggest incorporating additional results (e.g., from Appendix B) into the main text.

  9. I think an important question that needs discussion is how many observations you need for the method to succeed. I would assume that performance will improve rapidly with a low number of observations and then level off as more are added. Do you have any quantification of that?

Minor:

  1. In the abstract, you used "neural Ordinary Differential…" without capitalizing "N"; however, in the introduction, it’s capitalized. Please be consistent.

  2. Line 052: Change "where the dynamics can be easier captured" to "where the dynamics can be more easily captured."

  3. While the paper is well-written overall, the first paragraph of the related work section reads more like a list than a motivation to identify the research gap. I suggest rephrasing it to better highlight the gap and clarify what your method aims to address.

  4. Line 147: There’s a missing period after the ODE subtitle.

  5. Line 192: The fourth word (We) should not be capitalized.

  6. Lines 209-211: This content seems out of place and might fit better in the related work section.

  7. Line 247: Did you mean "serve" instead of "sever"?

  8. Line 370: The last word, "we", should be capitalized.

  9. Table 5: Why is the "%" sign only next to some numbers? Are all values percentages, or do those with "%" represent a different scale (1/100)? This needs clarification.

References:

[1] Mudrik, N., Ly, R., Ruebel, O., & Charles, A. S. (2024). Creimbo: Cross ensemble interactions in multi-view brain observations. arXiv preprint arXiv:2405.17395.

[2] Liu, R., Azabou, M., Dabagia, M., Xiao, J., & Dyer, E. (2022). Seeing the forest and the tree: Building representations of both individual and collective dynamics with transformers. Advances in Neural Information Processing Systems, 35, 2377-2391.

Questions

  1. What happens if some observations are from a different distribution than the others?

  2. What is the level of similarity you assume across systems? Please clarify the assumptions regarding the level of similarity required across different systems for your model to be effective. Are you suggesting that systems must share certain statistical properties or fundamental dynamics?

  3. What is the model's computational complexity, and how does it scale with the number of observations?

  4. How do you choose the system-specific parameters?

  5. Can you clarify the dimension of $\tilde{x}^{\text{in}}$? Is it equal to 1? Additionally, please specify the dimension of $W^s_{dp}$ in the Data Projection section.

  6. Why was the architecture explained in Eq. 2 chosen?

  7. Why did you choose the $\ell_1$ rather than the $\ell_2$ loss? (line 258).

Comment

Q1. About the assumption of sharing fundamental dynamics.

Thank you for this comment; however, there may be a misunderstanding. To explain, we would like to review the full story of this work. Developing a generalized model that can handle all dynamics, or at least many, is a fundamental research goal for dynamics modeling [1-2]. Learning fundamental dynamics or hidden characteristics for various dynamics is a rather basic yet tough problem. Therefore, we adopt an easier and lighter approach that first learns better representations for dynamics observations; these representations can then be used to more easily learn the hidden dynamics.

Therefore, we kindly argue that our main contribution lies in pre-training to learn better representations for downstream dynamics learning. Analogous to the usage of pre-trained language models, our pre-trained PDEDER concentrates on how to learn better dynamics-enriched representations, and the dynamics modeling module can be seen as analogous to the classification or regression head when fine-tuning a language model for downstream tasks. We want to highlight that, in the pre-training period, no system-specific dynamics are approximated; the pre-training process only learns better embeddings for the observed sequences. The interacting graphs are not considered in pre-training.

In this way, the generalizability of our model lies at the representation level rather than the dynamics learning level. After pre-training the embedder, we can learn generalizable embeddings for observations from any dynamical system, and these embeddings can be used to approximate dynamics by fine-tuning with any specific dynamics modeling method. In this way, our embedder pre-training process makes it easier to approximate dynamics models.

[1] Lomax, H., Pulliam, T. H., Zingg, D. W., and Kowalewski, T. A. (2002). Fundamentals of computational fluid dynamics. Appl. Mech. Rev., 55(4), B61-B61.

[2] Luenberger, D. (1977). Dynamic equations in descriptor form. IEEE Transactions on Automatic Control, 22(3), 312-321.


Q2. About the generalizability and interpretability.

To demonstrate the interpretability of our proposal, we adopt a white-box dynamics learner, SINDy, to learn dynamics on the observation embeddings. Details are presented in the 2nd response in the general comment.

| System | PDEDER short-term (MSE / MAE) | PDEDER long-term (MSE / MAE) | PDEDER+SINDy short-term (MSE / MAE) | PDEDER+SINDy long-term (MSE / MAE) |
|---|---|---|---|---|
| Mutualistic | 0.362 / 0.452 | 0.809 / 0.675 | 1.014 / 1.014 | 0.334 / 0.334 |
| Heat | 0.003 / 0.045 | 0.006 / 0.052 | 0.886 / 0.884 | 1.577 / 1.586 |
| 2D CFD | 0.223 / 0.303 | 0.152 / 0.236 | 1.001 / 0.984 | 1.139 / 1.164 |
| DarcyFlow | 0.001 / 0.020 | 0.001 / 0.021 | 0.858 / 0.851 | 1.103 / 1.104 |
| Gene | 0.035 / 0.136 | 0.076 / 0.172 | 0.613 / 0.537 | 0.783 / 0.783 |
| ShallowWater | 0.674 / 0.358 | 1.145 / 0.527 | 0.538 / 0.463 | 1.040 / 1.047 |
| 2D DiffReac | 0.960 / 0.723 | 1.057 / 0.794 | 0.126 / 0.126 | 0.807 / 0.808 |
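
For concreteness, a minimal sketch of what fitting a white-box learner such as SINDy on pre-trained embeddings could look like, using the pysindy package. The embedding array, file name, sampling interval, and library/optimizer settings below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
import pysindy as ps

# Z: embedding trajectory of one system from the pre-trained encoder, shape (T, d).
# The file name and the sampling interval dt are placeholders.
Z = np.load("embeddings.npy")
dt = 0.1

# Fit a sparse polynomial ODE  dz/dt = Theta(z) Xi  directly in the embedding space.
model = ps.SINDy(
    feature_library=ps.PolynomialLibrary(degree=2),
    optimizer=ps.STLSQ(threshold=0.05),
)
model.fit(Z, t=dt)
model.print()  # inspect the recovered terms and their weights for interpretation

# Roll the identified dynamics forward from the last observed embedding.
Z_pred = model.simulate(Z[-1], t=np.arange(0.0, 5.0, dt))
```

Inspecting the printed terms (rather than only the forecast error) is what would support the interpretability claim discussed below.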

Q3. How to calculate the graph $\mathcal{G}$ for real-world data.

R3. Thanks for pointing out this missing information. For LA, SD, PEMS03, PEMS04, PEMS07 and PEMS08, the corresponding graph structures are provided by the original datasets. As for NYCTaxi, CHIBike, TDrive and NOAA, we calculate the graph structure from the distances between the provided latitude/longitude or grid coordinates of the observation stations. We added these details to Appendix A.
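
As an illustration of such a coordinate-based construction, one common recipe is a thresholded Gaussian kernel over pairwise station distances. The kernel width and threshold below are illustrative placeholders; the exact rule per dataset is the one described in Appendix A:

```python
import numpy as np

def distance_graph(coords, sigma=10.0, threshold=0.1):
    """Build an adjacency matrix from station coordinates (lat/lon or grid indices).

    coords: array of shape (N, 2). sigma and threshold are hypothetical values.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)            # pairwise Euclidean distances
    adj = np.exp(-(dist ** 2) / (2 * sigma ** 2))   # Gaussian kernel weights
    adj[adj < threshold] = 0.0                      # drop weak links to sparsify
    np.fill_diagonal(adj, 0.0)                      # no self-loops
    return adj
```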

Comment

Q4. Robustness on the hyper-parameters.

We discuss and explore the robustness from two aspects: model-related and data-related hyper-parameters.

As for model-related hyper-parameters, the learning rate should not be as small as those typically used when directly fine-tuning a pre-trained language model on downstream tasks (e.g., 1e-7 or 1e-8). We finally chose 1e-3 for pre-training. The most likely reason is that the optimal embedding space for learning dynamics differs to some extent from that of language modeling. In our early attempts, we indeed observed that pre-training with smaller learning rates performed quite poorly, leading to rather poor fine-tuning results.

As for data-related hyper-parameters, the choices of patch length and stride are essential. In our early attempts, we found that the pre-training process is quite insensitive to these two parameters. Therefore, we chose moderate lengths following [5]. The fine-tuning process is also insensitive to them. Due to time limitations, we applied the sensitivity studies to the fine-tuning process. Detailed results of the sensitivity analysis are presented in our updated paper (see Fig. 4 in the Appendix). We will add more discussion of the sensitivity analysis in a future version.


Q5. About how to train the objective in Eq.3. Please include an algorithm.

Thanks for pointing out this missing part. We added a model-training section and an algorithm to clarify the overall pre-training and fine-tuning processes. The additional description is listed below and has also been added to the PDF version (see Section 4.4 and Algorithms 1 and 2).

Model Training. We first pre-train PDEDER on all collected dynamics observations (without graphs) with Eq. 3 for $E_p$ epochs. To handle the massive observations and varying numbers of samples across systems, we randomly choose 10 dynamics systems for each training round and train PDEDER for 5 epochs with all the observations from these systems. When learning a specific dynamics, we fine-tune PDEDER with Eq. 6 for $E_f$ epochs. The training details are presented in Algorithms 1 and 2.
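
For readers without access to Algorithms 1 and 2, a minimal sketch of the round-based schedule described above. The dataset container, the embedder interface, and the loop bounds are placeholders rather than the released code:

```python
import random
import torch

def pretrain(embedder, datasets, rounds, systems_per_round=10, epochs_per_round=5, lr=1e-3):
    """datasets: dict mapping system name -> iterable of (x_in, x_future) batches (hypothetical)."""
    opt = torch.optim.Adam(embedder.parameters(), lr=lr)
    for _ in range(rounds):
        # Randomly pick a subset of systems each round to balance dataset sizes.
        chosen = random.sample(list(datasets), k=min(systems_per_round, len(datasets)))
        for _ in range(epochs_per_round):
            for name in chosen:
                for x_in, x_future in datasets[name]:
                    x_rec, x_fcst = embedder(x_in)  # reconstruction and forecasting heads
                    # Eq.3-style L1 objective on reconstruction + forecasting.
                    loss = (x_rec - x_in).abs().mean() + (x_fcst - x_future).abs().mean()
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
    return embedder
```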


Q6. About the full page limitation.

Following your suggestion, we reformatted our paper to use the full page limit.


Q7. How many observations for the method to succeed.

We kindly argue that fine-tuning our pre-trained PDEDER on all systems is essential for examining the effectiveness of the pre-training process. Specifically, we set up several model variants of PDEDER, such as fine-tuning without our pre-training process or fine-tuning while freezing our pre-trained PLM parameters. In this way, we can explore the exact effect of pre-training on each system.


Q8. About the minor weaknesses.

Thanks for your careful comments. Following your suggestions, we corrected all the mistakes and carefully checked the rest of the paper for further issues.

  • Due to time limitations, we will rewrite the whole related-work passage on dynamics modeling to make it clearer in a future version.

  • We rewrote the data projection paragraph. Details are presented in the general comment and in the response to Q12.

  • The missing period was caused by the automatic LaTeX formatting. We will reformat the paper layout as you suggested.

  • "%" denotes that these numbers are on a different scale (1/100): rather small values are reported as percentages. For example, the value "0.067%" corresponds to 0.000673286. We added the description "% denotes the results are scaled by 1/100." to the table captions in our updated version.


Q9. Distribution differences of observations.

In both the pre-training and fine-tuning processes, we use an instance normalization layer, following [6], to handle distribution shifts.
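
For reference, a minimal sketch of the RevIN-style instance normalization referred to here, in the spirit of [6]: each series is normalized by its own statistics before the model and de-normalized afterwards. This is not the authors' exact layer; shapes and the epsilon value are assumptions:

```python
import torch

class InstanceNorm:
    """Per-instance normalization for a batch of series of shape (batch, length, channels)."""

    def __init__(self, eps=1e-5):
        self.eps = eps

    def normalize(self, x):
        self.mean = x.mean(dim=1, keepdim=True)
        self.std = x.std(dim=1, keepdim=True) + self.eps
        return (x - self.mean) / self.std

    def denormalize(self, y):
        # Applied to the model output to restore the original scale.
        return y * self.std + self.mean
```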


Q10. How to choose the system-specific parameters.

Following [7], we directly adopt the default parameters for the systems it covers. For other systems, we consult their original papers and choose parameters close to the defaults.


Q11. The dimension of $\tilde{x}^{(in)}$.

The details about data projection, including the dimension of $\tilde{x}^{(in)}$, are presented in the general comment above and have been edited into our latest version.


Q12. The choice of architecture in Eq.2.

Apart from the basic encoder and decoder of a PLM, we adopt a convolutional layer for encoding and a flatten-and-linear layer for decoding following previous works [6,8].


Q13. The choice of objective function.

We chose the $\ell_1$ loss following NDCN [9], one of the most representative neural dynamics modeling methods. We also tried the $\ell_2$ (MSE) loss in our empirical studies, and the results showed little difference.


Comment

Q14. Computational complexity and scaling on number of observations.

The computational complexity consists of the following parts: data projection, encoding by the convolutional layer, fine-tuning with the LM, decoding by the linear layer, and the integration approximation in the Neural ODE. The detailed computational complexity is $O(M(s(N^2H+NH^2)+2NL_p^2V+3HL_pN-2HL_p+LH+LH^2))$, where $M$ denotes the number of observation sets, $N$ the number of objects in one system, $L_p$ the patch length, $V$ the system-specific dimension, $H$ the hidden dimension of the PLM, $L$ the number of layers in the PLM, and $s$ the number of solving steps in the Neural ODE. The complexity is linear in the number of observation sets and quadratic in the number of objects, which is caused by the GNN layer in the Neural ODE. In our practical studies, the runtime of our proposal is faster than the baselines, as introduced above (see Q5 in Response to Reviewer #1 (B9zD) (3/3)).


[5] Zhou, T., Niu, P., Sun, L., Jin, R., et al. (2023). One fits all: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems, 36, 43322-43355.

[6] Zhou, T., Niu, P., Sun, L., Jin, R., et al. (2023). One fits all: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems, 36, 43322-43355.

[7] Takamoto, M., et al. (2022). PDEBench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35, 1596-1611.

[8] Chang, C., Peng, W.-C., & Chen, T.-F. (2023). LLM4TS: Two-stage fine-tuning for time-series forecasting with pre-trained LLMs. arXiv preprint arXiv:2308.08469.

[9] Zang, C., & Wang, F. (2020). Neural dynamics on complex networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 892-902.


We hope to hear back from you if you have further questions.

Comment

Dear Authors,

First, I apologize for my delayed response and thank you for your patience. I appreciate your comments and the additional changes you made to the PDF.

  1. In your response, you wrote that developing a generalized model "that can handle all dynamics, or at least many, can be a fundamental research goal for dynamics modeling [1-2]". This is exactly the issue I raised. Regarding the usage of "all dynamics": I assume your approach is tailored towards observations with noise that follows a normal distribution? What are your assumptions about the statistics and the nature of the dynamics? The phrase "all dynamics" is too vague. For instance, if the data comes from a Poisson distribution, will the model still work? Or if it includes high-frequency noise (e.g., LFP recordings)? It is important to clearly state the statistics and conditions of the dynamics you are focusing on. The references you cited address fluid and automatic control dynamics, which have their own properties, but may not represent all dynamics.

  2. Regarding interpretability, while applying SINDy to the embeddings is an interesting idea, I do not see how the table you provided teaches us about the interpretability or goes beyond error quantification. A key advantage of SINDy is its ability to decompose a system into basic functions or components—not just considering reconstruction accuracy as the primary metric. Could you analyze the SINDy weights and provide some interpretations? I understand that interpreting complex, multi-way data is difficult, but offering insights into the components is critical when applying this to scientific problems.

  3. It seems like you also skipped my critical point 3. If your approach is not intended for neural dynamics, that makes sense, but it should be explicitly stated. Otherwise, please explain why neural dynamics or other works from the field are not relevant.

  4. If the graphs are required or need to be calculated from labeled data, this should be stated explicitly in the text. Many real-world dynamical datasets lack pre-defined graphs or labels for graph construction, so this need should be acknowledged clearly.

  5. On Q8, why not use standard scientific notation (e.g., $10^{-3}$) then?

  6. You also discussed the instance normalization layer. While it addresses certain issues, what happens if the data follows a Poisson distribution? Again, I do not expect a method to handle every type of data, but you should explicitly outline the statistical assumptions about the data. Hence this does not answer the question "What happens if some observations are from a different distribution than the others?".

  7. I do not understand the response to "Q7. How many observations for the method to succeed", or that is not what I asked. Assume you have only one dataset for pre-training, and then you apply it to a different dataset for fine-tuning. I assume the advantage of pre-training on the first dataset depends on how similar the two datasets are. As the number of datasets for pre-training increases, the pre-trained model should become more robust. There must be some assumption about the relationship between the number of datasets, their similarity, and how similar they are to the datasets you aim to apply them to later. What is the scale of the number of datasets needed for training to achieve a robust pre-trained model, and how does it change with more or fewer datasets? Even if you do not provide an experiment for that, I believe some explanation of this effect is important.

  8. While including L1 regularization in the cost function makes sense, it is important to explain (also in the paper) why you chose it, as L1 promotes sparsity (unlike L2), which is probably the reason you observed different results under these penalties.

Comment

Thank you so much for your patient comment.

Q1&6. About the fundamental dynamics and dataset statistics assumptions.

We are sorry that we didn't present our motivation clearly in our latest response. We kindly argue that we don't aim to learn a fundamental dynamics model that handles all dynamics, which is a rather tough and challenging problem. Therefore, we didn't make assumptions on the observation statistics. Rather than explicitly learning the shared governing basic dynamics, we want to learn dynamics in an easier and lighter way, which learns generalizable embeddings to better approximate dynamics in downstream tasks.


Q2. About analyzing the SINDy weights.

Thanks for your advice. We will add this empirical study in our future version.


Q3. About adding discussions.

We are sorry for missing this question. We will add more discussion about the related work as you suggested. Besides, we kindly argue that any dynamics modeling method could be appended to our proposed PDEDER for fine-tuning to learn certain dynamics, including both neural and shallow, discrete and continuous methods, etc.


Q4. About graph calculation.

Thanks for your advice. We will add specific details about how to calculate graph for each system. For all systems we adopted, the graphs are either provided by the original dataset, or calculated by a grid network of each object node.


Q5. About the formats of small values.

Thanks for your advice. We will reformat the results in the standard scientific notation in the next version.


Q7. About the pre-training datasets scale.

Thank you so much for your detailed comment. We are sorry for misunderstanding this question. In our early attempts, we designed several empirical studies to examine the effect of the dataset scale used in pre-training. Due to time limitations, we only examined the effect of leaving one system out and leaving one parameter out. The results show that using more pre-training datasets performs better, and the two leave-one-out versions also show competitive performance. We will try to explore this further in a future version.


Q8. About the chosen of objective function.

The L2 loss may be more sensitive to outlier values compared with the L1 loss, and may lead to less penalty on normal values, especially for the realistic systems, where the ground truth after de-normalization may be rather large.


Thank you so much for your valuable comments, which help us a lot to improve our paper. We will revise our paper carefully as you suggested.

Official Review (Rating: 5)

This paper introduces PDEDER (Pre-trained Dynamic EncoDER), a method to generalize dynamics modeling by embedding original states into a latent space using pre-trained language models. The key idea is to pre-train an encoder using language models that can embed states from various dynamic systems into a latent space where dynamics can be more easily captured and fine-tuned for specific systems. The key contribution is a framework that pre-trains on 153 sets of observations from 24 complex systems, followed by fine-tuning for specific dynamics. The method employs data projection, tokenization, and PLM-based encoding to learn dynamics-enriched embeddings. The authors evaluate on 18 dynamic systems through long/short-term forecasting under in- and cross-domain settings.

Strengths

  • Novel approach using pre-trained models for dynamics modeling generalization
  • Comprehensive dataset collection across multiple domains
  • Clear empirical validation through both in-domain and cross-domain experiments
  • Strong cross-domain performance after fine-tuning, even when excluding entire systems during pre-training

Weaknesses

  • Difference between short-term and long-term forecasting in results is not defined
  • The connection between the pre-training objectives and dynamics modeling is not formally established
  • Experimental limitations:
    1. The ablation studies don't fully isolate the contribution of each component
    2. Comparison with TANGO is limited to small-scale systems due to memory constraints. It would be useful to provide a memory complexity analysis and discuss potential solutions for scaling to larger systems
    3. The selection of 24 systems lacks justification for representativeness
  • Technical details requiring clarification:
    1. How is the data projection module's dimension chosen?
    2. What is the impact of different PLM architectures?
    3. How sensitive is the method to tokenization parameters?
  • The core idea of using pre-trained models for time series/dynamics has been explored in recent works (as cited in the Related Work section, e.g., Gruver et al. 2024, PromptCast, AutoTimes). While this paper applies it to dynamics modeling in a new way, the conceptual novelty is incremental.

Minor comment:
  • Memory requirements should be discussed earlier in the paper

Questions

  • Why was T5 chosen as the base PLM? Have other architectures been considered?
  • The data projection module reduces dimensionality to 1, but the justification for this choice isn't clear. Could this limit the model's expressiveness for complex systems?
  • The data projection module seems crucial for handling different state dimensions. How do you determine the optimal projection dimension?
  • The tokenization process (patch length 30, stride 6) seems to work well empirically, but how sensitive is the model to these choices? Some analysis of this would be valuable.
  • Could you provide theoretical analysis/insights on why the pre-training objectives (reconstruction and forecasting) are sufficient for learning dynamics-enriched embeddings?
  • How does the choice of T5 as the base PLM impact results? Have you tried other architectures?
  • For the cross-domain experiments, how do you ensure the held-out systems are sufficiently different from the training systems? What properties of the dynamics are preserved in the embedding space?
Comment

Q1. Difference between short-term and long-term forecasting

Thanks for pointing out this issue. The differences are presented below. We also rectified the corresponding descriptions in our paper (see Implementation Details in 5.2).

Short/Long-term Forecasting. The training sequence lengths are the same for both short- and long-term forecasting. For NYCTaxi, CHIBike, TDrive, PEMS03, PEMS04, PEMS07, PEMS08 and NOAA, the short- and long-term forecasting lengths are set to {24, 48} and {96, 192, 336, 720}, respectively. For the remaining dynamics, due to the diverse convergence characteristics of each system, we truncate the test sequences by ratios to form the short/long-term forecasting sequences. The ratios for short- and long-term forecasting are set to {10%, 20%} and {50%, 70%, 80%, 100%}, respectively. For example, when the test sequence length is 200, we set 10% * 200 = 20 and 20% * 200 = 40 as the forecasting lengths.
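
A small helper illustrating how the ratio-based horizons above translate into concrete forecasting lengths; this is an illustrative utility, not code from the paper:

```python
def forecast_lengths(test_len, short_ratios=(0.10, 0.20), long_ratios=(0.50, 0.70, 0.80, 1.00)):
    """E.g. test_len=200 -> short [20, 40], long [100, 140, 160, 200]."""
    short = [int(round(r * test_len)) for r in short_ratios]
    long = [int(round(r * test_len)) for r in long_ratios]
    return short, long

print(forecast_lengths(200))  # ([20, 40], [100, 140, 160, 200])
```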


Q2. The connection between the pre-training objectives and dynamics modeling.

Thanks for pointing out this issue. We added a section describing the model training processes in detail and algorithms clarifying the connection between the two processes (see Section 4.4 and Algorithms 1 and 2).


Q3. Ablation study don't fully isolate the contribution of each component.

We are grateful that this issue was pointed out. We renamed this section from "Ablative Study" to "Impact Evaluation of Pre-training on Downstream Dynamics Modeling". In this section, we examine the impact of our pre-training process on downstream dynamics modeling by setting two different initialization strategies for the en/decoder parameters during fine-tuning. The two versions correspond to 1) fine-tuning PDEDER without pre-training and 2) fine-tuning PDEDER while freezing our pre-trained en/decoder.


Q4. Memory complexity discussions.

We kindly argue that our proposal requires less memory compared with the baseline methods. For example, fine-tuning on Mutualistic requires less than 2500 MB of memory with patch length 50 and stride 10.


Q5. The selection of 24 systems lacks justification for representativeness.

We collected systems that are representative and commonly used in dynamics modeling. For variety, we collected systems from various domains, including physics, fluid dynamics, biology, climate, and traffic. These systems have been widely used in dynamics modeling research.


Q6. Data projection module's dimension

We modified this section in our latest version, and details are presented in the general comment above.


Q7. Sensitivity of tokenization parameters (patch size and stride).

We added sensitivity analysis on patch length and stride in our latest version. Details are presented in the general comment above and updated in our latest PDF version (see Figure 4 in Appendix).


Q8. The conceptual novelty is incremental against the pre-trained models for time series forecasting.

We kindly argue that our main contribution lies in pre-training to learn better representations for downstream dynamics learning, rather than directly fine-tuning the pre-trained models. Our PDEDER concentrates on how to learn better dynamics-enriched representations, and the dynamics modeling module can be seen as analogous to the classification or regression head when fine-tuning a language model for downstream tasks. We want to highlight that, in the pre-training period, no system-specific dynamics are approximated; the pre-training process only learns better embeddings for the observed sequences. The interacting graphs are not considered in pre-training.

In this way, the generalizability of our model lies at the representation level rather than the dynamics learning level. After pre-training the embedder, we can learn generalizable embeddings for observations from any dynamical system, and these embeddings can be used to approximate dynamics by fine-tuning with any specific dynamics modeling method. In this way, our embedder pre-training process makes it easier to approximate dynamics models.


Q9. The reason of choosing PLM architectures. other PLMs.

We chose a PLM with both an encoder and a decoder, in line with the basic idea of embedding the original states into a latent space and learning dynamics in it. It could be substituted with any other suitable PLM, even one without a decoder, though this may lead to posterior collapse. We will try to explore the effects of PLM architectures on downstream dynamics modeling in our future work.

Comment

Q11. For the cross-domain experiments, how do you ensure the held-out systems are sufficiently different from the training systems?

When generating observations, we ensure the differences by setting quite different system-specific hyper-parameters, numbers of objects, and sequence lengths. Dynamics behaviors strongly depend on the system-specific hyper-parameters, and we indeed observed obvious distinctions between systems with different parameters.


Q12. Theoretical analysis/insights on why the pre-training objectives (reconstruction and forecasting) are sufficient for learning dynamics-enriched embeddings. What properties of the dynamics are preserved in the embedding space?

We mainly focus on the forecasting ability in the embedding space. Apart from the basic reconstruction task, enhancing the forecasting capacity is essential for extracting the hidden evolving regularity of the dynamics. Forecasting directly reflects how well the model describes a specific dynamics and is a simple-yet-essential validation strategy for measuring the quality of an approximated dynamics. It is also one of the most commonly used capabilities in real-world applications of dynamics models. Therefore, the pre-training objectives intuitively make sense for learning dynamics-enriched embeddings.


We hope to hear back from you if you have further questions.

Comment

Thank you for the thorough response to my review. I cannot increase my score of 5, as I still have several concerns which are not addressed:

  1. The new MRAE results (>100% errors in many cases) indicate more limited practical effectiveness than MSE/MAE metrics initially suggested.
  2. The data projection module's flattening approach still lacks theoretical/empirical justification. Your response describes the method but does not explain why dimension reduction to 1 is appropriate.
  3. Memory complexity analysis needs to be more comprehensive beyond just a single example.
  4. Renaming "Ablative Study" doesn't address the need to isolate component contributions properly.
  5. The theoretical foundation linking pre-training objectives to effective dynamics-enriched embeddings remains weak.

While you've improved documentation and added experiments, these fundamental issues affect the paper's potential impact on the field.

Comment

Thank you so much for your professional and valuable comments.

Q1. About the extremely large MRAE.

R1. In the past few days, we carefully compared the source code and experimental settings of our method and the baselines that adopt T-Drive and other realistic datasets as benchmarks. We found that the extremely large MRAE is caused by how missing values (stored as 0 in the dataset) are handled. For example, more than 20% of the data points are missing and stored as 0 in T-Drive, CHIBike and NYCTaxi. The original baseline methods masked these missing targets and therefore report non-abnormal results, while we did not mask the missing values and computed MRAE as $\mathrm{mean}\left(\frac{|\hat{y}-y|}{|y+10^{-8}|}\right)$. Therefore, the existence of zero targets leads to rather large results. To solve this, following STGODE, we masked the missing values and re-computed the MRAE results. Due to time limitations, we present the detailed results of our model variants on T-Drive, CHIBike and NYCTaxi below.

| Dataset | PDEDER | PDEDER-nopre | PDEDER-frz |
|---|---|---|---|
| TDrive | 0.395 | 0.341 | 0.425 |
| CHIBike | 0.744 | 0.712 | 0.702 |
| NYCTaxi | 0.401 | 0.364 | 0.428 |
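
For clarity, a minimal sketch of the masked MRAE computation described above. The masking rule follows the stated convention that missing targets are stored as 0; the function and variable names are illustrative:

```python
import numpy as np

def masked_mrae(y_pred, y_true, eps=1e-8):
    """Mean relative absolute error, ignoring targets recorded as 0 (missing values)."""
    mask = y_true != 0
    err = np.abs(y_pred[mask] - y_true[mask]) / (np.abs(y_true[mask]) + eps)
    return err.mean()
```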

Q2. About the data projection module.

We kindly argue that the data projection module mainly acts as a prefix to align the data dimensions across different systems. In fact, the patched tokens could be projected into any dimension by any proper layer. Here we choose a simple linear layer to project them into one token for generalizable downstream learning.


Q3. About the memory complexity.

The memory complexity is $O(P_{dp}+P_c+P_e+sH+P_g+P_m\cdot(P_d+P_r))$, where $P_{dp}, P_c, P_e, P_d, P_g, P_r$ denote the numbers of model parameters of the data projection module, the convolutional module, the PLM encoder, the PLM decoder, the GNN module and the reconstruction module, respectively, and $P_m$ denotes the number of patches.


Q4. About the ablative study.

We kindly argue that this part is not a standard ablation study in the strict sense; rather, we aim to examine the effect of pre-training on downstream dynamics modeling. Therefore, we examine this by comparing against the model variants of fine-tuning without pre-training and fine-tuning while freezing the pre-trained encoder/decoder.


Q5. About theoretical foundation of the pre-training objectives.

We kindly argue that we mainly concentrate on enhancing the forecasting capacity when learning dynamics in this early attempt. We will try to incorporate dynamics-specific objectives in our future work.


Thank you so much for your valuable comments, which help us to improve our paper. We will carefully revise our paper as you suggested in a future version!

Official Review (Rating: 5)

The authors propose a neural system, the Pre-trained Dynamic Encoder (PDEDER), which is pre-trained on observations from various dynamic processes that unfold on graphs. They evaluate this model on a variety of long- and short-term forecasting tasks, both within domains (in-domain) and across domains (cross-domain).

Definition of benchmark: 153 sets of observations from different dynamical systems: 122 synthetic datasets from 14 dynamical systems and 31 real-world datasets from 10 dynamical systems.

  • Each dynamical system has $M_s$ sets of observations with different parameters. Observations are multivariate, capturing data for each node in the system across different time steps.
  • Dynamical systems include: Springs, Mutualistic interactions, Heat diffusion, various Fluid dynamics, Biology, Climate, and Traffic, with system sizes ranging from 5 to 1024 nodes, timestamps from 100 to 28,000 steps, dimensionality from 1 to 10, sample counts from 1 to 10,000, and varying hyperparameters (from 1 to 15) for each dynamical system.
  • Each set of observations is temporally divided into in-sample and out-of-sample portions. PDEDER is trained to reconstruct in-sample data and forecast out-of-sample data.

Tokenization: to handle observations of various lengths, sub-observations are created with a fixed patch length and number of patches using a specific stride length R. Additionally, Gaussian noise and instance normalization are applied to each patch.
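
A minimal sketch of this patch-based tokenization; the default patch length, stride, and noise level below are illustrative placeholders (patch length 30 and stride 6 are mentioned elsewhere in the discussion), and this is not the authors' implementation:

```python
import torch

def patchify(x, patch_len=30, stride=6, noise_std=0.0):
    """Split a series of shape (channels, T) into overlapping patches of shape
    (channels, n_patches, patch_len); optionally add Gaussian noise as augmentation."""
    patches = x.unfold(-1, patch_len, stride)
    if noise_std > 0:
        patches = patches + noise_std * torch.randn_like(patches)
    return patches
```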

Authors use system-specific linear projection layer.

Learning Process: The projected data is used to reconstruct input states and to perform forecasting using a pre-trained language model (T5 model). The model architecture includes a convolutional layer for encoding, a PLM encoder, and a PLM decoder with additional linear adapters to aid in reconstruction and prediction. The loss function is an L1 loss applied to both the reconstruction and prediction components.

To model the evolution of dynamics, the authors employ a Graph Neural Network (GNN) with a single-layer normalized Laplacian and a trainable linear layer to encode the infinitesimal changes in the system state. Integration is then applied to derive the evolution of the hidden state, which is decoded by the decoder component of PDEDER.
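
For concreteness, a minimal sketch of such a graph-ODE dynamics function, using torchdiffeq for the integration. The Laplacian, hidden dimension, and initial latent state are placeholders, and this is a sketch rather than the authors' implementation:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumes the torchdiffeq package is installed

class GraphODEFunc(nn.Module):
    """dz/dt = Linear(L_norm @ z): a single normalized-Laplacian GNN layer with a trainable linear map."""

    def __init__(self, laplacian, hidden_dim):
        super().__init__()
        self.register_buffer("laplacian", laplacian)   # (N, N), fixed normalized Laplacian
        self.linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, t, z):                           # z: (N, hidden_dim)
        return self.linear(self.laplacian @ z)

# z0: initial latent states from the encoder (placeholder); t: query time points.
# z_traj = odeint(GraphODEFunc(L_norm, hidden_dim), z0, t)  # decoded afterwards by the PLM decoder
```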

They use 4 baselines: NDCN (Zang & Wang, 2020a), ST-GODE (Fang et al., 2021), MT-GODE (Jin et al., 2022) and TANGO (Huang et al., 2024b).

Authors show results for short term/long term forecasting, in-domain and cross-domain setting.

As an active researcher in this field, it is challenging to assess the validity of the model without visual representations of the dataset’s dynamics. Showing figures that capture these dynamics would greatly improve the clarity and reliability of the experimental evaluation.

Strengths

  1. Work with large number of dynamical systems
  2. Building a single model (modulo the projection layer) that aims at reconstructing and forecasting dynamical systems is very hard.

Weaknesses

The experimental setting is obscure and not written clearly (see and address the questions on baselines, evaluation metrics, and visualization of time-series forecasts vs. ground truth).

Evaluation metrics: error metrics such as MSE and MAE may not fully capture the performance of models on dynamical systems, as they might obscure certain dynamics-specific behaviors. Baselines should be better anchored to the dynamics with a simple, interpretable baseline for comparison. Try to include the mean relative absolute error, $\mathrm{Mean}\left[\left|\frac{\hat{y}-y}{y}\right|\right]$. Add another simple baseline for dynamics, e.g., a prediction equal to the last value plus a numerical estimate of the derivative, plus some time series baselines.
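
To make the suggested baseline concrete, a sketch of the "last value plus estimated derivative" extrapolation; this is an illustrative rendering of the reviewer's suggestion, not something taken from the paper:

```python
import numpy as np

def naive_derivative_forecast(history, horizon):
    """Extrapolate linearly: last observed state plus a one-step finite-difference slope.

    history: array of shape (T, ...) with T >= 2; returns an array of shape (horizon, ...).
    """
    last = history[-1]
    slope = history[-1] - history[-2]
    steps = np.arange(1, horizon + 1).reshape((-1,) + (1,) * last.ndim)
    return last + steps * slope
```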

The baselines used are focused on GNN-type models for dynamical systems.

Why should a language model like T5 be used for dynamical systems? If I am right, you have re-used a language model? Provide more intuition for why you believe this has valid grounds, e.g., what kind of biases for transfer learning do you see in this pre-trained model?

Questions

  1. Forecasting Task Visualizations: It would be helpful to include figures illustrating the in-domain and cross-domain forecasting tasks. Specifically, showing a time series up to a certain point in time, followed by model forecasts alongside the ground truth values, would provide valuable insights.

    Limitations of Tables for Dynamical System Predictions: Tables alone do not clarify which dynamical regimes are being predicted. There is a possibility that only simple, easily predictable regimes are being tested. Visuals displaying different regimes in the time series would help clarify the difficulty of the forecasting tasks being evaluated.

  2. Use of Benchmark Models for Forecasting:

    Incorporating Established Forecasting Models: It would strengthen the analysis to include well-known models for time-series forecasting in the experiments, such as SOTA models from recent M-competitions (e.g., Smyl, Slawek, Grzegorz Dudek, and Paweł Pełka. "ES-dRNN: a hybrid exponential smoothing and dilated recurrent neural network model for short-term load forecasting." IEEE transactions on neural networks and learning systems (2023)). These models could serve as benchmarks for comparison, adding credibility to the experimental findings.

  3. Novel Initial Value Conditions in Forecasting Tasks:

    Generalizability Across Initial Conditions: The current experimental setup does not appear to include tests on dynamics with novel initial conditions? This raises concerns about the model's reliance on specific initial values. Evaluating the model’s performance on dynamics with varied initial conditions would clarify whether it is generalizable or inherently tied to specific initial states.

  4. Discuss potential advantages or relevant biases from language models that may transfer well to dynamical systems modeling.

Comment

| System | Horizon | GNS | NDCN | STGODE | MTGODE | PDEDER | PDEDER-nopre | PDEDER-frz | PDEDER-sys |
|---|---|---|---|---|---|---|---|---|---|
| NOAA | 24 | 17.798 | 8.817 | 3.129 | 5.074 | 17.031 | 15.150 | 19.176 | 17.995 |
|  | 48 | 20.415 | 13.585 | 2.682 | 6.263 | 21.904 | 17.113 | 21.112 | 22.900 |
|  | 96 | 22.065 | 15.519 | 2.853 | 5.623 | 26.829 | 22.450 | 23.307 | 27.712 |
|  | 192 | 21.805 | 15.616 | 4.433 | 6.472 | 24.321 | 20.795 | 22.598 | 24.067 |
|  | 336 | 21.309 | 14.823 | 3.735 | 6.714 | 22.113 | 17.663 | 20.318 | 21.997 |
|  | 720 | 21.129 | 13.466 | 3.760 | 6.312 | 16.534 | 13.298 | 17.635 | 15.774 |

Q2. About the novel initial value conditions when forecasting.

We are grateful that this missing issue was pointed out. In practice, we generate $M_s$ sets of observations with random initial values for each parameter setting. We had missed stating this important point and modified the corresponding paragraphs in our latest version (see Benchmark Generation, the pre-training objective function Eq. 3, and Table 1 of benchmark statistics).


Q3. Potential advantages or biases from LMs to dynamics modeling.

One of the advantages of transferring a PLM to dynamics is that we can utilize the sequence forecasting capacity of the Transformer. Analogous to the usage of pre-trained language models, our pre-trained PDEDER concentrates on how to learn better dynamics-enriched representations, and the dynamics modeling module can be seen as analogous to the classification or regression head when fine-tuning a language model for downstream tasks. After pre-training the embedder, we can learn generalizable embeddings for observations from any dynamical system, and these embeddings can be used to approximate dynamics by fine-tuning with any specific dynamics modeling method. In this way, our embedder pre-training process makes it easier to approximate dynamics models.

The variety and distinctiveness of the systems ensure the capability of learning generalizable dynamics-enriched embeddings. When collecting benchmarks, we set various hyper-parameters, including system-specific parameters, numbers of objects and sequence lengths, to generate distinct dynamics observations, which ensures the diversity of dynamics characteristics. The random initialization of initial states also contributes to the generalizability of the learnt embeddings.


We hope to hear back from you if you have further questions.

Comment

I would like to thank the authors for their hard work. By adding MRAE, they have tried to address my concern about inappropriate metrics for dynamical systems. If one inspects all the results, two possible conclusions can be derived: (i) in the majority of settings their method now does not show the best performance. That is not necessarily bad if one has a contribution that improves understanding of the problem. (ii) MRAE can be very high, even a few orders of magnitude larger than 100% relative error. This brings me to the second problem: the authors do not seem to understand how bad the forecasts for some dynamics really are, which implies that MSE and MAE were only showing an illusion of good performance. If one looked at time-series visualizations of ground-truth vs. forecast trajectories, one would see the problem of super large MRAE directly; e.g., in Table 9, MSE=0.116 and MAE=0.168, but MRAE is 15286.5. The reason the absolute errors are small is a scale issue.

The visualization of trajectories, e.g., Fig. 2 on page 19, is not done in a professional way for a serious publication (font size, values not readable, axes not labeled).

The authors also do not really test the model on different initial conditions. When one thinks about real-world applications, this becomes a problem. Again, this is not necessarily a problem if the paper improved our knowledge of modelling dynamics with neural systems. But this paper tries to do too many things at the same time without paying enough attention to all the details, and that is why I cannot increase my score. Overall, interesting research, but not well done; it needs a few more rounds of polishing and a critical view of the main contributions.

Comment

Thank you so much for your professional and valuable comments.

Q1. About the extremely large MRAE.

In the past few days, we carefully compared the source code and experimental settings of our method and the baselines that adopt T-Drive and other realistic datasets as benchmarks. We found that the extremely large MRAE on these datasets is caused by how missing values (stored as 0 in the dataset) are handled. For example, more than 20% of the data points are missing and stored as 0 in T-Drive, CHIBike and NYCTaxi. The original baseline methods masked these missing targets and therefore report non-abnormal results, while we did not mask the missing values and computed MRAE as $\mathrm{mean}\left(\frac{|\hat{y}-y|}{|y+10^{-8}|}\right)$. Therefore, the existence of zero targets leads to rather large results. To solve this, following STGODE, we masked the missing values and re-computed the MRAE results. Due to time limitations, we present the detailed results of our variants on T-Drive, CHIBike and NYCTaxi below.

| Dataset | PDEDER | PDEDER-nopre | PDEDER-frz |
|---|---|---|---|
| TDrive | 0.395 | 0.341 | 0.425 |
| CHIBike | 0.744 | 0.712 | 0.702 |
| NYCTaxi | 0.401 | 0.364 | 0.428 |

Q2. About the visualization figures.

We re-drew the visualization of trajectories in our PDF as you suggested. Due to time limitations, we will present more visualizations on more systems in a future version.


Q3. About testing on different initial conditions.

We are sorry that we didn't clearly explain how we address this in our latest response. Actually, we examined this problem in every task. The results presented in all tables are averaged over multiple trajectories with different initial values. When generating trajectories for each system, we generate $M_s$ samples with different randomly initialized values. During fine-tuning, we use all $M_s$ samples to fine-tune one model per system and evaluate it, and we report the averaged results in our paper.


Finally, we will work hard to revise our paper as you suggested and thank you so much for your professional comments which help us a lot to improve our paper!

Comment

| System | Horizon | GNS | NDCN | STGODE | MTGODE | PDEDER | PDEDER-nopre | PDEDER-frz | PDEDER-sys |
|---|---|---|---|---|---|---|---|---|---|
| SD | 10% | 4.052 | 5.532 | 3.838 | 2.285 | 2.958 | 3.099 | 3.018 | 3.029 |
|  | 20% | 3.489 | 6.573 | 3.526 | 2.388 | 4.927 | 5.086 | 4.996 | 4.941 |
|  | 50% | 3.391 | 7.190 | 3.228 | 2.113 | 4.018 | 4.023 | 4.043 | 4.016 |
|  | 70% | 3.241 | 7.637 | 3.120 | 1.976 | 3.749 | 3.723 | 3.780 | 3.723 |
|  | 80% | 3.225 | 7.557 | 3.084 | 2.014 | 3.677 | 3.623 | 3.712 | 3.643 |
|  | 100% | 3.282 | 7.520 | 3.267 | 1.929 | 3.613 | 3.562 | 3.683 | 3.581 |
| NYCTaxi | 24 | 51.990 | 72.023 | 69.029 | 30.944 | 112.286 | 113.806 | 83.433 | 114.767 |
|  | 48 | 36.132 | 41.951 | 40.947 | 17.091 | 59.510 | 60.461 | 44.679 | 60.569 |
|  | 96 | 43.938 | 51.430 | 43.910 | 13.353 | 58.008 | 60.900 | 47.414 | 57.876 |
|  | 192 | 53.038 | 58.848 | 71.913 | 10.608 | 50.885 | 53.787 | 44.304 | 50.721 |
|  | 336 | 62.130 | 69.981 | 70.197 | 11.105 | 49.687 | 52.057 | 46.323 | 51.752 |
|  | 720 | 62.789 | 63.802 | 69.971 | 11.730 | 56.289 | 56.623 | 52.163 | 58.768 |
| CHIBike | 24 | 15.736 | 11.287 | 20.722 | 13.046 | 17.738 | 23.882 | 19.476 | 19.628 |
|  | 48 | 49.071 | 54.795 | 23.159 | 7.855 | 18.073 | 23.543 | 18.030 | 19.136 |
|  | 96 | 39.272 | 40.558 | 62.846 | 14.586 | 45.471 | 52.778 | 43.365 | 42.505 |
|  | 192 | 53.351 | 54.833 | 76.555 | 15.592 | 57.547 | 67.644 | 56.502 | 53.260 |
|  | 336 | 60.742 | 72.382 | 98.041 | 13.229 | 61.071 | 69.650 | 60.060 | 56.850 |
|  | 720 | 89.368 | 110.547 | 136.728 | 17.090 | 86.629 | 114.281 | 84.584 | 97.035 |
| Tdrive | 24 | 7574.2 | 15736.0 | 27129.3 | 10453.3 | 15286.5 | 19079.9 | 14473.9 | 15618.7 |
|  | 48 | 12647.7 | 16404.9 | 30402.6 | 9791.7 | 16292.3 | 20283.7 | 15643.0 | 16655.2 |
|  | 96 | 15117.7 | 16976.4 | 28853.0 | 10139.2 | 17015.3 | 20387.7 | 16497.6 | 17389.7 |
|  | 192 | 14503.9 | 18483.8 | 26906.7 | 8501.7 | 16026.1 | 18704.0 | 15550.5 | 16384.3 |
|  | 336 | 14032.6 | 18978.9 | 23672.2 | 6640.1 | 14970.5 | 17032.2 | 14556.4 | 15269.7 |
|  | 720 | 13592.2 | 19281.9 | 17588.1 | 4550.2 | 14584.5 | 16234.7 | 14261.4 | 14959.7 |
Comment

| System | Horizon | GNS | NDCN | STGODE | MTGODE | PDEDER | PDEDER-nopre | PDEDER-frz | PDEDER-sys |
|---|---|---|---|---|---|---|---|---|---|
| PEMS03 | 24 | 8.454 | 2.772 | 6.150 | 2.181 | 3.235 | 3.684 | 3.441 | 3.560 |
|  | 48 | 7.937 | 6.223 | 7.615 | 3.276 | 5.120 | 5.774 | 5.535 | 5.698 |
|  | 96 | 7.184 | 5.665 | 6.767 | 2.787 | 6.089 | 6.842 | 6.283 | 6.698 |
|  | 192 | 7.096 | 6.167 | 6.655 | 3.075 | 8.279 | 9.395 | 8.364 | 8.922 |
|  | 336 | 6.993 | 5.582 | 7.959 | 2.689 | 7.562 | 8.728 | 7.494 | 8.140 |
|  | 720 | 8.358 | 5.651 | 7.027 | 2.469 | 7.290 | 8.446 | 7.427 | 7.726 |
| PEMS04 | 24 | 4.049 | 6.577 | 4.554 | 2.481 | 4.007 | 4.484 | 3.975 | 3.946 |
|  | 48 | 3.813 | 8.405 | 6.532 | 2.563 | 5.163 | 5.758 | 5.173 | 5.165 |
|  | 96 | 4.201 | 11.508 | 5.515 | 2.444 | 6.158 | 6.948 | 6.221 | 6.177 |
|  | 192 | 4.092 | 12.697 | 5.551 | 2.384 | 6.079 | 6.871 | 6.235 | 6.113 |
|  | 336 | 4.168 | 14.319 | 5.520 | 2.596 | 5.964 | 6.748 | 6.105 | 6.034 |
|  | 720 | 4.317 | 17.939 | 5.688 | 2.508 | 5.916 | 6.630 | 5.899 | 6.094 |
| PEMS07 | 24 | 4.080 | 6.308 | 3.342 | 1.575 | 4.472 | 5.075 | 4.553 | 4.123 |
|  | 48 | 4.099 | 6.438 | 3.320 | 1.551 | 5.353 | 6.265 | 5.601 | 5.205 |
|  | 96 | 4.455 | 7.255 | 3.070 | 1.663 | 6.687 | 7.996 | 7.013 | 6.815 |
|  | 192 | 4.471 | 7.332 | 3.439 | 1.796 | 6.500 | 7.721 | 6.764 | 6.721 |
|  | 336 | 4.663 | 7.349 | 3.359 | 1.817 | 5.771 | 6.809 | 5.993 | 6.063 |
|  | 720 | 4.788 | 7.609 | 3.259 | 1.767 | 6.644 | 7.525 | 6.416 | 7.261 |
| PEMS08 | 24 | 2.788 | 8.452 | 3.725 | 3.145 | 8.177 | 9.626 | 8.323 | 8.631 |
|  | 48 | 3.150 | 10.270 | 3.553 | 2.839 | 7.441 | 8.647 | 7.589 | 7.731 |
|  | 96 | 3.638 | 11.651 | 4.006 | 2.619 | 7.716 | 9.096 | 7.804 | 7.954 |
|  | 192 | 3.887 | 14.057 | 4.162 | 2.768 | 8.775 | 10.622 | 8.884 | 8.975 |
|  | 336 | 3.784 | 13.029 | 4.040 | 2.721 | 8.102 | 9.663 | 8.191 | 8.254 |
|  | 720 | 3.977 | 12.479 | 4.507 | 3.111 | 8.050 | 9.084 | 8.109 | 8.036 |
Comment

| System | Horizon | GNS | NDCN | STGODE | MTGODE | PDEDER | PDEDER-nopre | PDEDER-frz | PDEDER-sys |
|---|---|---|---|---|---|---|---|---|---|
| Gene | 10% | 1.854 | 0.645 | 2.838 | 0.974 | 1.528 | 1.790 | 2.342 | 1.652 |
|  | 20% | 1.984 | 0.870 | 3.010 | 1.038 | 1.499 | 1.704 | 2.409 | 1.590 |
|  | 50% | 2.199 | 2.048 | 2.332 | 1.567 | 1.796 | 1.817 | 3.461 | 1.784 |
|  | 70% | 2.136 | 2.897 | 2.207 | 1.451 | 1.928 | 1.873 | 3.963 | 1.891 |
|  | 80% | 2.067 | 3.346 | 2.252 | 1.406 | 1.985 | 1.894 | 4.092 | 1.928 |
|  | 100% | 2.111 | 3.966 | 2.282 | 1.357 | 2.268 | 2.066 | 4.638 | 2.243 |
| ShallowWater | 10% | 1.057 | 0.804 | 0.893 | 1.306 | 1.611 | 0.732 | 0.923 | 0.965 |
|  | 20% | 1.049 | 1.741 | 1.232 | 1.204 | 2.183 | 1.450 | 1.565 | 1.643 |
|  | 50% | 1.023 | 1.303 | 1.021 | 1.132 | 1.930 | 1.071 | 1.406 | 1.406 |
|  | 70% | 1.019 | 1.334 | 1.033 | 1.125 | 1.938 | 1.065 | 1.421 | 1.409 |
|  | 80% | 1.017 | 1.273 | 1.008 | 1.120 | 1.866 | 1.038 | 1.380 | 1.364 |
|  | 100% | 1.015 | 1.355 | 1.122 | 1.158 | 2.106 | 1.315 | 1.620 | 1.612 |
| 2D_DiffReac | 10% | 10.476 | 24.661 | 1.046 | 5.338 | 4.918 | 5.565 | 4.877 | 5.508 |
|  | 20% | 6.732 | 15.398 | 1.080 | 3.408 | 4.965 | 4.966 | 4.391 | 5.028 |
|  | 50% | 5.386 | 11.152 | 1.108 | 2.744 | 4.292 | 3.619 | 3.440 | 3.964 |
|  | 70% | 5.156 | 11.343 | 1.146 | 2.576 | 4.123 | 3.628 | 3.468 | 3.819 |
|  | 80% | 4.989 | 11.107 | 1.135 | 2.433 | 3.984 | 3.463 | 3.355 | 3.659 |
|  | 100% | 4.775 | 10.366 | 1.128 | 2.387 | 3.751 | 3.381 | 3.224 | 3.548 |
| LA | 10% | 2.552 | 3.696 | 3.387 | 2.873 | 2.405 | 2.787 | 2.390 | 2.503 |
|  | 20% | 2.574 | 3.408 | 3.252 | 2.126 | 2.245 | 2.655 | 2.240 | 2.358 |
|  | 50% | 2.625 | 3.405 | 3.133 | 1.780 | 2.087 | 2.456 | 2.098 | 2.191 |
|  | 70% | 2.666 | 3.338 | 3.092 | 1.719 | 2.039 | 2.380 | 2.046 | 2.134 |
|  | 80% | 2.793 | 3.234 | 3.007 | 1.718 | 2.019 | 2.340 | 2.027 | 2.103 |
|  | 100% | 2.700 | 3.090 | 2.907 | 1.707 | 1.988 | 2.293 | 1.990 | 2.067 |
Comment

Q1. Adding forecasting visualizations, new baseline methods and the evaluation metric MRAE.

Following your suggestion, we added these empirical studies in our latest version. Details are presented below and in our updated PDF version. We kindly argue that the seasonal characteristics considered in ES-dRNN are not present in the dynamics systems we adopted, and we will try to consider these characteristics in our future research.

results of MRAE:

| System | Horizon | GNS | NDCN | STGODE | MTGODE | PDEDER | PDEDER-nopre | PDEDER-frz | PDEDER-sys |
|---|---|---|---|---|---|---|---|---|---|
| Mutualistic | 10% | 2.840 | 1.031 | 2.875 | 1.297 | 5.702 | 6.379 | 5.949 | 6.136 |
|  | 20% | 4.281 | 2.402 | 2.584 | 1.302 | 5.820 | 6.491 | 6.075 | 6.208 |
|  | 50% | 3.221 | 5.202 | 1.985 | 1.118 | 3.675 | 4.140 | 3.896 | 3.947 |
|  | 70% | 2.599 | 5.361 | 1.690 | 1.082 | 2.879 | 3.256 | 3.040 | 3.097 |
|  | 80% | 2.405 | 5.391 | 1.598 | 1.071 | 2.625 | 2.972 | 2.762 | 2.824 |
|  | 100% | 2.134 | 6.017 | 1.470 | 1.058 | 2.259 | 2.564 | 2.359 | 2.430 |
| Heat | 10% | 2.613 | 0.542 | 3.114 | 1.847 | 0.307 | 0.910 | 0.782 | 0.639 |
|  | 20% | 3.192 | 0.478 | 4.511 | 2.088 | 0.264 | 0.825 | 0.667 | 0.593 |
|  | 50% | 9.268 | 0.920 | 8.326 | 8.384 | 0.885 | 3.643 | 1.438 | 1.298 |
|  | 70% | 19.715 | 1.418 | 15.480 | 9.091 | 3.719 | 12.385 | 3.055 | 3.448 |
|  | 80% | 22.421 | 1.644 | 17.535 | 9.893 | 4.536 | 13.063 | 3.975 | 4.136 |
|  | 100% | 26.257 | 2.008 | 19.346 | 11.070 | 6.581 | 14.490 | 7.144 | 6.230 |
| 2D_CFD | 10% | 8.452 | 28.476 | 1.544 | 12.435 | 1.158 | 1.170 | 1.196 | 1.069 |
|  | 20% | 8.384 | 34.153 | 1.492 | 12.285 | 1.212 | 1.221 | 1.249 | 1.122 |
|  | 50% | 11.434 | 39.077 | 1.671 | 13.364 | 1.539 | 1.704 | 1.734 | 1.389 |
|  | 70% | 12.306 | 37.658 | 1.697 | 12.581 | 1.715 | 1.878 | 1.872 | 1.590 |
|  | 80% | 12.745 | 38.380 | 1.727 | 14.436 | 1.861 | 2.044 | 2.003 | 1.731 |
|  | 100% | 14.753 | 41.295 | 1.717 | 16.217 | 2.118 | 2.325 | 2.248 | 1.993 |
| DarcyFlow | 10% | 21.489 | 1.069 | 10.887 | 21.086 | 1.404 | 3.679 | 2.333 | 1.376 |
|  | 20% | 20.680 | 1.084 | 9.532 | 26.700 | 1.392 | 3.131 | 2.316 | 1.401 |
|  | 50% | 21.351 | 1.169 | 7.970 | 28.617 | 1.486 | 2.156 | 2.479 | 1.476 |
|  | 70% | 20.978 | 1.308 | 7.462 | 28.951 | 1.468 | 2.059 | 2.538 | 1.445 |
|  | 80% | 21.000 | 1.452 | 7.384 | 29.882 | 1.449 | 2.079 | 2.549 | 1.428 |
|  | 100% | 20.839 | 11.379 | 7.100 | 28.746 | 1.426 | 2.227 | 2.594 | 1.405 |
Official Review (Rating: 6)

This paper proposes a generalized framework to learn system dynamics across different settings, by utilizing a pretrained language model from massive observational data, and jointly fine-tuning the pretrained language model and a Graph ODE-based neural simulator. The proposed PDEDER is pre-trained on 153 sets of observations from 24 complex systems, using a pre-trained language model updated via tokenization techniques. Experiments evaluate PDEDER on 18 dynamic systems for long/short-term forecasting in both in-domain and cross-domain settings.

Strengths

  1. The proposed generalized pre-trained dynamics encoder is well-motivated and technically sound.

  2. The proposed approach achieves good in-domain and cross-domain performance, highlighting its generalization ability.

Weaknesses

  1. The writing needs further improvement. For example, citations in the main text should sometimes be \citep (lines 36-39, for example, with references within brackets) instead of \cite. Also, in the problem setting section, can the authors clarify whether the graph structure (edges) is fixed or evolves over time?

  2. I feel the experiments can be further improved: the baselines are all neural ODE-based approaches. However, for dynamical system modeling, there are also many discrete neural simulators [1]. The authors are suggested to justify why these approaches are not compared in this paper. As mentioned in the abstract, the proposed framework should be easily coupled with any dynamics modeling method (besides neural ODEs). Also, there are works [2] that learn a generalized neural simulator trained on multiple systems. It is also suggested to include them in the paper for a more comprehensive comparison.

[1] Learning to Simulate Complex Physics with Graph Networks.

[2] Generalizing Graph ODE for Learning Complex System Dynamics across Environments

Questions

  1. For the model implementation, I wonder if the model performance will be largely affected by the dynamics modeling module during the fine-tuning stage, such as changing to discrete GNNs or training with one-step/multi-step losses?

  2. What would be the runtime of the proposed method compared to others?

Ethics Concerns

NA

Comment

Thanks for your valuable comments. Here we respond to your comments and address the issues.

Q1. The usage of citation format.

Following your suggestion, we modified the format of citations in our updated paper (see the first paragraph of Introduction).


Q2. If the graph structure (edges) are fixed or evolve over time.

In this paper, we focus on dynamics systems with a fixed interacting graph as an early attempt in our research line. However, we could also approximate dynamics with evolving graph structures by substituting the dynamics learner with any specific method, including white-box and black-box learners, continuous and discrete learners, and one-step and multi-step learners.


Q3. About adding baseline methods.

Following your suggestion, we added GNS as a baseline method, and the detailed results are presented below. It is a pity that the source code of GG-ODE was unavailable before we responded to the comments. We will try to reproduce GG-ODE and compare with it in the next version.

Results of GNS:

| System | Horizon | MSE | MAE | MRAE |
|---|---|---|---|---|
| Mutualistic | 10% | 0.328 | 0.475 | 2.840 |
|  | 20% | 0.520 | 0.609 | 4.281 |
|  | 50% | 0.855 | 0.770 | 3.221 |
|  | 70% | 0.912 | 0.806 | 2.599 |
|  | 80% | 0.930 | 0.817 | 2.405 |
|  | 100% | 0.956 | 0.833 | 2.134 |
| Heat | 10% | 0.483 | 0.545 | 2.613 |
|  | 20% | 0.498 | 0.558 | 3.192 |
|  | 50% | 0.512 | 0.579 | 9.268 |
|  | 70% | 0.518 | 0.587 | 19.715 |
|  | 80% | 0.519 | 0.589 | 22.421 |
|  | 100% | 0.516 | 0.589 | 26.257 |
| 2D_CFD | 10% | 0.486 | 0.483 | 8.452 |
|  | 20% | 0.494 | 0.480 | 8.384 |
|  | 50% | 0.465 | 0.449 | 11.434 |
|  | 70% | 0.444 | 0.429 | 12.306 |
|  | 80% | 0.434 | 0.419 | 12.745 |
|  | 100% | 0.415 | 0.401 | 14.753 |
| DarcyFlow | 10% | 0.005 | 0.050 | 21.489 |
|  | 20% | 0.005 | 0.049 | 20.680 |
|  | 50% | 0.005 | 0.049 | 21.351 |
|  | 70% | 0.005 | 0.049 | 20.978 |
|  | 80% | 0.005 | 0.049 | 21.000 |
|  | 100% | 0.005 | 0.049 | 20.839 |
| Gene | 10% | 0.596 | 0.633 | 1.854 |
|  | 20% | 0.636 | 0.654 | 1.984 |
|  | 50% | 0.648 | 0.650 | 2.199 |
|  | 70% | 0.643 | 0.639 | 2.136 |
|  | 80% | 0.639 | 0.633 | 2.067 |
|  | 100% | 0.631 | 0.622 | 2.111 |
| ShallowWater | 10% | 0.955 | 0.565 | 1.057 |
|  | 20% | 0.978 | 0.572 | 1.049 |
|  | 50% | 0.991 | 0.577 | 1.023 |
|  | 70% | 0.993 | 0.578 | 1.019 |
|  | 80% | 0.994 | 0.578 | 1.017 |
|  | 100% | 0.995 | 0.579 | 1.015 |
| 2D_DiffReac | 10% | 1.168 | 0.850 | 10.476 |
|  | 20% | 1.146 | 0.841 | 6.732 |
|  | 50% | 1.135 | 0.837 | 5.386 |
|  | 70% | 1.133 | 0.837 | 5.156 |
|  | 80% | 1.132 | 0.836 | 4.989 |
|  | 100% | 1.131 | 0.836 | 4.775 |
评论
| System | Setting | MSE | MAE | ND |
| --- | --- | --- | --- | --- |
| LA | 10% | 0.992 | 0.789 | 2.552 |
| | 20% | 0.994 | 0.789 | 2.574 |
| | 50% | 0.995 | 0.789 | 2.625 |
| | 70% | 0.995 | 0.788 | 2.666 |
| | 80% | 0.995 | 0.788 | 2.793 |
| | 100% | 0.995 | 0.787 | 2.700 |
| SD | 10% | 1.027 | 0.741 | 4.052 |
| | 20% | 1.027 | 0.743 | 3.489 |
| | 50% | 1.027 | 0.746 | 3.391 |
| | 70% | 1.027 | 0.747 | 3.241 |
| | 80% | 1.027 | 0.748 | 3.225 |
| | 100% | 1.026 | 0.747 | 3.282 |
| NYCTaxi | 24 | 0.323 | 0.396 | 51.990 |
| | 48 | 0.330 | 0.400 | 36.132 |
| | 96 | 0.336 | 0.403 | 43.938 |
| | 192 | 0.341 | 0.407 | 53.038 |
| | 336 | 0.341 | 0.407 | 62.130 |
| | 720 | 0.341 | 0.407 | 62.789 |
| CHIBike | 24 | 0.719 | 0.259 | 15.736 |
| | 48 | 0.720 | 0.258 | 49.071 |
| | 96 | 0.720 | 0.258 | 39.272 |
| | 192 | 0.721 | 0.259 | 53.351 |
| | 336 | 0.722 | 0.259 | 60.742 |
| | 720 | 0.723 | 0.259 | 89.368 |
| Tdrive | 24 | 0.225 | 0.266 | 7574.169 |
| | 48 | 0.250 | 0.283 | 12647.690 |
| | 96 | 0.269 | 0.296 | 15117.720 |
| | 192 | 0.284 | 0.306 | 14503.859 |
| | 336 | 0.309 | 0.324 | 14032.609 |
| | 720 | 0.350 | 0.352 | 13592.221 |
| PEMS03 | 24 | 0.969 | 0.803 | 8.454 |
| | 48 | 1.022 | 0.826 | 7.937 |
| | 96 | 1.088 | 0.856 | 7.184 |
| | 192 | 1.141 | 0.881 | 7.096 |
| | 336 | 1.120 | 0.871 | 6.993 |
| | 720 | 1.138 | 0.879 | 8.358 |
| PEMS04 | 24 | 1.032 | 0.689 | 4.049 |
| | 48 | 1.029 | 0.688 | 3.813 |
| | 96 | 1.027 | 0.687 | 4.201 |
| | 192 | 1.026 | 0.687 | 4.092 |
| | 336 | 1.026 | 0.687 | 4.168 |
| | 720 | 1.026 | 0.687 | 4.317 |
| PEMS07 | 24 | 1.104 | 0.826 | 4.080 |
| | 48 | 1.098 | 0.824 | 4.099 |
| | 96 | 1.092 | 0.821 | 4.455 |
| | 192 | 1.090 | 0.820 | 4.471 |
| | 336 | 1.089 | 0.819 | 4.663 |
| | 720 | 1.088 | 0.819 | 4.788 |
| PEMS08 | 24 | 0.933 | 0.688 | 2.788 |
| | 48 | 0.934 | 0.688 | 3.150 |
| | 96 | 0.935 | 0.688 | 3.638 |
| | 192 | 0.935 | 0.688 | 3.887 |
| | 336 | 0.935 | 0.688 | 3.784 |
| | 720 | 0.935 | 0.687 | 3.977 |
| NOAA | 24 | 0.567 | 0.560 | 17.798 |
| | 48 | 0.603 | 0.580 | 20.415 |
| | 96 | 0.700 | 0.626 | 22.065 |
| | 192 | 0.900 | 0.710 | 21.805 |
| | 336 | 0.914 | 0.715 | 21.309 |
| | 720 | 0.907 | 0.711 | 21.129 |
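For readers unfamiliar with the metrics in the tables above, a minimal numpy sketch of common formulations is given below. MRAE (mean relative absolute error) and ND (normalized deviation) are written here in their usual forms; the exact definitions used in the paper may differ, so this is illustrative only.

```python
import numpy as np

def metrics(pred, true, eps=1e-8):
    """Common formulations; assumed here, not necessarily the paper's exact definitions."""
    mse = np.mean((pred - true) ** 2)
    mae = np.mean(np.abs(pred - true))
    mrae = np.mean(np.abs(pred - true) / (np.abs(true) + eps))       # mean relative absolute error
    nd = np.sum(np.abs(pred - true)) / (np.sum(np.abs(true)) + eps)  # normalized deviation
    return mse, mae, mrae, nd
```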

Comment

Q4. About the effect of the dynamics modeling module on model performance.

As presented above, the dynamics modeling module can be substituted with any strong learner. We replaced the GNN-based dynamics learner with the white-box SINDy, and the results remain comparable. This indicates that our pre-training process has learned effective representations of the observations, which benefit dynamics modeling regardless of the specific dynamics learner. Furthermore, we would argue that our main contribution lies in pre-training to learn better representations. Analogous to pre-trained language models, our pre-trained PDEder concentrates on learning better representations, while the dynamics modeling module plays a role analogous to the classification or prediction head used when fine-tuning a language model for downstream tasks.

| System | PDEDER short-term (MSE / MAE) | PDEDER long-term (MSE / MAE) | PDEDER+SINDy short-term (MSE / MAE) | PDEDER+SINDy long-term (MSE / MAE) |
| --- | --- | --- | --- | --- |
| Mutualistic | 0.362 / 0.452 | 0.809 / 0.675 | 1.014 / 1.014 | 0.334 / 0.334 |
| Heat | 0.003 / 0.045 | 0.006 / 0.052 | 0.886 / 0.884 | 1.577 / 1.586 |
| 2D CFD | 0.223 / 0.303 | 0.152 / 0.236 | 1.001 / 0.984 | 1.139 / 1.164 |
| DarcyFlow | 0.001 / 0.020 | 0.001 / 0.021 | 0.858 / 0.851 | 1.103 / 1.104 |
| Gene | 0.035 / 0.136 | 0.076 / 0.172 | 0.613 / 0.537 | 0.783 / 0.783 |
| ShallowWater | 0.674 / 0.358 | 1.145 / 0.527 | 0.538 / 0.463 | 1.040 / 1.047 |
| 2D DiffReac | 0.960 / 0.723 | 1.057 / 0.794 | 0.126 / 0.126 | 0.807 / 0.808 |
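As an illustration of the SINDy-based fine-tuning described above, the sketch below fits a sparse dynamics model on a latent trajectory using the pysindy library. The toy trajectory, shapes, and hyper-parameters are assumptions for illustration and do not reflect the paper's actual fine-tuning pipeline:

```python
import numpy as np
import pysindy as ps

# Stand-in for a latent trajectory produced by a pre-trained embedder:
# here a toy 2-D damped oscillator of shape (T, d) with uniform time step dt.
dt = 0.01
t = np.arange(0, 10, dt)
z_traj = np.stack([np.exp(-0.1 * t) * np.cos(t),
                   np.exp(-0.1 * t) * np.sin(t)], axis=1)

# Fit a sparse (white-box) model dz/dt = Theta(z) @ Xi on the latent states.
model = ps.SINDy(
    optimizer=ps.STLSQ(threshold=0.05),
    feature_library=ps.PolynomialLibrary(degree=2),
)
model.fit(z_traj, t=dt)
model.print()  # inspect the discovered latent-space equations

# Roll the fitted model forward from the last observed latent state.
z_future = model.simulate(z_traj[-1], t[-1] + np.arange(0, 1.0, dt))
print(z_future.shape)
```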

Q5. Runtime comparison.

We present per-epoch running time comparisons for each method on all dynamical systems. Overall, our method runs faster than the baselines.

| System | PDEDER | NDCN | STGODE | MTGODE | GNS |
| --- | --- | --- | --- | --- | --- |
| mutualistic | 39s | 183s | 69s | 57s | 2995s |
| heat | 68s | 193s | 126s | 43s | 2647s |
| 2D CFD | 12s | 17s | 15s | 12s | 358s |
| DarcyFlow | 60s | 98s | 60s | 56s | 1649s |
| gene | 50s | 43s | 52s | 49s | 777s |
| ShallowWater | 31s | 94s | 60s | 56s | 1400s |
| 2D DiffReac | 42s | 32s | 74s | 66s | 625s |
| LA | 1s | 2s | 3s | 2s | 33s |
| SD | 1s | 2s | 4s | 2s | 31s |
| TDrive | 53s | 14s | 30s | 23s | 256s |
| CHIBike | 6s | 17s | 19s | 15s | 282s |
| NYCTaxi | 9s | 79s | 17s | 14s | 210s |
| PEMS03 | 26s | 127s | 110s | 108s | 417s |
| PEMS04 | 15s | 78s | 75s | 61s | 370s |
| PEMS07 | 64s | 136s | 191s | 185s | 493s |
| PEMS08 | 11s | 79s | 65s | 45s | 384s |
| NOAA | 7s | 31s | 23s | 18s | 307s |

We hope to hear back from you if you have further questions.

Comment

Thanks for the detailed response. My questions are mostly resolved. I hope the authors will incorporate these new results and discussions into the revised version. Regarding \citep vs. \cite, the issue also appears in other parts of the paper; please revise them accordingly. I also feel it is necessary to include a detailed discussion of neural simulators, covering both discrete and continuous approaches, as well as generalized neural simulators, in the paper. I have raised my score to 6.

Comment

Thank you so much for raising the score. We will carefully revise and improve our paper as you suggested!

Comment

We thank all reviewers for their careful consideration. We greatly appreciate the positive comments and address the major concerns below.

Q1. About the dimension reduction in Data Projection.

We thank all reviewers for pointing out this problem. We have modified this section to make it clearer. In practice, we adopt a system-specific flatten-linear layer $f(\cdot;\mathbf{W}_{dp}^{s})$ to align the feature dimensions. The detailed modification is presented below and has also been incorporated into our latest PDF version.

Data Projection. To handle the dimension diversity of states across different systems, we adopt a flatten-linear data projection module that aligns the observations by mapping them into the same dimension. For each patched token $\overline{\mathbf{x}}_{m,n}^{(in)}\in\mathbb{R}^{P_m \times L_p \times V_s}$, we first flatten it into $\overline{\mathbf{x}}_{m,n}^{(in)(fl)} \in\mathbb{R}^{P_m \times (L_p \cdot V_s)}$, and then project it with a linear layer into the dimension $L_p$ shared by all systems, $\tilde{\mathbf{x}}_{m,n}^{(in)} = f(\overline{\mathbf{x}}_{m,n}^{(in)(fl)};\mathbf{W}_{dp}^{s})$, where $\mathbf{W}_{dp}^{s}\in\mathbb{R}^{(L_p \cdot V_s) \times L_p}$ denotes the system-specific trainable parameters.
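For illustration, a minimal PyTorch sketch of such a flatten-linear data projection might look as follows. The module and variable names are ours and the sizes are assumed; see the updated paper for the exact formulation:

```python
import torch
import torch.nn as nn

class DataProjection(nn.Module):
    """System-specific flatten-linear projection: maps patched tokens of shape
    (P_m, L_p, V_s) to (P_m, L_p) so that all systems share the same token width."""
    def __init__(self, patch_len: int, num_vars: int):
        super().__init__()
        # W_dp^s in R^{(L_p * V_s) x L_p}, one such layer per system s
        self.proj = nn.Linear(patch_len * num_vars, patch_len)

    def forward(self, x_patched: torch.Tensor) -> torch.Tensor:
        # x_patched: (P_m, L_p, V_s) -> flatten -> (P_m, L_p * V_s)
        x_flat = x_patched.flatten(start_dim=1)
        return self.proj(x_flat)  # (P_m, L_p)

# Example with assumed sizes: 16 patches of length 24 over 3 state variables.
proj_s = DataProjection(patch_len=24, num_vars=3)
tokens = torch.randn(16, 24, 3)
print(proj_s(tokens).shape)  # torch.Size([16, 24])
```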


Q2. About improving the experiments.

We thank all reviewers for helping us improve the empirical studies. Following your valuable comments, we improved our experiments in the following aspects:

  • Adding a baseline method, GNS [1];
  • Adding an evaluation metric, MRAE;
  • Adding forecasting visualizations;
  • Adding sensitivity studies on the hyper-parameters patch length and stride;
  • Adding a white-box dynamics learner, SINDy [2], for downstream fine-tuning, to demonstrate the interpretability of the embeddings generated by our proposed pre-trained embedder PDEder;
  • Adding runtime comparisons against the baselines;
  • Renaming "Ablation Study" to "Impact Evaluation of Pre-training on Downstream Dynamics Modeling" and revising its settings.

The details of improved experiments are presented below, and also modified in the updated PDF of our paper.

[1] Sanchez-Gonzalez, Alvaro, et al. Learning to Simulate Complex Physics with Graph Networks. In International Conference on Machine Learning, 2020: 8459-8468.

[2] Brunton, Steven L., Joshua L. Proctor, and J. Nathan Kutz. Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems. Proceedings of the National Academy of Sciences, 2016, 113(15): 3932-3937.

AC Meta-Review

The paper introduces a method for learning unified latent representations to model the dynamics of multiple physical phenomena. This unified encoding is intended to be utilized within an encode-process-decode framework for modeling temporal or spatio-temporal dynamics. Experiments are conducted on a series of dynamical systems.

In response to the reviewers' comments, the authors enhanced the initial version of the paper and provided additional experimental validation. However, concerns persist regarding the organization and clarity of the technical descriptions, as well as the technical contributions, such as the connection between the pre-training objective and the modeling of dynamics. Overall, this is an interesting contribution, but it remains preliminary and requires further refinement before publication.

Additional Comments from Reviewer Discussion

The main concerns pertain to organization, clarity, and technical contributions. Although the authors provided new experimental results, the majority of reviewers found these insufficient.

Final Decision

Reject