Generalizing Weather Forecast to Fine-grained Temporal Scales via Physics-AI Hybrid Modeling
Abstract
Reviews and Discussion
The paper proposes a physics-AI hybrid modeling framework for fine-grained weather forecasting. The authors propose to adaptively tune a PDE kernel together with a neural network as the encoder. Following Euler time stepping, the PDE kernel can perform fine-grained temporal forecasts, acting as the physics-guided modeling component.
Strengths
- The combination of AI and physics is crucial and novel for weather forecasting.
- Fine-grained weather forecasting is of interest for nowcasting and temporal downscaling.
Weaknesses
Some opportunities to improve:
- The experimental details are far from sufficient. What are the hyperparameters you use, besides the learning rate? How do you divide the validation and test sets? How do you divide input and label? What are the inputs and outputs to the model, and their sizes? What are the datasets' statistics, as none are introduced? What is the time cost or number of parameters of your model compared to other models? How do you obtain the results in the tables, since there are no error bars? Are these all based on a one-time run? Are they statistically significant, and what are the p-values? Table 3 does not demonstrate that the proposed model is the best model. Is there a specific reason why the 120-min forecast is especially good? Any pseudo-algorithm for understanding? Is there code for understanding and checking?
- The experimental results are neither good enough nor analyzed well enough. It looks strange in Figure 4 that FourCastNet is much worse than ECMWF-IFS, even though FourCastNet should have been better as reported in its original paper. This needs discussion to justify, or maybe it is due to the experimental setting; it is hard to know given the lack of experimental details pointed out above. The nowcast results basically suggest that the proposed model is no better than previous models, except at 120 min, which is a less realistic setting for real life. Figure 6 reveals very little information about the comparison between models: all the errors look alike, and it is hard to know which model's prediction is better without ground truth. I am not convinced by the explanation of why the physics weight decays within each hour: unless the neural network counterpart's prediction does not change much, the ratio between physics and AI does not directly reflect how much each contributes. It could be that the AI part is predicting/contributing less even though its weight is surging; the weight alone does not mean anything.
- I want to highlight this problem in a separate paragraph. The ablation study is strange: it seems the physics part is not always helping. This might relate to how you use the PDE kernel and derive the PDEs. Your Appendix A needs to cite references and name the equations you are using. For example, how do you determine the coefficients for the constants in the equations? Shouldn't they be learned, since you cannot tell how big the friction coefficient is, let alone the whole term? Moreover, Eq. 14 seems incorrect; I cannot recall such an equation in fluid dynamics. It is the continuity equation if you change p to z. However, pressure level / geopotential / height are not strictly the same thing. I am concerned that they are assumed to be the same without any justification.
Questions
I encourage the authors to address my concerns listed in the weaknesses.
Limitations
Limitations are discussed.
Dear Reviewer,
Thank you for your thoughtful review! We are pleased that you appreciate the innovative combination of AI and physics. We will address your remaining questions below.
Q1: The experimental details are not sufficient.
Thank you for your suggestions. The table below outlines the experimental details:
| Hyperparameter | Value |
|---|---|
| Max epoch | 50 |
| Batch size | 4x8(GPUs) |
| Learning rate | 5e-4 |
| Learning rate schedule | Cosine |
| Patch size | 4x4 |
| Embedding dimension | 1024 |
| MLP ratio | 4 |
| Activation function | GELU |
| Input (0-hour) | [4,69,128,256] |
| Output (1,3,6-hour) | [4,3,69,128,256] |
| Datasets | Training set | Validation set | Test set | Time resolution | Variable |
|---|---|---|---|---|---|
| WeatherBench | 1980-2014 | 2015 | 2017-2018 | 1h | tp,t2m,u10,v10,z,q,u,v,t |
| NASA | None | None | 2017-2018 | 30min | tp |
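For context, the RMSE values reported throughout our rebuttals follow the latitude-weighted convention commonly used for WeatherBench evaluation. A minimal sketch of this metric (a simplified paraphrase of the standard formula, not our exact evaluation code):

```python
import math

def lat_weighted_rmse(pred, truth, lats_deg):
    """Latitude-weighted RMSE over a [lat][lon] grid (lists of lists).

    Each latitude row is weighted by cos(latitude), normalized to a mean
    weight of 1, as in standard WeatherBench-style evaluation.
    """
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_w = sum(weights) / len(weights)
    weights = [w / mean_w for w in weights]

    total, count = 0.0, 0
    for w, pred_row, truth_row in zip(weights, pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += w * (p - t) ** 2
            count += 1
    return math.sqrt(total / count)
```

With a uniform error of 1 everywhere, the weighted RMSE is 1 regardless of the latitude grid, since the weights are normalized to mean 1.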
Q2: The time cost for your model.
As shown in the global response PDF, introducing the PDE kernel slightly increases training time, but the added computational cost is acceptable.
Q3: Error bars and p values.
The error bars of RMSE are displayed in the global response PDF.
We apply the Wilcoxon test to demonstrate that our model achieves a lower RMSE compared to the original model at the 95% confidence level (p-value<0.05).
| Lead time | 6-hour | 3-day | 4-day | 5-day |
|---|---|---|---|---|
| p-value t2m | 1.42e-14 | 1.42e-14 | 1.42e-14 | 1.42e-14 |
| p-value t850 | 1.42e-14 | 1.42e-14 | 2.84e-14 | 7.10e-14 |
| p-value u10 | 1.42e-14 | 1.42e-14 | 1.42e-14 | 2.84e-14 |
| p-value z500 | 8.04e-04 | 2.83e-02 | 5.68e-05 | 6.39e-08 |
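For reproducibility, the paired one-sided Wilcoxon signed-rank test can be sketched as below (a self-contained normal-approximation version; in practice a library routine such as `scipy.stats.wilcoxon` gives equivalent results):

```python
import math

def wilcoxon_signed_rank_p(x, y):
    """One-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns an approximate p-value for the alternative 'x < y'.
    Zero differences are dropped, matching the classic procedure.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    # Rank |d|, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    # Sum of ranks of negative differences: large when x << y.
    w_minus = sum(r for r, di in zip(ranks, d) if di < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_minus - mean) / sd
    # Upper-tail normal probability P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))
```

A small p-value supports the alternative that the first model's per-case RMSE is systematically lower than the second's.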
Q4: Table 3 does not demonstrate that the proposed model is the best model. Why 120-min forecast is especially good?
Firstly, it should be pointed out that our model did not fit 30- and 90-min predictions during training, but achieves generalization to them in a unified model. In other words, our model achieves SOTA on most metrics under a more stringent setting (without interpolation models).
Secondly, precipitation forecasts with a lead time of 120 min or even longer are common experimental settings [1] and play a crucial role in predicting disasters such as mudslides [2].
Thirdly, as the lead time increases, the forecasting difficulty increases, and the advantages of our model become more pronounced. We added the tp RMSE at 180 min to emphasize this.
| Model | tp RMSE@180-min ↓ |
|---|---|
| FourCastNet | 0.88 |
| ClimODE | 0.39 |
| Keisler | 0.43 |
| WeatherGFT (ours) | 0.28 |
Q5: Pseudo algorithm and code.
Part of the code for the PDE kernel is included in Appendix B of the paper. The complete code will be released after the paper is accepted.
Q6: FourCastNet should have been better than ECMWF-IFS in the original paper.
In the original FourCastNet paper [3], Figure 1 shows that the RMSE of z500 for FourCastNet is higher than that of ECMWF-IFS, indicating relatively poor prediction. Even FourCastNet v2 [4] does not surpass ECMWF-IFS in z500 forecast skill.
Q7: All the errors in Figure 6 look alike.
The red boxes in Figure 6 show that the predictions obtained by other AI models often have problems with smoothness and lack of extreme values. On the contrary, the predictions of our model have more details and more accurate extreme values.
Q8: Explanation on router weight.
To address your concern, we calculated the input norm for each router, as outlined in the global response PDF. The AI and physics feature norms remain consistent and similar (a 0.5:0.5 ratio), indicating the router's independence from both the PDE kernels and the attention blocks. This decoupling ensures that weight variations do not influence the inputs. Both the AI and physics branches output features of similar magnitude, while the router dynamically selects the better parts from them.
Q9: Physics is not always helping.
Using the PDE kernel without changing the training method does not necessarily improve RMSE. If we use only lead time = 6h labels for training supervision, the earlier PDE kernels of this deep network are not effectively trained due to vanishing gradients. After adding 1h and 3h supervision to the middle parts of the network, the role of the PDE kernel is better reflected.
In addition, the bias and energy metrics in the global response PDF indicate that integrating the PDE kernel effectively addresses the issue of energy decay with increasing lead time.
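The multi-lead-time supervision described above can be illustrated with a small sketch (the function and dictionary names here are hypothetical; the actual training uses tensor losses on the model's intermediate 1h/3h/6h outputs):

```python
def multi_lead_time_loss(predictions, labels, weights=None):
    """Weighted sum of per-lead-time MSE losses.

    predictions/labels: dicts mapping lead time in hours (e.g. 1, 3, 6)
    to a flat list of field values. Supervising intermediate lead times
    gives the earlier PDE kernels a direct gradient signal, instead of
    relying only on the final 6h loss.
    """
    if weights is None:
        weights = {t: 1.0 for t in predictions}
    total = 0.0
    for t, pred in predictions.items():
        label = labels[t]
        mse = sum((p - l) ** 2 for p, l in zip(pred, label)) / len(pred)
        total += weights[t] * mse
    return total
```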
Q10: How do you determine the coefficients for the constants?
Equation 7 is the kinematic equation from the basic atmospheric equations. The friction term in Equation 7 has a very small magnitude (on the order of $10^{-12}$), so it can be omitted in the calculation, as described in Section 2.4 of [5]. The Coriolis term in Equation 15 represents the Coriolis force, whose parameter is set as a constant following Section 4.6.3 of [5]. The other constants are also taken from [5].
Q11: Pressure level and geopotential height are not the same thing.
While we appreciate the reminder, we would like to clarify that we do not confuse the p-coordinate and the z-coordinate. Our PDEs are all converted to the p-coordinate, as described in Sections 1.6.1-1.6.2 of [5]. We will provide a more detailed derivation in later versions.
References
[1] Sønderby C K, Espeholt L, Heek J, et al. MetNet: A neural weather model for precipitation forecasting.
[2] Brunetti M T, Melillo M, Peruccacci S, et al. How far are we from the use of satellite rainfall products in landslide forecasting?
[3] Kurth T, Subramanian S, Harrington P, et al. FourCastNet: Accelerating global high-resolution weather forecasting using adaptive Fourier neural operators.
[4] Bonev B, Kurth T, Hundt C, et al. Spherical Fourier neural operators: Learning stable dynamics on the sphere.
[5] Holton J R. An Introduction to Dynamic Meteorology. Fourth edition.
If our responses have clarified your inquiries, we kindly ask for an update to your score. Thank you very much for your time.
Thank you for your response. I have read the rebuttal carefully. I raise my score to 5.
Thank you very much for enhancing the score. We appreciate your careful review once again!
In this paper, a hybrid model (a model that combines machine learning with physics) is demonstrated for nowcasting and medium-range weather forecasting. WeatherGFT uses machine-learned weightings that combine two successful methods for weather forecasting (machine learning and the traditional numerical, PDE-based method). The combination of both methods allows for significantly increased temporal resolution compared to purely data-driven methods (6-hourly vs 15 minutes). This approach also provides a new framework for combining physics-based methods with data-driven methods, which allows for the ability to differentiate through the hybrid model.
Strengths
Combining machine learning with physics-based approaches (i.e., hybrid modeling) is an exciting and highly researched topic for weather and climate. As mentioned in the paper, purely data-driven models are black-box systems even if they do perform well. Existing data-driven models are also too temporally coarse for some operational weather forecasting applications. This paper does a good job addressing both of these existing problems in the field.
Weaknesses
The novelty of the paper is not as strong as claimed. Other hybrid models that combine machine learning with physics-based approaches exist (see Arcomano et al. 2022 and 2023, Clark et al. 2022, and others). The Arcomano et al. 2022/2023 models even include a machine-learned weighting for the combination of the ML-based model and the physics-based model.
The time-embedding and multi-lead times in one model have also been demonstrated before, see Stormer (https://arxiv.org/abs/2312.03876) and MetNet (https://arxiv.org/abs/2306.06079).
Questions
In section 4.3, I don’t understand how FourCastNet, ClimODE, or Keisler was used for nowcasting. They all have a time step of 6 hours, how is it possible to get 30-minute forecasts using frame interpolation methods? Are you using interpolation between ERA5 (e.g. the initial conditions) and a 6-hour forecast from these models?
In section 4.3, how are the forecasts including WeatherGFT initialized? Do they all use ERA5?
I would like to see the effects of the PDE Kernel on other metrics. Does this inclusion of a physics kernel improve spectral bias or allow stability? What about the conservation of energy or momentum?
What is the computational cost during inference of having a PDE kernel compared to just the transformer by itself?
Limitations
Overall, the limitations of the paper are well laid out; however, some claims are not supported by the paper.
Claims about being the first to move away from fixed lead times for data-driven weather forecasting are not true. Stormer (https://arxiv.org/abs/2312.03876) and MetNet (https://arxiv.org/abs/2306.06079) have demonstrated this previously.
Claim “In addition, the prediction error of our model at the lead time of 6-hour is significantly smaller than that of the physical dynamic model ECMWF-IFS”. For Z500 this is not supported by Figure 4.
For Figure 5, the prediction of the subtropical high (it should be a subtropical ridge) isn't convincing. That seems to be a function of the contouring in Matplotlib, as the difference plots show a similar magnitude of errors.
Overall, if some of these claims are toned down and the appropriate citations are added, the manuscript will be improved.
Dear Reviewer,
Thank you for your insightful review and detailed feedback! We are glad for your recognition of the significance of our hybrid modeling of physics and AI.
We appreciate your feedback on specific claims in our paper, and will refine specific statements and add citations of relevant papers as necessary. In the following, we will address your remaining questions.
Q1: Comparisons with other hybrid models that combine machine learning with physics.
While there exist various methodologies for integrating AI and physics, our approach diverges in both methodology and focus from those mentioned. The referenced papers primarily integrate machine learning techniques into a complete dynamical model (e.g., an AGCM) to improve the forecast performance of that physics model, whereas our method focuses on combining PDE processes with neural networks rather than improving an existing physical model. In addition, our focus is on finer-scale physical modeling to generalize to finer-scale predictions without valid training labels, which differs from a focus on improving forecasting skill.
Q2: The time-embedding and multi-lead times in one model have also been demonstrated before, see Stormer and MetNet.
Thank you for your reminder. It should be noted that time-embedding and multi-lead times are not our core innovations (our core innovation lies in the physical modeling); they are technical choices in the model implementation. We will add citations to the papers related to these concepts.
However, we would like to emphasize that our methods differ from previous work. In Stormer and MetNet, there is no direct correlation between the number of network layers and the forecast lead time. In MetNet, the lead time condition serves as input to all network layers, while our network module (HybridBlock) evolves within a short timeframe without lead time conditions. In our approach, the lead time condition is fed only into the decoder to extract predictions from different network layers' outputs.
For multi-lead time training, the method of our paper is a completely different technology from the Multi-step finetuning in Stormer. Stormer's finetuning focuses on autoregressive prediction, and its network itself can only output predictions for the next single step. In contrast, our method has the capability to generate predictions for multiple steps within a single forward process.
Q3: How FourCastNet, ClimODE, or Keisler was used for nowcasting?
For the precipitation nowcasting experiment, we use 1-hour interval ERA5 to train FourCastNet, ClimODE, and Keisler, respectively, to obtain models that can make 1-hour predictions. (NOTE: In the medium-range forecast experiment, these models have a lead time of 6 hours during training.) However, as ERA5 lacks half-hour data, these models are unable to directly provide half-hour predictions. Therefore, we utilize additional interpolation models to interpolate the 1-hour predictions of these models to 30-min.
Q4: In section 4.3, how are the forecasts including WeatherGFT initialized? Do they all use ERA5?
The initial fields of all models are from ERA5. The only difference is that other models need to interpolate their prediction results to 30-min, while our method can directly get 30-min of prediction from the networks.
Q5: Does this inclusion of a physics kernel improve bias? What about the conservation of energy?
This is a good suggestion and helps us gain a more comprehensive understanding of the role of the PDE kernel. We measured bias and energy and found that the PDE kernel plays a positive role in maintaining energy, as shown in the global response PDF. We believe the preservation of energy is intimately linked to enhanced physical modeling, exemplified by the dynamic equations outlined in Equation 7. The calculation methods and related references for bias and energy are presented in the PDF.
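For clarity, the two diagnostics can be sketched in simplified form (flat, unweighted fields here; the exact latitude-weighted formulas and references are in the global response PDF):

```python
def global_bias(pred, truth):
    """Mean signed error; negative values indicate underestimation,
    the prevalent failure mode discussed above."""
    return sum(p - t for p, t in zip(pred, truth)) / len(pred)

def mean_kinetic_energy(u, v):
    """Mean kinetic energy per unit mass, 0.5*(u^2 + v^2), a simple
    proxy for tracking energy decay as lead time grows."""
    return sum(0.5 * (a * a + b * b) for a, b in zip(u, v)) / len(u)
```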
Q6: What is the computational cost of having a PDE kernel compared to just the transformer by itself?
As shown in the global response PDF, introducing the PDE kernel slightly increases training time, but the added computational cost is acceptable.
Q7: Claim “In addition, the prediction error of our model at the lead time of 6-hour is significantly smaller than that of the physical dynamic model ECMWF-IFS”. For Z500 this is not supported by Figure 4.
We apply the Wilcoxon test to demonstrate that our model achieves a lower RMSE compared to the ECMWF-IFS at the 95% confidence level (p-value<0.05). The table below indicates that the improvement in z500 is relatively smaller than that in other variables when compared to the ECMWF-IFS, which results in the z500 curve in Figure 4 closely resembling that of the ECMWF-IFS.
| Lead time | 6-hour | 3-day | 4-day | 5-day |
|---|---|---|---|---|
| p-value t2m | 1.42e-14 | 1.42e-14 | 1.42e-14 | 1.42e-14 |
| p-value t850 | 1.42e-14 | 1.42e-14 | 1.42e-14 | 1.42e-14 |
| p-value u10 | 1.42e-14 | 1.42e-14 | 1.42e-14 | 2.84e-14 |
| p-value z500 | 1.42e-14 | 1.25e-03 | 9.86e-04 | 6.26e-04 |
Q8: Figure 5 seems to be a function of the contouring in Matplotlib, as the difference plots show a similar magnitude of errors.
The color bar we employ is consistent and standardized, with its upper and lower bounds derived from the maximum prediction error across all forecasts. Examining the error visualizations in Figure 5, it is evident that our predictions exhibit reduced errors. The experiment depicted in Figure 4 quantitatively demonstrates that our model has a relatively lower RMSE. Moreover, as detailed in the response to Q7, the results displayed in Figure 4 are statistically significant.
We will introduce additional visualizations in the appendix to illustrate that our model yields comparatively smaller prediction errors.
If our responses have clarified your inquiries, we kindly ask for an update to your score. Thank you very much for your time.
I would like to thank the authors for addressing my concerns and answering my questions. If accepted I suggest adding some of the plots and or inference speed comparisons to the paper. I raise my score to a 7.
Thank you for enhancing the score. We will continue to improve our paper according to your valuable suggestions. Your meticulous review is greatly appreciated. Thank you once again.
The paper proposes WeatherGFT, a physics-AI hybrid model designed to generalize weather forecasts to finer temporal scales beyond the training dataset. By integrating PDE kernels for physical simulation and neural networks for adaptive bias correction, the model aims to provide accurate 30-minute forecasts using an hourly dataset. The lead time-aware training framework enhances the model's ability to generalize across multiple lead times, achieving state-of-the-art performance in both medium-range and nowcasting tasks.
Strengths
- The framework that fuses AI and PDEs is innovative and improves the model's generalizability.
- The model demonstrates generalization capabilities across time scales, achieving finer temporal resolutions (e.g., 30-minute forecasts) from coarser data.
- Extensive experiments validate the model's state-of-the-art performance across various forecasting tasks and lead times.
Weaknesses
- The paper is well-structured but could benefit from clearer explanations and analyses of the novel modules and specific contributions.
- The physics model relies on a limited set of PDEs for simulation, which may not fully capture the intricacies of real-world atmospheric dynamics.
- The baseline comparisons are limited, as many AI models available for weather forecasting have not been included.
Questions
- Can you further explain Equation 5 and how you formulate the differential and integral operators?
- How does the convolution layer work to align neural network features with physical features?
- How do you update the learnable factor r? How do you compute the weight used to draw Figure 1, since the weight is a vector?
- Does tp (hourly precipitation) have a related PDE? How do you deal with variables that are not related to any PDE?
- How fine-grained can the model achieve? Doing 30-minute forecasts with hourly data is impressive, but can the model achieve finer resolutions like 15 minutes or even 5 minutes?
Limitations
Yes.
Dear Reviewer,
Thank you for your thoughtful review and detailed feedback! We are delighted that you value the innovative fusion of AI and physics in our research. We have revised the paper according to your suggestions and will now respond to the remaining queries you have.
Q1: The paper is well-structured but could benefit from clearer analyses.
Thank you for your affirmation. We will add more quantitative and rigorous analysis to the paper to fully demonstrate our model. In the global response PDF, we added evaluations of two indicators, bias and energy. The results show that using the PDE kernel helps make the energy of the model's predicted fields more consistent.
Q2: The physics model relies on a limited set of PDEs.
We agree that exploring more PDEs is valuable. Currently, we have employed only a few PDEs, which already yields performance enhancements and demonstrates the effectiveness of our design for combining AI and PDEs. Moving forward, we intend to increase the number of PDEs to simulate atmospheric dynamics more comprehensively.
Q3: The baseline comparisons are limited.
We have included additional model comparisons, as shown in the tables below. Specifically, SphericalCNN is a CNN model designed for spherical data, DMNWP utilizes a diffusion model for weather prediction, and EWMoE employs a mixture of experts (MoE) for weather forecasting.
| RMSE z500 | 6h | 3day | 4day | 5day |
|---|---|---|---|---|
| SphericalCNN | 28.40 | 161.1 | 239.9 | 338.4 |
| DMNWP | 52.33 | 272.3 | 360.7 | 466.8 |
| EWMoE | 23.52 | 165.3 | 240.1 | 341.6 |
| WeatherGFT | 22.08 | 152.3 | 225.8 | 315.7 |
| RMSE t850 | 6h | 3day | 4day | 5day |
|---|---|---|---|---|
| SphericalCNN | 0.494 | 1.183 | 1.493 | 1.860 |
| DMNWP | 1.073 | 1.823 | 2.247 | 2.551 |
| EWMoE | 0.513 | 1.259 | 1.593 | 1.865 |
| WeatherGFT | 0.457 | 1.176 | 1.480 | 1.839 |
Q4: Can you further explain Equation 5 and how you formulate the differential and integral operators?
Equation 5 defines the differential and integral operators in the model, where the differential operator is implemented as a convolution kernel. Consider one-dimensional data that gradually increases from left to right by 1 at each position, i.e., its gradient is 1 everywhere. Applying the differencing convolution kernel to this data returns that constant gradient; this is how the convolution kernel determines the data gradient.
The integral operator obtains the integral through matrix multiplication with the matrix $M_x$. Given the matrix $x$ below, the result of $xM_x$ is:

$$x=\begin{bmatrix} 1 & 4\\ 2 & 5\\ 3 & 6 \end{bmatrix},\quad xM_x= \begin{bmatrix} 1 & 4\\ 2 & 5\\ 3 & 6 \end{bmatrix} \begin{bmatrix} 1 & 1\\ 0 & 1 \end{bmatrix}=\begin{bmatrix} 1 & 1+4\\ 2 & 2+5\\ 3 & 3+6 \end{bmatrix}$$

Q5: How does the convolution layer work to align features?
The output tensor of the PDE kernel has a different size from the output of the attention block. The PDE kernel output has size $[8,5,13,32,64]$, where the dimensions represent *batch size, (z, q, u, v, t), pressure levels, H-patch, W-patch*. The attention output has size $[8,1024,32,64]$, where the dimensions represent *batch size, embedding dim, H-patch, W-patch*. We first reshape the PDE kernel output to $[8,5\times 13,32,64]$. Then, through a convolution layer `Conv(in_channel=65, out_channel=1024)`, the tensor is converted to $[8,1024,32,64]$, aligning with the output from the attention block.
Q6: How do you update the learnable factor $r$ and compute the weight to draw Figure 1?
Below is the PyTorch code for initializing and using the learnable factor $r$:

```python
import torch
import torch.nn as nn

# dim is the embedding dimension (1024 in our configuration)
r = nn.Parameter(torch.zeros(1, 1, 1, dim), requires_grad=True)

def router(physics_features, ai_features):
    # Features size: [B, H, W, dim]
    r1 = 0.5 * torch.ones_like(physics_features) + r
    r2 = 0.5 * torch.ones_like(ai_features) - r
    mixed_features = r1 * physics_features + r2 * ai_features
    return mixed_features
```

Through PyTorch's automatic differentiation, $r$ is updated during the backward pass. After model training is completed, we first average the $r$ of each HybridBlock to obtain 24 values corresponding to the 24 blocks. Then we divide them into 4 groups, namely:

```python
# Block: 1,     2,     3,     4,     5,     ..., 24
# Time:  00:15, 00:30, 00:45, 01:00, 01:15, ..., 06:00
[1, 5, 9, 13, 17, 21]   # 15 min
[2, 6, 10, 14, 18, 22]  # 30 min
[3, 7, 11, 15, 19, 23]  # 45 min
[4, 8, 12, 16, 20, 24]  # 60 min
```

Finally, we average the $r$ of each group and draw Figure 1.
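The operator examples in Q4 can be checked numerically; the sketch below uses plain Python, and the `[-1, 1]` differencing kernel is an illustrative assumption rather than the paper's exact kernel:

```python
def matmul(a, b):
    """Plain-Python matrix multiply for a quick sanity check."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Integral via the upper-triangular matrix M_x: column j of the result
# accumulates input columns 0..j.
x = [[1, 4], [2, 5], [3, 6]]
M_x = [[1, 1], [0, 1]]
integrated = matmul(x, M_x)  # [[1, 5], [2, 7], [3, 9]]

# Differencing kernel [-1, 1] (illustrative): data increasing by 1 per
# step has a constant gradient of 1.
data = [1, 2, 3, 4, 5]
grad = [data[i + 1] - data[i] for i in range(len(data) - 1)]  # [1, 1, 1, 1]
```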
Q7: Does tp have a related PDE? How do you deal with variables that are not related to any PDE?
There is no PDE specifically for precipitation, since precipitation is considered a diagnostic variable in atmospheric dynamics, meaning it can be derived from other atmospheric variables. For these diagnostic variables, we use neural networks to predict their values. While these variables may not directly influence the PDE kernel, their information is fused through the router to accomplish implicit modeling.
Q8: How fine-grained can the model achieve? Doing 30-minute forecasts with hourly data is impressive, but can the model achieve finer resolutions like 15 minutes?
As given in Figure 2 and Section 3.6, each PDE kernel simulates atmospheric dynamics within 300 seconds, allowing our model to make predictions at various time intervals by stacking these kernels. Theoretically, the smallest time scale our model can forecast is 15 minutes, because one attention block is composed of 3 PDE kernels, equating to 3x300s = 15 min. Presently, we are unable to evaluate forecasts at scales finer than 30 min due to the lack of ground truth at 15 min. Nevertheless, the generalization to 30-min predictions showcased in the paper has already proven the efficacy of our physics-hybrid modeling approach.
If our responses have clarified your inquiries, we kindly ask for an update to your score. Thank you very much for your time and feedback.
Thanks for the clarification and additional results. They addressed my concerns. I've raised my score.
Thank you immensely for the update. We genuinely value your kind words and highly constructive feedback!
Dear Reviewers,
We thank all reviewers for their efforts in reviewing our submission and their recognition of our work, e.g., 'fuse AI and PDE is innovative' and 'demonstrate generalization capabilities' from Reviewer CYo5, 'the paper does a good job address both of these existing problems' from Reviewer oEsH, and 'crucial and novel for weather forecast' from Reviewer jiN5.
In the following, we offer general responses to the common questions and concerns raised by the reviewers. More detailed responses to each specific comment can be found in our rebuttal to each review.
- In response to questions from Reviewer oEsH and Reviewer jiN5 about the significance of our model's performance improvement, we included the Wilcoxon test to demonstrate that our model's enhancement is statistically significant at a 95% confidence level (p-value < 0.05).
- Reviewer CYo5 has expressed concerns regarding the adequacy of our explanation of integral and differential operators, while Reviewer oEsH and Reviewer jiN5 have raised questions about certain experimental details. Owing to the constraints of the paper's page limit, our coverage of these specifics in the main text is incomplete. We have responded to these queries in each rebuttal and will incorporate them into the appendix for the upcoming version. Thanks for your suggestions.
- Reviewer oEsH has noted similarities between some of the methods in our model and those found in previous papers. Thanks for your reminder, we will include references to relevant work where appropriate. However, we contend that although these works may share some similarities, they diverge in both motivations and methods. Our research emphasizes achieving generalization on a smaller scale through physical modeling, contrasting with studies that leverage AI techniques to enhance the forecasting capabilities of a physical dynamic system.
In the global response PDF, we present visualizations of five new results:
- Bias: Bias indicates the disparity between the model's predictions and the ground truth. Negative bias indicates underestimation, a prevalent issue in forecasting models. Although the PDE kernel was not specifically designed to address bias underestimation, experimental results indicate that its usage helps ameliorate underestimation.
- Energy: This assesses the energy changes in the model's predictions. The experiments reveal that employing the PDE kernel aids in energy preservation.
- Comparison of Time Consumption: Introducing the PDE kernel slightly increases training time, but the added computational cost is acceptable.
- RMSE Error Bars: Error bars of RMSE values for different variables across various lead times.
- Router Weights and Feature Norm Changes: This figure complements Figure 7 in the paper. It illustrates that the physical and AI features are on a comparable scale, with the router dynamically selecting the more effective aspects from each. The router's weight adjustments do not impact the outputs of the AI or physical branches, highlighting the router's decoupling characteristic.
We have provided tailored responses to each reviewer's queries. Should you have further questions, we are eager to engage in discussion and address them promptly. If our responses have clarified your inquiries, we kindly ask for an update to your score. Thank you very much for your time and feedback.
Sincerely,
Authors
Dear Reviewers,
The authors have provided comprehensive rebuttals and tried to address the concerns raised in your reviews. Please take the time to review their responses carefully. If you have any further questions or require additional clarification, please engage in a discussion with the authors. Thank you for your continued efforts.
AC
This paper introduces WeatherGFT, a physics-AI hybrid model designed to extend weather forecasts to finer-grained temporal scales beyond the limits of the training dataset. The proposed model demonstrates state-of-the-art performance across multiple lead times and shows a notable ability to generalize to 30-minute forecasts, addressing a significant challenge in the field.
The authors have provided additional results during the rebuttal phase, including statistical tests and visualizations, which further support the robustness of their approach. Following the rebuttal, the paper received three positive scores (7, 7, 5), reflecting the overall strength of the contributions. Based on the reviewers' assessments and the authors' responses, this paper offers a valuable contribution to the field of weather forecasting.
This paper claims state-of-the-art performance for medium-range weather forecasting, which is manifestly untrue and misrepresents the state of the art.
The WeatherBench2 paper and website lists a number of AI models with significantly higher skill than ECMWF IFS and WeatherGFT, including GraphCast, FuXi and NeuralGCM. Over a dozen AI models with such performance have been reported in the literature.
The authors cannot credibly claim ignorance of this prior work:
- NeuralGCM is cited here, although it is incorrectly reported to be "primarily designed for medium-range forecasting" and the fact that it makes sub-hourly predictions is not mentioned.
- Three of the authors of this paper were also authors of FengWu, another model that reported clearly significantly higher skill than any of the baselines from this paper.
First of all, we greatly value your attention to our work and appreciate you pointing out the relevant research progress. At the stage of our research (up to April 2024), we did notice some work on physics-AI hybrid modeling, including NeuralGCM (published on arXiv in November 2023) and ClimODE (published on OpenReview in January 2024). However, at the time of our submission, only ClimODE was fully open-source and allowed free resolution adjustments, enabling reproduction and comparison under the same experimental settings. Therefore, we used it as a benchmark for low-resolution weather forecasting tasks.
Regarding the NeuralGCM you mentioned, we cited this work in our paper, but prior to our submission, the GitHub repository for this model did not provide complete training code, making it difficult for us to train the model and compare under the same experimental settings.
The core innovation of our paper lies in enhancing the generalization capability of AI models in fine-grained predictions by introducing physical processes (PDE kernels). As stated in the paper, our model demonstrates good performance in multi-temporal scale generalization, especially in situations with limited training samples, such as 30-min predictions.
Moreover, as a work primarily focused on methodological innovation, we only conducted our validations on low spatial resolution (1.4 degrees) data to demonstrate the effectiveness of our method, without directly optimizing performance on high-resolution (0.25 degrees) data (for example, using relatively complex fine-tuning strategies to reduce multi-step prediction errors).
Thank you for your important points; we will take them seriously and look forward to reflecting the advancements in this field more comprehensively in our future research. We hope our response enhances the understanding and recognition of our work.