PaperHub
NeurIPS 2024 · Poster
Overall rating: 6.3/10 (3 reviewers; ratings 8, 5, 6; min 5, max 8, std 1.2)
Confidence: 4.3 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 3.0

Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone.

Abstract

Keywords
deep learning · transformers · weather forecasting · climate modeling · AI for climate

Reviews and Discussion

Review
Rating: 8

This paper introduces a deep learning weather prediction model called Stormer. Stormer is a vision transformer-type network that employs various techniques to improve performance in weather forecasting applications, including a "weather-specific embedding" that first processes each variable separately, and a technique for producing multiple weather scenarios for each input by exploiting the model's ability to be conditioned on various lead times. By using a lower spatial resolution than competing models, Stormer is much faster at training and inference than current state-of-the-art models, while still achieving competitive performance at short-to-medium time scales and superior performance at longer time scales.

Strengths

The paper shows that high-performance weather prediction DL models do not necessarily need high resolution, and can thereby save a large amount of computing power while still achieving good performance. The inclusion of adaptive layer normalization to enable variable lead times within a single model is also an interesting development. Furthermore, I find the strategy of averaging multiple forecasts via variable time stepping an interesting technique.

The paper is very well written and the methodology was easy to understand from the description.

Weaknesses

I do not see serious weaknesses in the paper. However, one point that could be improved concerns the randomized forecast strategy introduced in the paper. PanguWeather is also able to make weather forecasts with variable lead times (although it achieves this with separately trained models). If I'm not mistaken, PanguWeather could then also be used to implement the randomized forecast strategy. It would be interesting to see this included in the model comparisons.

Questions

How do you expect that the resolution difference between the Stormer and GraphCast/PanguWeather models affects the comparisons? Are you computing the losses at the native resolution of Stormer, downsampling the other models' outputs to it? Do you expect that it would change the results if you instead performed the comparison at the native resolution of GraphCast/PanguWeather and upsampled the Stormer results to it?

Limitations

The authors do briefly discuss the limitations of their approach and possible future directions. However, I think one point that gets glossed over is that, due to its low resolution, Stormer is unable to resolve weather features that models such as PanguWeather and GraphCast can resolve. Thus the performance gain of lower resolution comes at the cost of a worse ability to resolve more localized weather features. It's also not clear how this impacts the model's representation of weather extremes, which are often found at small scales.

Author Response

We thank the reviewer for the very detailed and constructive feedback, and for recognizing the technical contributions and good presentation of Stormer. We answer each of the reviewer's concerns below.

PanguWeather could then also be used to implement the randomized forecast strategy. It would be interesting to see this included in the model comparisons.

We agree that Pangu-Weather is capable of performing the randomized forecast strategy, at the cost of training 3 separate models. We plan to include this experiment in the updated paper.

As a fairer comparison to Pangu, we have compared the non-ensemble version of Stormer with the baselines. Specifically, we performed the Pangu-style inference, where we only used the 24-hour interval forecasts to roll out into the future (i.e., 1-day=24, 2-day=24+24, 3-day=24+24+24, etc.), instead of combining different intervals.

Figure 1 in our PDF shows that non-ensemble Stormer outperforms Pangu and performs competitively with GraphCast. Moreover, we note that the ensembling technique in Stormer is much cheaper and easier to use than other methods such as training multiple networks, dropout, or IC perturbations, as we only have to train a single neural network and do not need extensive hyperparameter tuning. For better efficiency, one can always use the Homogeneous version of Stormer for inference, which only requires 3 forward passes and performs competitively with the Best m in n version, as shown in Figure 3 of our PDF.

How do you expect that the resolution difference between the Stormer and GraphCast/PanguWeather models affects the comparisons? Are you computing the losses at the native resolution of Stormer, downsampling the other models' outputs to it? Do you expect that it would change the results if you instead performed the comparison at the native resolution of GraphCast/PanguWeather and upsampled the Stormer results to it?

We downsampled the forecasts of GraphCast and Pangu-Weather to 1.40625° and compared all methods at this resolution. This is the same strategy used in WeatherBench 2 to compare models at different resolutions. We expect the comparisons to change if we instead upsample the Stormer forecasts to 0.25°. While this is technically possible, we (and WeatherBench 2) opted not to do so because upsampling introduces additional error due to the complexity of extrapolating information from lower to higher resolution. In contrast, downsampling is straightforward, effectively averaging neighboring pixels without significant loss of information.
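
As a rough illustration (not the exact regridding used by WeatherBench 2, which applies proper conservative regridding), a minimal block-averaging sketch; the grid sizes and the `block_downsample` helper are hypothetical:

```python
# Hypothetical sketch: coarsening a 0.25 deg field (e.g. 721x1440) toward a
# 1.40625 deg grid (128x256) by averaging non-overlapping blocks of pixels.
# Real evaluation pipelines use conservative regridding; this only illustrates
# why downsampling is more benign than upsampling.
import numpy as np

def block_downsample(field: np.ndarray, out_h: int = 128, out_w: int = 256) -> np.ndarray:
    """Average non-overlapping blocks of a (H, W) field down to (out_h, out_w)."""
    h, w = field.shape
    h_crop, w_crop = h - h % out_h, w - w % out_w  # crop so blocks divide evenly
    f = field[:h_crop, :w_crop]
    return f.reshape(out_h, h_crop // out_h, out_w, w_crop // out_w).mean(axis=(1, 3))

coarse = block_downsample(np.random.rand(721, 1440))  # e.g. a Z500 forecast field
print(coarse.shape)  # (128, 256)
```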

Comment

Thanks for the response, this clarifies my comments. I'll keep the original review score.

Comment

We thank the reviewer again for the constructive review and feedback. We will make sure to include these discussions in the paper.

Review
Rating: 5

The paper proposes a transformer-based model for weather prediction. Experiments show improvement in downstream predictions.

Strengths

  • The presentation of the paper is clean, and the paper is easy to read and understand.
  • Some improvements in long-term weather forecasting.

Weaknesses

  • The paper has very limited novelty and is incremental. The transformer-based architecture is the same as ClimaX and other transformer-based methods; the variable embeddings are similar to what ClimaX proposes, the multi-step fine-tuning is the same as in FourCastNet, and the pressure-weighted loss comes from GraphCast.
  • The paper misses out on a ton of related works: ClimODE (https://arxiv.org/abs/2404.10024), NeuralGCM (https://arxiv.org/abs/2311.07222), GenCast (https://arxiv.org/abs/2312.15796), etc. all seem to be missing, as are comparisons to them.
  • Recent works such as ClimODE and NeuralGCM have shown that continuous-time methods with physical inductive biases and hybrid ML modeling can surpass transformer-based methods, while providing uncertainty and interpretability. The proposed method offers no benefits over any of them.
  • Although the authors utilize a randomized dynamics forecasting mechanism to stabilize predictions, they still restrict the whole method to output only point estimates rather than giving uncertainties.

Questions

  • What is the computational complexity of training and inference? Are there any ablation studies regarding model compactness and inference time?
  • How do you account for boundary conditions? Do the variable and patch embeddings respect boundary conditions?
  • Can the model accommodate different lead-time resolution query points (t = 1hr, 19hr, 91hr, etc.) that differ from the dataset, showcasing its generalizability and applicability in modeling weather?
  • Does the method use a standard 2D ViT or a 3D transformer as the Stormer block?

Limitations

See weaknesses and questions.

Comment

What is the computational complexity of training and inference?

During training, the complexity is the same as training a single-interval model, because we train a single model for the same amount of time. In other words, there is no computational overhead from randomized iterative forecasting during training. During inference, since we average multiple forecasts with different interval combinations, the complexity scales linearly with the number of combinations. If computation is a critical issue, one should use the homogeneous inference of Stormer, which only uses 3 homogeneous combinations while achieving results competitive with the Best m in n inference, as Figure 3 in our PDF shows.
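
As an illustration of the homogeneous inference described above, a minimal rollout-and-average sketch; `model(x, dt=...)` is a stand-in for a lead-time-conditioned forward pass, not Stormer's actual API:

```python
# Hypothetical sketch: roll the same model out with a fixed interval (6, 12,
# or 24 h) until the target lead time, then average the resulting forecasts.
import torch

def rollout(model, x: torch.Tensor, interval: int, lead_time: int) -> torch.Tensor:
    """Iteratively apply an interval-conditioned model until lead_time is reached."""
    assert lead_time % interval == 0
    for _ in range(lead_time // interval):
        x = model(x, dt=interval)
    return x

def homogeneous_forecast(model, x: torch.Tensor, lead_time: int) -> torch.Tensor:
    # Up to 3 homogeneous ensemble members, one per training interval.
    members = [rollout(model, x, dt, lead_time)
               for dt in (6, 12, 24) if lead_time % dt == 0]
    return torch.stack(members).mean(dim=0)
```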

How do you account for boundary conditions? Do the variable and patch embeddings respect boundary conditions?

The model learns everything from data, and we do not enforce any special constraints on it. This is similar to most deep learning methods such as Pangu-Weather, GraphCast, etc.

Can the model accommodate different lead-time resolution query points (t = 1hr, 19hr, 91hr, etc.) that differ from the dataset

If we train the model on the 1-hour interval, it can produce forecasts at any lead time that is a multiple of 1 hour. However, due to the huge size of 1-hourly ERA5, we subsampled the data to 6-hourly only, so the model can make forecasts at lead times that are multiples of 6 hours. We do not expect a model to generalize well to an interval unseen during training.

Does the method use a standard 2D ViT or a 3D transformer as the Stormer block?

Stormer uses a standard 2D transformer backbone. The model relies on the weather-specific embedding module to aggregate information across different variables and pressure levels.
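
As an illustrative sketch of such a variable-aggregating embedding (shapes and module names are hypothetical, inspired by the cross-attention aggregation in ClimaX rather than taken from Stormer's code):

```python
# Hypothetical sketch: patch-embed each variable separately, then let a learned
# query cross-attend over the variable tokens at each spatial patch, yielding a
# single token per patch for the transformer backbone.
import torch
import torch.nn as nn

class VariableAggregationEmbed(nn.Module):
    def __init__(self, n_vars: int, patch: int = 2, dim: int = 1024):
        super().__init__()
        self.patch_embeds = nn.ModuleList(  # one embedding per input variable
            [nn.Conv2d(1, dim, kernel_size=patch, stride=patch) for _ in range(n_vars)])
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=16, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, V, H, W)
        tokens = torch.stack(
            [emb(x[:, v:v + 1]).flatten(2).transpose(1, 2)  # (B, L, D) per variable
             for v, emb in enumerate(self.patch_embeds)], dim=2)  # (B, L, V, D)
        b, l, v, d = tokens.shape
        kv = tokens.reshape(b * l, v, d)
        q = self.query.expand(b * l, 1, d)
        out, _ = self.attn(q, kv, kv)  # aggregate across variables per patch
        return out.reshape(b, l, d)
```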

Comment

Thank you again for your review. We made significant efforts to address the reviewer's concerns and sincerely hope that our responses have adequately addressed the concerns you previously raised. Since the discussion period is short and drawing to a close soon, are there further questions or concerns we should discuss? We understand that you are busy, and we truly value the time and effort you put into this process. Thank you in advance for your continued support.

Comment

We thank you again for your constructive review and feedback. We sincerely hope the reviewer has had time to read our rebuttal and additional experiments, which we believe have addressed and answered the reviewer's concerns and questions. As the discussion ends today, please let us know any further questions or concerns you would like to discuss. We understand that you are busy, and we truly value the time and effort you put into this process.

Comment

Thanks for the detailed response and clarifications. I am adjusting my score. However, the method still looks like a minor adaptation of ClimaX with an adaptive layer norm.

Comment

We thank the reviewer for appreciating our efforts and raising the score. We strongly agree with the reviewer that in terms of architecture, ClimaX and Stormer are very similar. However, we would also like to emphasize that this is the exact message we wanted to convey in this paper: with a carefully designed training recipe, we can significantly boost the performance of an existing architecture for weather forecasting. We hope that the experimental results and ablation studies in this paper will shed light on the impact of different design choices for future work in the field.

Author Response

We thank the reviewer for the constructive feedback and the appreciation of the good presentation and good performance of Stormer. We answer each of the reviewer's concerns below.

The paper has very limited novelty and is incremental.

We acknowledge that some components in Stormer are similar to those of prior works, and we discussed the relation to them in the paper: Lines 204-205 discuss the variable embedding in ClimaX, Line 298 mentions the pressure-weighted loss in GraphCast, and Lines 100-101 mention multi-step finetuning in previous works. We will update the paper to bring this discussion into the methodology section for better clarity.

However, we would like to emphasize the differences and contributions of Stormer:

  • Stormer has a similar architecture to ClimaX, but we use adaptive layer norm for time-conditioning, which we show in Figure 3c is important to performance. Even with a fairly similar architecture, Stormer significantly outperforms ClimaX, showing the superiority of randomized iterative forecasting over continuous pretraining + direct finetuning.
  • We introduce randomized iterative forecasting, a paradigm not explored in previous works. This allows any architecture (GNN, transformers, etc.) to gain performance improvements w.r.t deterministic metrics with only minor computation overhead (during inference only) compared to single-interval models.
  • Unlike previous works, we carefully ablate each component in Stormer to understand its importance in obtaining good forecast performance. Via this paper, we show that a specialized neural network architecture may not be necessary, and that a standard model like the transformer can achieve state-of-the-art performance with careful design choices. We believe this extends the current understanding of data-driven weather forecasting and is a valuable contribution to the community.

The paper misses out on a ton of related works

Thank you for pointing us to these works; we will discuss them in detail in the updated paper. Both GenCast and NeuralGCM are concurrent works and used much larger compute resources to run on higher-resolution data. Moreover, our goal in this paper is not to show SOTA (even though we could potentially achieve better performance with more compute and higher-resolution data), but rather to show the power of simple and scalable approaches based on the transformer architecture, and what contributes to the performance of an ML weather forecasting model.

Figure 4 in our PDF compares Stormer with NeuralGCM and ClimODE. We additionally include Stormer (IC noise), a version of Stormer with initial-condition perturbations. Specifically, for each combination in the “Best 32 in 128” strategy, we sample 4 different noise vectors from a Gaussian distribution with a standard deviation of 0.1 and add them to the input, resulting in a total of 128 ensemble members. The figure shows that Stormer outperforms deterministic NeuralGCM and performs competitively with NeuralGCM ENS. With IC perturbations, the gap between Stormer and NeuralGCM is negligible. ClimODE performs significantly worse than the other methods. While ClimODE may improve by training on higher-resolution data, we do not believe it can close this huge gap. We will add this comparison to the updated paper.

Recent works such as ClimODE and NeuralGCM have shown that continuous-time methods with physical inductive biases and hybrid ML modeling can surpass transformer-based methods, while providing uncertainty and interpretability. The proposed method offers no benefits over any of them.

We respectfully disagree. NeuralGCM, ClimODE, and Stormer are different approaches and each has its pros and cons:

  • NeuralGCM is a hybrid model that combines a differentiable dynamical core with ML components for end-to-end training. While the dynamical core allows the method to leverage powerful general circulation models, it also has various drawbacks. First, to make predictions, the dynamical core in NeuralGCM has to solve discretized dynamical equations, which is more computationally expensive than a forward pass of a neural network. Second, the performance of NeuralGCM is upper-bounded by the accuracy of the fixed dynamical core, while fully learnable models like Stormer continue improving with more data. This is a desirable property, given the scaling properties we have seen in Stormer and ClimaX.
  • ClimODE introduces physical inductive biases into deep learning models to improve the interpretability of their method. However, in terms of forecasting accuracy, as Figure 4 in our PDF shows, ClimODE is quite inferior to Stormer or other state-of-the-art methods.

They still restrict the whole method to output only point estimates rather than giving uncertainties.

This work focuses on deterministic forecasting and how to achieve state-of-the-art performance on deterministic metrics with a simple but scalable architecture like the transformer. However, we note that we can make Stormer a probabilistic model with IC perturbations. Figure 2 in our PDF shows the probabilistic performance of Stormer with different noise levels (standard deviations of the Gaussian noise distribution). IC perturbations significantly improve the CRPS and SSR metrics of Stormer as well as deterministic performance at long lead times, but may hurt accuracy at short lead times. Moreover, it is difficult to find an optimal noise level for the spread-skill ratio across different variables and lead times. We can further improve this by using a better noise distribution or variable-dependent and lead-time-dependent noise scheduling, which we defer to future work.

Review
Rating: 6

The authors introduce a vision transformer-based method, Stormer, designed for medium-range weather forecasting. Ablations identify multiple important components of the method, including a weather-specific patch embedding, "randomized" dynamics forecasting, and a pressure-weighted loss. The randomized forecasting component is similar to prior continuous models but is complemented with iterative rollouts and an ensemble technique, "best m in n", that exploits the multiple discretizations available for iteratively forecasting a specific lead time. The experiments show competitive results in terms of RMSE and ACC against deep learning and physics-based baselines.

Strengths

  • Tackles an important problem and achieves strong results on a popular weather forecasting benchmark, WeatherBench 2.
  • The weather-specific patch embedding and randomized iterative ensemble forecasting are interesting methodological contributions.
  • Careful ablation of key design choices is insightful and valuable.
  • Clearly written and easy to follow; the relatively simple approach is valuable for the community, especially if coupled with a good code release.

********************* After rebuttal: Raising score from 5 to 6.

Weaknesses

  • The evaluation is somewhat unfair.

The paper is essentially using an ensembling technique but comparing against deterministic models. It is well known that ensembling improves ensemble-mean RMSE and ACC scores, especially for longer-range horizons (which is where Stormer, unsurprisingly, shines the most against the non-ensemble baselines). This can actually be clearly observed with the physics-based baselines IFS HRES (deterministic) and IFS ENS (ensemble mean) in Figures 8 and 9, where the ensemble mean shines the most on long-range horizons. Thus, a fairer comparison would be to either 1) ensemble the deterministic ML-based baselines (e.g. through input perturbations or with lagged ensembles as in [1]) or 2) show results of Stormer without ensembling (even a "m=1 in n" forecast would be much fairer than the way it is now).

Additionally, the proper way to evaluate an ensemble weather forecast is via probabilistic metrics such as the CRPS and spread-skill ratios. This should be included to properly assess the quality of the homogeneous or "best m in n" ensembles. This probabilistic evaluation is actually supported by WeatherBench 2, so it should be easy for the authors to extend their current evaluation with probabilistic metrics (plus this will give you the IFS ENS baseline up to 15 days ahead for free...).

  • On top of probabilistic metrics, it would be instructive to see an analysis of the generated spectra of Stormer.
  • The following is wrong: "it is unclear (...) how critical the multi-mesh message passing in GraphCast is to its performance". The authors seem to have missed section "7.3.1. Multi-mesh ablation" in the GraphCast paper.
  • I would like to see a more transparent discussion of the exact contributions of this work and a more careful contextualization with prior work. For example, in section 3.1 it would be good to be more candid that 1) the objective is exactly the same as for a continuous forecasting model; the only difference between the two seems to be at inference time (and the different lead times used for training); 2) the overall train+inference method is essentially the same (neural architectures aside) as for Pangu-Weather (PW), but with Stormer training one model for all lead times while PW trains one per lead time. Noting the pros and cons of each would be useful too. Even more useful would be a crisp ablation for Stormer, where you drop the time-conditioning and train 3 separate models, one for each lead time (6, 12, 24 hours). This would control for the differences in architecture between Stormer and PW, giving valuable insights into the pros and cons of each; 3) the pressure-weighted loss is taken from GraphCast, but this is not properly referenced in the manuscript (it is not mentioned in section 3.1.1, which introduces it, but only much later); 4) same for multi-step fine-tuning, which is very common in prior work.
  • Relatedly, I think that the paper could benefit from discussing the related work more in-depth. For example, I seem to have missed any discussion that carefully compares Stormer with other (vision) transformer-based weather forecasting methods such as Pangu-Weather, ClimaX, FengWu, etc.
  • Please include results on other variables than just T2M for your ablations. Different variables might behave quite differently... Also, this is a particularly unfortunate choice for Fig. 5b: the fact that the pressure-weighted loss improves T2M RMSE is fairly trivial, since the pressure-weighted loss very strongly upweights the influence of T2M on the total loss relative to other variables.
  • Line 292: "We note that Stormer achieves this improvement with no computational overhead compared to the single-interval models, as the different models share the same architecture and were trained for the same duration.". Is this true? To me it seems that at inference time there does exist a computational overhead due to the larger size of the ensemble...

[1] A Practical Probabilistic Benchmark for AI Weather Models (https://arxiv.org/abs/2401.15305)

Questions

  • It would be very insightful to see an ablation of the different prediction approaches in weather forecasting. That is, continuous, iterative, and randomized iterative forecasting. Exploring their (dis-)advantages would be beneficial and very interesting. This would also tie in well with the paper's desire to carefully ablate important components of the design stack.
  • Can you expand on how the "best m" combinations are chosen? By combinations you mean e.g. (3h+6h=9h and 6h+3h=9h, but not 9×1h)? Are you choosing them on the validation set? How do you choose the "n" combinations that you validate? Also, some lead times won't have as many combinations (e.g. 6h only has one, 12h two, etc.), right? That would seem worth mentioning as a potential limitation of this ensembling technique.
  • Why does the time embedding module predict the second scale parameter? Why no second shift? Did you ablate this choice? What if you remove this second time embedding part (or the first)?
  • For completeness, consider specifying the parameter sizes of your models in section 4.3 explicitly. Also, do you discuss the exact hyperparameters used for the smaller model sizes?
  • Is input normalization done per variable or per variable AND pressure level?

Limitations

The authors have not adequately addressed the limitations, mentioning the spatial resolution as the only limitation of their method (but also claiming it to be an advantage in other parts of the paper). A more upfront discussion of (potential) limitations would be welcome. For example, there is no discussion on how the proposed cross-attention-based patch embedding might scale to higher spatial resolutions or how it might impact inference speed.

Author Response

We thank the reviewer for the very detailed and constructive feedback, and for recognizing the strong results and technical contributions of Stormer. We answer each of the reviewer's concerns below.

The paper is essentially using an ensembling technique but comparing against deterministic models.

We would like to clarify that even though Stormer uses an ensembling technique, we consider it a deterministic model, and that’s why we compare it with deterministic baselines. The reason is that while Stormer can produce multiple forecasts for a given lead time at inference, we found these forecasts not diverse enough to be used for uncertainty estimation. This is shown by the under-dispersion of Stormer w.r.t. the spread-skill ratio in Figure 2 of our PDF. We will add this discussion to the updated paper.

Thus, a fairer comparison would be to either 1) ensemble the deterministic ML-based baselines (e.g. through input perturbations or with lagged ensembles as in [1]) or 2) show results of Stormer without ensembling (even a "m=1 in n" forecast would be much fairer than the way it is now).

We agree these comparisons would provide more insights into the performance of Stormer. To do this, we have compared the non-ensemble version of Stormer with the baselines. Specifically, we performed the Pangu-style inference, where we only used the 24-hour interval forecasts to roll out into the future (i.e., 1-day=24, 2-day=24+24, 3-day=24+24+24, etc.), instead of combining different intervals.

Figure 1 in our PDF shows that non-ensemble Stormer outperforms Pangu and performs competitively with GraphCast. Moreover, we note that the ensembling technique in Stormer is much cheaper and easier to use than other methods such as training multiple networks, dropout, or IC perturbations, as we only have to train a single neural network and do not need extensive hyperparameter tuning. For better efficiency, one can always use the Homogeneous version of Stormer for inference, which only requires 3 forward passes and performs competitively with the Best m in n version, as shown in Figure 3 of our PDF. We will add this result and discussion to the updated paper.

Additionally, the proper way to evaluate an ensemble weather forecast is via probabilistic metrics such as the CRPS and spread-skill ratios. This should be included to properly assess the quality of the homogeneous or "best m in n" ensembles.

To make Stormer a probabilistic forecast system, we need to introduce more randomization into the forecasts via IC perturbations. To do this, for each combination of intervals during the Best m in n inference, we added 4 different noise samples drawn from a Gaussian distribution, resulting in a total of 128 ensemble members. Figure 2 in our PDF shows the RMSE and probabilistic metrics of Stormer with different standard deviations of the noise distribution.

The result shows that IC perturbations improve the probabilistic metrics significantly, but may hurt the deterministic performance at short lead times. Moreover, it is difficult to find an optimal noise level for the spread-skill ratio across different variables and lead times. We can further improve this by using a better noise distribution or variable-dependent and lead-time-dependent noise scheduling, which we defer to future work. We will add these results and discussions to the updated paper.
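
As an illustration of this IC-perturbation setup and the spread-skill diagnostic discussed above, a minimal sketch (unweighted metrics; a real evaluation would use latitude weighting):

```python
# Hypothetical sketch: Gaussian IC perturbations on a normalized input, plus
# ensemble-mean RMSE (skill), ensemble spread, and their ratio (SSR). An SSR
# well below 1 indicates the under-dispersion mentioned above.
import numpy as np

def perturb_ic(x: np.ndarray, n: int = 4, std: float = 0.1, seed: int = 0):
    rng = np.random.default_rng(seed)
    return [x + rng.normal(0.0, std, size=x.shape) for _ in range(n)]

def spread_skill(ens: np.ndarray, target: np.ndarray):
    """ens: (M, H, W) ensemble members; target: (H, W) verifying analysis."""
    skill = np.sqrt(np.mean((ens.mean(axis=0) - target) ** 2))  # ensemble-mean RMSE
    spread = np.sqrt(np.mean(ens.var(axis=0, ddof=1)))          # mean ensemble spread
    return skill, spread, spread / skill
```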

On top of probabilistic metrics, it would be instructive to see an analysis of the generated spectra of Stormer.

We thank the reviewer for the suggestion and we will include this in the updated version of the paper. Due to the time constraint, we will defer this experiment to later. If the reviewer thinks this is crucial to assessing the paper, we will conduct this experiment during the discussion phase.
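
For context, the kind of spectral diagnostic the reviewer asks for can be sketched as follows (a simplified zonal power spectrum; real analyses would aggregate over latitudes and use spherical harmonics or proper area weighting):

```python
# Hypothetical sketch: zonal power spectrum of a field along one latitude
# circle. Overly smooth (ensemble-mean-like) forecasts lose power at high
# wavenumbers relative to the verifying analysis.
import numpy as np

def zonal_power_spectrum(row: np.ndarray) -> np.ndarray:
    """row: (n_lon,) field values along one latitude circle."""
    coeffs = np.fft.rfft(row - row.mean())
    return np.abs(coeffs) ** 2 / row.size  # power per zonal wavenumber
```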

The authors seem to have missed section "7.3.1. Multi-mesh ablation" in the GraphCast paper.

We thank the reviewer for pointing this out. Our main message here is questioning whether we need a specialized neural network architecture for weather forecasting, or whether a standard architecture like transformers can work equally well. In this paper, we consider the Transformer architecture as a special reference due to its low inductive bias, good scaling properties, and great performance across different data modalities (text, image, audio, etc.). We will reword this part for better clarity in the updated version.

I would like to see a more transparent discussion of the exact contributions of this work and a more careful contextualization with prior work.

We initially wanted to describe our method first before connecting it with the related work, but we agree with the reviewer that it'd be better and more transparent to mention the prior works as we discuss each component of Stormer. We will update the paper accordingly. We answer the reviewer’s specific concerns as follows: 1) This is correct, but this seemingly small difference can make a huge difference in performance, as shown by the big gap between Stormer and ClimaX in Figure 4 of our PDF; 2) They are similar, but there are some nuanced differences. As the reviewer may have noticed, we train a single model for all intervals, while Pangu trains a separate model for each lead time. During inference, Pangu uses a single combination of intervals for each lead time that minimizes the rollout steps, while we use a combination of them; 3) and 4) We agree with the reviewer and will add this discussion to the appropriate part of the main text.

Comment

Relatedly, I think that the paper could benefit from discussing the related work more in-depth.

In terms of architecture, Stormer and ClimaX both use a standard transformer backbone, and the only difference is Stormer uses adaptive layer norm for time-conditioning while ClimaX uses a simple additive embedding. On the other hand, Pangu-Weather and FengWu use a Swin-transformer backbone.

In terms of model training, ClimaX is pretrained with continuous forecasting but finetuned for direct forecasting. Pangu, FuXi, and FengWu are iterative models but with slightly different designs: Pangu trains a separate model for each lead time, FengWu performs multi-step finetuning with a replay buffer, and FuXi finetunes a separate model for each time range (short, medium, and long). We will include this discussion in the updated version.

Please include results on other variables than just T2M for your ablations.

Table 1 in our PDF shows the performance of the weighted and unweighted loss on 6 different variables, ranging from low to high pressure levels. As expected, the weighted-loss model achieves better accuracy for high-pressure variables while underperforming the unweighted version for low-pressure variables. We were aware of this trade-off and proposed to use the weighted loss in Stormer to focus on variables that are more important to forecasting and/or human activities. We will add these results to the updated paper for better clarity.
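
For reference, a minimal sketch of a GraphCast-style pressure-weighted MSE; the weighting scheme and levels here are illustrative assumptions, not the paper's exact loss:

```python
# Hypothetical sketch: weight each atmospheric channel's squared error by its
# pressure level, so near-surface levels (~1000 hPa) dominate the loss.
import torch

def pressure_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                          levels_hpa: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, C, H, W); levels_hpa: (C,) pressure level per channel."""
    w = levels_hpa / levels_hpa.sum()                         # normalized weights
    per_channel = ((pred - target) ** 2).mean(dim=(0, 2, 3))  # (C,) plain MSE
    return (w * per_channel).sum()

levels = torch.tensor([50., 250., 500., 700., 850., 1000.])
loss = pressure_weighted_mse(torch.randn(2, 6, 8, 16), torch.randn(2, 6, 8, 16), levels)
```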

To me it seems that at inference time there does exist a computational overhead due to the larger size of the ensemble

We wrote this with regard to the training process, which is the major computational overhead of deep learning models. At inference, Stormer does require more computation than single-interval models, and the computation scales with how many combinations of intervals we use during inference. As we showed above, using 3 ensemble members with the homogeneous inference is enough to achieve good performance. We will reword this part to avoid confusion in the updated manuscript.

It would be very insightful to see an ablation of the different prediction approaches in weather forecasting. That is, continuous, iterative, and randomized iterative forecasting.

We expand on the differences between these approaches below:

  • Continuous vs randomized iterative forecasting: The difference can be seen in the performance gap between ClimaX and Stormer. The continuous approach requires a single model to forecast across a wide range of lead times, e.g., 6 hours to 2 weeks, which is a challenging (and sometimes confusing) learning task for any model. In contrast, randomized iterative forecasting only trains the model with a small set of intervals (6, 12, 24), mitigating this problem. Moreover, continuous models do not generalize beyond lead times in the training range, as shown in ClimaX, while randomized iterative does.
  • Iterative vs randomized iterative: The latter offers two advantages, data augmentation and the ability to combine the intervals to produce multiple forecasts. Empirically, the randomized iterative approach achieves significantly better accuracy than the iterative approach, as shown in the performance gap between Stormer and non-ensemble Stormer, while only incurring a slight computational overhead.

Can you expand on how the "best m" combinations are chosen?

By combinations, we mean any ordered combination from the set {6,12,24} that sums to a given lead time. For example, for a lead time of 1 day (24 hours), we can do 6+6+6+6, 6+12+6, 12+12, 12+6+6, 24, etc. We chose the best m based on the validation loss, and we picked n to be a reasonably large value (128) without hyperparameter tuning. Based on our preliminary results, the final performance is not sensitive to this value (we tried 32, 64, 128).

Indeed, there won't be many combinations for very small lead times, but these lead times also do not require ensembling multiple forecasts because individual forecasts are accurate enough. We will discuss this component in more detail in the updated paper.
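
To make the combination space concrete, a minimal sketch enumerating all ordered interval sequences that sum to a given lead time (the validation-loss selection of the "best m" is not shown):

```python
# Hypothetical sketch: all ordered sequences of intervals from {6, 12, 24}
# summing to a lead time; "best m in n" keeps the m best by validation loss.
from functools import lru_cache

@lru_cache(maxsize=None)
def interval_sequences(lead_time: int, intervals: tuple = (6, 12, 24)):
    if lead_time == 0:
        return [()]
    seqs = []
    for dt in intervals:
        if dt <= lead_time:
            seqs += [(dt,) + rest for rest in interval_sequences(lead_time - dt, intervals)]
    return seqs

print(interval_sequences(24))
# [(6, 6, 6, 6), (6, 6, 12), (6, 12, 6), (12, 6, 6), (12, 12), (24,)] -> 6 options
```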

Why does the time embedding module predict the second scale parameter? Why no second shift? Did you ablate this choice? What if you remove this second time embedding part (or the first)?

We did not ablate this choice. We adopted the adaptive layer normalization (adaLN) from the computer vision literature [1, 2], which is a common technique to condition a neural network on additional information like time.

[1] Perez, Ethan, et al. "Film: Visual reasoning with a general conditioning layer." Proceedings of the AAAI conference on artificial intelligence. Vol. 32. No. 1. 2018.

[2] Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
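
For clarity, a minimal sketch of adaLN time-conditioning in the style of DiT [2]; the module layout and names are illustrative, not Stormer's exact implementation (note how the second scale appears as a gate on the residual branch, which is why there is no matching second shift):

```python
# Hypothetical sketch: an MLP maps the lead-time embedding to a shift, a scale
# (applied around LayerNorm), and a second scale that gates the residual branch.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) tokens; t_emb: (B, D) lead-time embedding.
        shift, scale, gate = self.to_mod(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        out, _ = self.attn(h, h, h)
        return x + gate.unsqueeze(1) * out  # second scale gates the residual
```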

Comment

For completeness, consider specifying the parameter sizes of your models in section 4.3 explicitly. Also, do you discuss the exact hyperparameters used for the smaller model sizes?

The total parameter count of Stormer is 400 million. Different model sizes of Stormer correspond to the standard sizes of ViT models in the computer vision literature, defined at https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py. We will add the parameter count and hyperparameters of all models to the updated version.

Is input normalization done per variable or per variable AND pressure level?

Input normalization is done per variable AND pressure level.
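
That is, one (mean, std) pair per channel, where a channel is a (variable, pressure level) pair; a minimal sketch (array shapes are illustrative):

```python
# Hypothetical sketch: z-score normalization with per-channel statistics.
import numpy as np

def normalize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """x: (C, H, W); mean/std: (C,), one statistic per variable-and-level channel."""
    return (x - mean[:, None, None]) / std[:, None, None]
```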

Comment

Thank you again for your review. We made significant efforts to address the reviewer's concerns and sincerely hope that our responses have adequately addressed the concerns you previously raised. Since the discussion period is short and drawing to a close soon, are there further questions or concerns we should discuss? We understand that you are busy, and we truly value the time and effort you put into this process. Thank you in advance for your continued support.

Comment

We thank you again for your constructive review and feedback. We sincerely hope the reviewer has had time to read our rebuttal and additional experiments, which we believe have addressed and answered the reviewer's concerns and questions. As the discussion ends today, please let us know any further questions or concerns you would like to discuss. We understand that you are busy, and we truly value the time and effort you put into this process.

Comment

As a side note, we promise to publicly release all the code and checkpoints to the community.

Comment

I appreciate the authors' comprehensive rebuttal and the additional insightful results provided in the PDF. The response has addressed most of my concerns, leaving only a key issue regarding the comparison of ensembles and non-ensembles (i.e. point predictions) that I touch on below. Based on this, I am raising my score to 6 and recommend accepting the paper. I apologize for the delayed reply, which was due to my being on holiday.

The non-ensemble Stormer results are particularly valuable. This version performs comparably or slightly below GraphCast and surpasses Pangu-Weather, which are impressive outcomes. I strongly recommend including this plot and its associated discussion in the main text.

My primary concern, which I urge you to address in your revised version, is the need for extreme caution when comparing your ensemble version with non-ensemble baselines, especially for setting a correct example for follow-up work. This relates to your statement:

We would like to clarify that even though Stormer uses an ensembling technique, we consider it a deterministic model, and that’s why we compare it with deterministic baselines.

While I understand your perspective, I have two reservations:

  1. Ensembling typically improves results. Comparing point predictions to ensemble-mean results disadvantages the former. Although your ensembling technique is model-specific and interesting, a fair comparison would be against baseline ensembles (e.g., IFS ENS, stochastic models like NeuralGCM, diffusion-based models like GenCast, etc.). You do this in the rebuttal PDF, and I would like to see the same level of rigor across your whole paper.
  2. An ensemble-mean prediction, whether from Stormer or any other ensemble like IFS ENS, likely does not represent a physically realistic state. This is evident in the overly smooth predictions in Figures 12-16 and would likely be apparent in the spectra of these predictions compared to non-ensemble ones.

To address these concerns, I strongly recommend:

  • Clearly distinguishing between ensemble and non-ensemble results throughout the paper.
  • Providing separate comparisons: ensemble Stormer with other ensemble methods (e.g., IFS ENS) and non-ensemble Stormer with non-ensemble baselines. It's fair of course to show them all together in a plot (as in your rebuttal PDF), but I would be cautious about doing direct comparisons between ensemble predictions and point predictions which boil down to "the ensemble-mean is better".

These additions would significantly strengthen the paper and provide valuable context for future research in this field. I trust that the authors can revise their paper to reflect this.

Some minor comments:

  • To highlight differences between models on short-range forecasts (~1-3 days ahead), consider showing relative plots (e.g. relative to HRES for non-ensemble models, and ENS for ensembles).
  • Given the architectural similarity of ClimaX and Stormer, can the underperformance of ClimaX (for weather forecasting) be mostly explained by its training data?

Comment

We thank the reviewer for the constructive and thoughtful feedback. We strongly agree with the reviewer that there is a clear distinction between non-ensemble and ensemble forecasting, and each approach has its pros and cons. We will update the methodology and experiment sections to discuss this in detail as the reviewer suggested. We would also like to answer the remaining comments below.

To highlight differences between models on short-range forecasts (~1-3 days ahead), consider showing relative plots (e.g. relative to HRES for non-ensemble models, and ENS for ensembles).

Thank you for the suggestion. We will include this plot in the revised paper.

Given the architectural similarity of ClimaX and Stormer, can the underperformance of ClimaX (for weather forecasting) be mostly explained by its training data?

We note that the ERA5 data used to train ClimaX and Stormer are very similar, and ClimaX was even pretrained with CMIP6 before being finetuned on ERA5. We believe the difference in performance mainly comes from the training recipe of each method. ClimaX was trained to perform direct forecasting for each lead time, while Stormer performs iterative forecasting with additional multi-step finetuning and a pressure-weighted loss. The empirical results suggest that the latter approach excels in weather forecasting, which agrees with the current literature (Pangu, GraphCast, etc. are all iterative methods). This is also the main message we want to convey in this paper: with a carefully designed training recipe, we can significantly boost the performance of an existing architecture for weather forecasting.

Comment

Thank you for the quick response. I think that the revised paper will be a great addition to the literature. I look forward to reading it.

Author Response

We thank the ACs for handling our paper and the reviewers for their insightful comments and constructive feedback. The suggestions by the reviewers are very helpful and have added significant insights to the paper. We have responded to each review individually, and also submitted a PDF file containing the figures and tables for additional experiments we conducted during the rebuttal. We summarize these experiments and their results here:

  • (Figure 1) Non-ensemble Stormer vs the baselines: We compared the non-ensemble version of Stormer with the baselines. Specifically, we performed the Pangu-style inference, where we only used the 24-hour interval forecasts to roll out into the future (i.e., 1-day=24, 2-day=24+24, 3-day=24+24+24, etc.), instead of combining different intervals. Figure 1 in our PDF shows that non-ensemble Stormer outperforms Pangu and performs competitively with GraphCast. Moreover, we note that the ensembling technique in Stormer is much cheaper and easier to use than other methods such as training multiple networks, dropout, or IC perturbations, as we only have to train a single neural network and do not need extensive hyperparameter tuning. For better efficiency, one can always use the Homogeneous version of Stormer for inference, which only requires 3 forward passes and performs competitively with the Best m in n version, as shown in Figure 3 of our PDF.
  • (Figure 2) Probabilistic performance of Stormer with IC perturbations: To make Stormer a probabilistic forecast system, we introduce more randomization into the forecasts via IC perturbations. To do this, for each combination of intervals during the Best m in n inference, we added 4 different noise samples drawn from a Gaussian distribution, resulting in a total of 128 ensemble members. Figure 2 in our PDF shows the RMSE and probabilistic metrics of Stormer with different standard deviations of the noise distribution. The result shows that IC perturbations improve the probabilistic metrics significantly, but may hurt the deterministic performance at short lead times. Moreover, it is difficult to find an optimal noise level for the spread-skill ratio across different variables and lead times. We can further improve this by using a better noise distribution or variable-dependent and lead-time-dependent noise scheduling, which we defer to future work.
  • (Figure 3) Homogeneous vs Best m in n inference: Figure 3 compares the two inference strategies we proposed in the paper. Homogeneous performs competitively with Best m in n and only underperforms at 13-day and 14-day lead times, while using fewer ensemble members. We recommend this strategy if efficiency is the priority.
  • (Table 1) Ablation studies with more variables: We showed the performance of Stormer with and without the weighted loss for 6 additional variables. As expected, the weighted loss model achieves better accuracy for high-pressure variables while underperforming the unweighted version for low-pressure variables. We were aware of this trade-off and proposed to use the weighted loss in Stormer to focus on variables that are more important to forecasting and/or human activities.
  • (Figure 4) Comparison of Stormer with additional baselines: We added 4 more baselines -- ClimaX, NeuralGCM, NeuralGCM ENS (mean) and ClimODE. Stormer significantly outperforms ClimaX, ClimODE, and NeuralGCM, while slightly underperforming NeuralGCM ENS (mean). With IC perturbations, the gap between Stormer and NeuralGCM ENS (mean) is negligible.

Final Decision

This paper is about weather forecasting; in particular, the authors propose a new (vision) transformer architecture that has several advantages over the state of the art: a weather-specific patch embedding, randomized dynamics forecasting, and a pressure-weighted loss.

The main contribution of this paper is to show that a simple transformer architecture can obtain state of the art performance with simple settings, in particular with a careful transformer setup that is specific to weather forecasting.

Reviewers were positive, in particular noting that the paper is very well written and understandable, the simple approach, the important problem setting, the ablation results, and the improvements to the state of the art in long-term weather forecasting. During the rebuttal the authors provided additional results requested by the reviewers, in particular comparisons with other baselines and a fairer comparison: since Stormer is basically an ensemble model, they showed that a non-ensemble Stormer still performs very well compared to the state of the art.

To improve the paper, I recommend considering the (many) reviewer comments, in particular clarifying the contributions, adding (conceptual) comparisons of the proposed method to previous methods in the literature, and incorporating the results presented in the rebuttal phase into the main paper or appendix.

This area chair also suggests considering uncertainty estimation, as mentioned by the reviewers, since it is an important component of weather forecasting; as the method is partially an ensemble, simple uncertainty estimates could be extracted from it.

Given the positive contribution to the state of the art and the reviewers' favorable opinions, this paper should be accepted.