We thank the reviewer for their positive reception of our work and for “strongly” supporting its publication at NeurIPS.

W1: That is an excellent point. We did not try running our method for 100 years. While our current simulations cover 10-year segments due to reference data limitations, we recognize the importance of longer-term stability. We are actively working on 100-year inference runs and aim to provide preliminary results during the rebuttal period, with a full analysis to follow in the revised manuscript.

W2: Thank you for the suggestion. We are eager to improve the estimate of the “noise floor”. For consistency with ACE, we follow their approach and complement it with an estimate of the “ensemble noise floor”. We will include a discussion of alternative methods for estimating the “noise floor” in our revised manuscript, acknowledging that the estimate could be improved with approaches based on the central limit theorem.

W3: We appreciate your suggestion. Our choice for the CRPS to measure the quality of the ensembles of time-means was mostly due to its popularity in the time series and, especially, weather forecasting literature. We agree that better approaches exist (as also discussed with reviewer u1ux) and will note so in our revised Fig. 8 and corresponding text in the Appendix.

Q1: Thank you for the great suggestion. We have created videos of two random 10-year sample trajectories generated by Spherical DYffusion and shared them with the AC (since we are not allowed to directly share links here)*. For the analyzed near-surface wind speed variable (a derived variable from the meridional and zonal wind predictions), we see promisingly realistic outputs compared to the validation climate model simulation. We would be happy to include these visualizations in the final paper.

*We have also included the corresponding snapshots as a set of images in our rebuttal PDF (in case the AC is not able to share the videos with you for any reason). If possible, we would encourage you to look at the videos.

Q2: We appreciate this feedback. We will remove the correlation computation from the figure.

Q3: Great points. We use the basic formulation of the CRPS as implemented in the Python packages properscoring and xskillscore. We will make sure to clearly state this in our updated draft. Similarly, we did not include the correction factor when computing the spread-skill ratio. Note that for the 25-member ensemble, this correction factor is just 1.0198. We will fix this in our revised draft.
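For concreteness, here is a minimal NumPy sketch of the two quantities discussed in Q3. It is an illustration under stated assumptions, not our evaluation code: the function names are hypothetical (they are not the properscoring or xskillscore APIs), the CRPS is written in its standard empirical-ensemble form, and we assume the correction factor takes the commonly used form sqrt((M+1)/M) for an M-member ensemble, which reproduces the value 1.0198 quoted above for M = 25.

```python
import numpy as np


def crps_ensemble_basic(obs, ens):
    """Basic (non-"fair") ensemble CRPS for a single scalar observation.

    Hypothetical helper for illustration: CRPS = E|X - y| - 0.5 * E|X - X'|,
    with both expectations taken over the empirical ensemble distribution.
    """
    ens = np.asarray(ens, dtype=float)
    # Mean absolute error of the members against the observation.
    skill = np.mean(np.abs(ens - obs))
    # Mean absolute difference over all member pairs (including i == j).
    spread = np.mean(np.abs(ens[:, None] - ens[None, :]))
    return skill - 0.5 * spread


def spread_skill_correction(n_members):
    """Assumed finite-ensemble correction factor sqrt((M + 1) / M)."""
    return np.sqrt((n_members + 1) / n_members)


if __name__ == "__main__":
    # Two-member toy ensemble around an observation of 1.0.
    print(crps_ensemble_basic(1.0, [0.0, 2.0]))
    # Correction factor for a 25-member ensemble.
    print(round(float(spread_skill_correction(25)), 4))
```

Because the factor approaches 1 as M grows, omitting it for a 25-member ensemble changes the spread-skill ratio by roughly 2%.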

Q4: This is a great point! The training horizon is indeed an important hyperparameter. In our work, we briefly experimented with other horizons (3 and 9) at the initial stage of our project, but then decided to stick with our current choice for the following reasons. We believe that it strikes a balance between being too small and too large. When the horizon is too small (e.g. 3), it immediately reduces the number of sampling steps for DYffusion and our method, since the reverse sampling process directly corresponds to the time steps, which can lead to subpar performance. If it is too large, we run the risk that predicting from early time steps is too challenging for the forecasting model. Additionally, the DYffusion paper used a similar horizon for their sea surface temperature forecasting experiments, which is the data that is most similar to ours. We believe that other values close to our choice would probably work fine too, but we did not have the compute to run ablations to support this (every new choice of horizon would require re-training two neural networks sequentially). We will add a discussion of this choice to our revised paper, acknowledging its importance and the rationale behind our selection.

Q5: Thank you for the reasonable suggestion; we will adapt our draft to reflect it.