RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting
We present a novel deep learning model for global river discharge and flood forecasting
Abstract
Reviews and Discussion
RiverMamba forecasts river discharge up to 7 days into the future based on land surface variables, meteorological forecasts, and static river attributes. The data is serialized with space-filling curves and fed into a Mamba block. Paired with sophisticated encoding and decoding, the method achieves promising results.
Strengths and Weaknesses
Strengths
- Detailed description of the training procedure and model architecture
- Very good results compared to GloFAS (Global Flood Awareness System)
- Some engineering decisions were validated by ablation studies
Weaknesses
- The only machine learning model among the baselines is an LSTM. Space-filling curves might be well-suited for Mamba, but it is not clear whether this is the right approach for an LSTM. (Graph) convolutional networks would be a more interesting baseline.
- Limited reproducibility (no code provided)
Questions
Related work uses GNNs to model the spatio-temporal relationships. Why did you opt for space filling curves combined with Mamba?
Limitations
yes
Justification for Final Rating
See comment
Formatting Issues
None
Thank you for your time and appreciating the detailed descriptions in the paper, the thorough ablation studies and the promising results.
The only machine learning model among the baselines is an LSTM. Space-filling curves might be well-suited for Mamba, but it is not clear whether this is the right approach for an LSTM. (Graph) convolutional networks would be a more interesting baseline.
There seems to be some misunderstanding about the LSTM baseline. To train the LSTM, we followed the same protocol as originally proposed in Nearing et al. (2024), which considers only temporal context but does not include any spatial connections. The space-filling curves are thus not used in combination with the LSTM. As pointed out, an LSTM would also not be able to handle such long curves. The baselines are described in more detail in the suppl. (page 17). In addition, we note that the LSTM is the most popular backbone for river discharge forecasting. Recent GNN models for hydrology still use an LSTM in their backbone for the temporal modeling, e.g., (Gauch et al., "Towards Deep Learning River Network Models", EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-9768, 2025), and many studies use an LSTM to resolve the spatial modeling, e.g., (Yu et al., "Enhancing long short-term memory (LSTM)-based streamflow prediction with a spatially distributed approach", Hydrol. Earth Syst. Sci. 28, 2024). Note that real-world gridded data like the GloFAS dataset has a spatial resolution of 7000 × 3200 (suppl. A.7 page 7) and we consider 4 time steps as input. A spatio-temporal graph would consist of 7000 × 3200 × 4 = 89,600,000 nodes, which would be computationally very expensive. To the best of our knowledge, such a GNN baseline does not exist, since GNNs are only proposed in the literature at local scales or coarse resolution, i.e., basin-level analysis. Note that we compare to a Transformer architecture with flash attention in Figure 6 and Tables 6 and 7 of the suppl.
Limited reproducibility (no code provided)
Thank you for the comment. The code will be made publicly available upon publication, as we mention in the NeurIPS Paper Checklist regarding "Open access to data and code" and in the suppl. on page 53.
Related work uses GNNs to model the spatio-temporal relationships. Why did you opt for space filling curves combined with Mamba?
The main motivation is that we aim toward developing a model that operates at the grid scale and achieves high accuracy. Mamba avoids the computational cost of GNNs (see the above clarification about GNNs) and Transformers (see Section F.1 in the suppl. material). The space-filling curves allow the model to learn the spatial relations adaptively without the complexity of a graph structure.
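To illustrate the serialization idea, here is a minimal sketch (not the authors' implementation; the river mask and the simple sweep ordering, one of the curves mentioned in the paper, are assumptions for illustration):

```python
import numpy as np

# Minimal sketch: serialize the valid river pixels of a 2D grid with a
# boustrophedon "sweep" curve so that a sequence model (e.g., Mamba) can
# consume them as a 1D sequence. `mask` marks river pixels (illustrative).
def sweep_serialize(mask: np.ndarray) -> np.ndarray:
    """Return (row, col) indices of True pixels in sweep order."""
    order = []
    for r in range(mask.shape[0]):
        cols = np.flatnonzero(mask[r])
        if r % 2 == 1:            # reverse every other row: neighbors stay close
            cols = cols[::-1]
        order.extend((r, c) for c in cols)
    return np.asarray(order)

mask = np.random.rand(4, 5) > 0.5   # toy river mask
seq_idx = sweep_serialize(mask)     # features[seq_idx] forms the 1D sequence
```

Alternating between different curves across layers changes which points become sequence neighbors, which is how the model can pick up different spatial correlations.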
Dear authors, thank you for the clarifications. Your points are understandable, but there are still ways to improve the baselines. E.g., as you already implemented a sampling scheme for RiverMamba, it should be easy to adjust it to a scale where GNNs or CNNs work. Moreover, there are also other ways to aggregate data that can be used instead of space-filling curves. I will stay at my rating as it is a very solid and interesting paper, but there is some room for improvement concerning the baselines.
Dear Reviewer ugtP,
We sincerely appreciate your feedback. Your recognition that our work is very solid and interesting is highly encouraging for us.
In the paper, we compared to 6 baselines: LSTM, Climatology, Persistence (Section 4.1), GloFAS (Section 4.2), and Transformer with Flash-attention and Mamba2 (Supplementary Section F.1). Since there is no GNN or CNN-based model that we can directly compare to, adding an additional GNN or CNN baseline requires building a new model from scratch and adapting the input data. This is beyond the scope of this paper. We believe that the comparison to the state-of-the-art baselines is sufficient to validate the proposed approach.
The authors frame the streamflow and flood forecasting problems as image-level tasks with each pixel mapped to a unique geographical location, covering 0.05° in both directions. In this setting, they propose an encoder-decoder architecture where the encoder integrates past spatio-temporal maps in conjunction with static information into a hidden representation. Using this encoded information alongside forecasted spatio-temporal data, the decoder predicts, for each point on the map, the changes to the river discharge values relative to the mean of the encoder input data. To produce predictions at scale, the authors propose the usage of Mamba layers with their linear space and time complexity. A set of space-filling curves serialise these geographically organised data. Through alternation of these curves, different spatial points will be close in the serialised sequence, helping the model to learn different spatial correlations. A novel normalisation layer facilitates the incorporation of static attributes, like the river channel network, into the hidden representation at each step in the encoder. It does that by linearly transforming the static attributes, activating them using GeLU, and additively overlaying these activations with the z-score normalised hidden representation at that point. With this architecture, the authors demonstrate superior performance compared to existing state-of-the-art models (Google's operational LSTM) as well as physics-based models (GloFAS) when evaluated on both reanalysis and observational data. This performance increase is measured on multiple metrics (R², KGE, F1-score) on a global forecasting scale. Notably, RiverMamba is the first architecture capable of producing global river-discharge forecasts with a 7-day lead time at a spatial resolution of 0.05°.
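As a concrete illustration of the described normalisation layer, a minimal PyTorch sketch (my reading of the mechanism summarised above, not the authors' code; `LayerNorm` stands in for the z-score normalisation and all dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LOANSketch(nn.Module):
    """Sketch of Location-Aware Adaptive Normalization as summarised above."""
    def __init__(self, static_dim: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)           # z-score-style normalization
        self.proj = nn.Linear(static_dim, hidden_dim)  # linear map of static attributes

    def forward(self, h: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # h: (batch, points, hidden_dim) hidden representation
        # s: (batch, points, static_dim) static river attributes per point
        return self.norm(h) + F.gelu(self.proj(s))     # additive overlay

layer = LOANSketch(static_dim=16, hidden_dim=64)
out = layer(torch.randn(2, 100, 64), torch.randn(2, 100, 16))  # (2, 100, 64)
```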
Strengths and Weaknesses
Originality
Applying state space models—including the necessary usage of space filling curves for serialisation—to the global hydrological forecasting task is, to the reviewer's knowledge, novel. Repeatedly adding linearly transformed and GeLU activated static attributes to the normalised hidden embeddings provides a clever way of informing the model of static spatial relationships. Additionally, their composed weighting factors that balance the recency of the input with the return period represent a thoughtful approach to addressing the dual challenges of temporal decay in forecast reliability and the relative importance of rare flood events in the training objective.
Quality -- Strengths
The authors provide a thorough evaluation of RiverMamba across multiple dimensions, including systematic ablation studies on location embedding approaches, space filling curve alternatives, weighting factors in the objective function, the impact of exchanging static attributes from LISFLOOD-derived to HydroRIVERS-derived sources, input feature importance analysis, and the effects of pretraining versus training from scratch. In addition to temporal out-of-sample settings, the extended evaluation in the appendix also tests spatial generalization and compares the model not only against the leading data-driven model from Google but also against the physics-based model GloFAS. Furthermore, the authors effectively contextualize their work by contrasting related approaches in each section against their proposed methodology, demonstrating a clear understanding of how their contributions advance the field.
Quality -- Weaknesses
The model produces only point forecasts without uncertainty estimates. These, however, provide important additional information that decision makers can use in the operational flood forecasting context. The ablation studies mentioned in the main text and summarised in Table 2 leave out one possible combination, preventing assessment of any separate impact of one hyperparameter from possible synergistic effects. Specifically, sub-tables (a) and (b) lack results for using only the recency factor or LOAN exclusively in the forecast block. This omission prevents assessment of whether the observed performance improvements are independent effects or result from synergistic interactions between the return period weighting and LOAN placement in different blocks. The authors do not provide inference time comparisons on the global scale, particularly against the LSTM baseline, which would help assess the practical computational advantages claimed for the Mamba architecture. Additionally, there is a partially incorrect statement about Flash-Attention in the appendix (line 261) claiming linear complexity; while this accurately describes the space complexity improvement, Flash-Attention's time complexity remains quadratic in the input length, which undermines the claimed efficiency comparison.
Clarity
The main part of the paper clearly describes the key components of the architecture, highlighting the intended effects of each design element. However, some presentation issues hinder clarity. The sentence spanning the lines 137-140 is overly long and complex, making it difficult to follow on first reading. The relationship between the number of hindcast layers and the chosen temporal input dimension is not clearly explained in the main text, though the former seems to be dependent on the latter. Additionally, Figure 4's layout is somewhat ambiguous -- it is difficult to discern whether the first serialisation layer belongs to the hindcast block, and including it within the outer shaded region would clarify this organizational structure. Finally, while the appendix describes how the authors set up and trained a version of Google's LSTM, it does not explicitly clarify how this relates to the LSTM results shown in the figures versus results obtained directly from Google's system.
Some typos were found during reading the paper including the appendix:
- Main paper: Line 65 "saptio-temporal" -> "spatio-temporal"
- Appendix: Line 167 "see" -> "sea"
Significance
The model demonstrates significant practical value by effectively detecting very rare floods that might be missed by existing LSTM approaches as impressively shown in Figure 47 for example. However, the model's spatial generalization capabilities appear limited based on the results in appendix section J, where performance on spatially out-of-sample ungauged basins does not significantly differ from the LSTM baseline and regresses quickly with increasing return periods.
Questions
- Does weighting no-flood events in equation (8) equally to the ones with a one-year return period not skew the results toward the no-flood case? Weighting floods not just by their return period but also by augmenting it with an additive flood offset (say, 1) might help the flood scarcity problem a bit, which has also been mentioned in the conclusion.
- Could you provide computational time comparisons to better position the model in the landscape of operational forecasting alongside Google's LSTM and GloFAS?
- Is there an architectural reason for not training on quantile regression, thereby providing a way of estimating the uncertainty of the model's predictions?
- Could you evaluate the model's performance with larger temporal gaps between training and test data (e.g., 5-10 years) to assess robustness to climate change impacts on flood drivers? This would help establish the model's long-term operational viability.
Limitations
yes
Justification for Final Rating
I thank the authors for their thorough response and for running additional experiments, which have clarified several points. However, as my concerns regarding the ablation studies and spatial generalization limitations persist, I will maintain my initial rating.
Formatting Issues
None
Thank you for the detailed and clear summary you have provided about this work. We are glad that you found our work novel and the evaluation thorough.
The model produces only point forecasts without uncertainty estimates. These, however, provide important additional information decision makers can use in the operational flood forecasting context.
As mentioned on page 9, extending the model to output a probabilistic forecast is an interesting future direction.
The ablation studies mentioned in the main text and summarized in Table 2 leave out one possible combination preventing assessment of any separate impact of one hyperparameter from possible synergistic effects. Specifically, sub-tables (a) and (b) lack results for using only the recency factor or LOAN exclusively in the forecast block.
There might be some misunderstanding about Table 2. Table 2 actually comprises 3 distinct sub-tables (a), (b), and (c). Table 2 (b) evaluates exclusively the impact of LOAN. We will revise it to make this clearer.
There is a partially incorrect statement about Flash-attention in the appendix (line 261) claiming linear complexity; while this accurately describes the space complexity improvement, Flash-attention's time complexity remains quadratic in the input length, which undermines the claimed efficiency comparison.
Thank you for the note. We will correct it and make it clearer whether we refer to runtime or memory. In Fig. 6 in the appendix, we compare both.
The sentence spanning the lines 137-140 is overly long and complex, making it difficult to follow on first reading.
We will revise the sentence.
The relationship between the number of hindcast layers and the chosen temporal input dimension is not clearly explained in the main text, though the former seems to be dependent on the latter.
Thank you for this comment. In our implementation, we chose T=4, as in the GloFAS operational system, and a temporal down-sampling factor of 2. Consequently, we defined 3 layers to encode the input (the first layer doesn't use down-sampling). For example, when T=8 and the down-sampling factor is 2, one would need 4 hindcast layers. The table below shows the temporal resolution for each layer w.r.t. the input temporal dimension T.
| Layer | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| T=2 | 2 | 1 | | |
| T=4 | 4 | 2 | 1 | |
| T=8 | 8 | 4 | 2 | 1 |
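In other words, with down-sampling by 2 after every layer except the first, the number of hindcast layers is log2(T) + 1. A small illustrative sketch of this relationship (not the authors' code):

```python
# Sketch: temporal resolution per hindcast layer for input length T,
# assuming down-sampling by a factor of 2 after every layer except the first.
def hindcast_layer_resolutions(T: int) -> list[int]:
    out = [T]            # the first layer keeps the full temporal resolution
    while out[-1] > 1:   # each subsequent layer halves it
        out.append(out[-1] // 2)
    return out

print(hindcast_layer_resolutions(4))  # [4, 2, 1] -> 3 layers
print(hindcast_layer_resolutions(8))  # [8, 4, 2, 1] -> 4 layers
```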
Figure 4's layout is somewhat ambiguous -- it is difficult to discern whether the first serialisation layer belongs to the hindcast block, and including it within the outer shaded region would clarify this organizational structure.
Thank you for this suggestion. We will modify the figure accordingly.
While the appendix describes how the authors set up and trained a version of Google's LSTM, it does not explicitly clarify how this relates to the LSTM results shown in the figures versus results obtained directly from Google's system.
All results shown in the experiments are based on Google's LSTM trained on our dataset, except the results shown in suppl. section J (pages 23-28), which are obtained directly from the published Google reforecast. We will make this clear in the paper.
Some typos were found during reading the paper.
Thank you for the careful reading. We will fix the typos and check the paper for any other ones.
The model demonstrates significant practical value by effectively detecting very rare floods that might be missed by existing LSTM approaches as impressively shown in Figure 47 for example. However, the model's spatial generalization capabilities appear limited based on the results in appendix section J, where performance on spatially out-of-sample ungauged basins does not significantly differ from the LSTM baseline and regresses quickly with increasing return periods.
As described in the suppl. section J, there are plausible reasons behind this observation:
- We do not use nowcasting, i.e., the weather forecast at t=0, as input. We expect that this would substantially improve the results at early lead times, and it is the main reason for the close performance at early lead times.
- The Google model (Nearing et al. 2024) was trained on more stations than the 3,366 used for RiverMamba, and the ungauged streamflow forecast becomes better as the number of stations increases.
- There are also differences in the input initial conditions, i.e., Google's LSTM uses precipitation estimates from the NASA Integrated Multi-satellite Retrievals for GPM (IMERG) early run as input.
- In addition, Google's LSTM uses an ensemble of three separately trained LSTMs (Nearing et al. 2024).

The results in Section J are thus not directly comparable, but we added them for completeness. We expect that using nowcasting and IMERG as input and an ensemble would improve the results of RiverMamba on ungauged basins further. For the other results in the paper, we used the same training and evaluation setup for all methods to ensure a fair comparison.
Does weighting of no-flood events equally in equation (8) to the ones with a one-year return period not skew the results to the no-flood case? Weighting floods not just by their return period but also by augmenting it with an additive flood offset (let's say 1) might help the flood scarcity problem a bit, which also has been mentioned in the conclusion.
As suggested, we conducted an additional experiment. To this end, we trained a model that weights flood events by their return periods plus a flood offset of 1. The F1 results are shown below for the reanalysis dataset and different return periods:
| Return period | 1.5 | 2.0 | 5.0 | 10.0 | 20.0 |
|---|---|---|---|---|---|
| Validation (2019-2020) | |||||
| w/ offset | 48.20 | 37.60 | 23.58 | 17.90 | 11.81 |
| w/o offset | 48.70 | 37.67 | 25.16 | 20.15 | 12.08 |
| Test (2021-2024) | |||||
| w/ offset | 61.14 | 50.80 | 31.25 | 24.86 | 16.56 |
| w/o offset | 61.22 | 50.72 | 30.31 | 24.34 | 16.69 |
Adding an offset does not improve the results.
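For clarity, the ablated variant could be sketched as follows (a hypothetical sketch: `return_period` and `is_flood` are illustrative names, and the L1 base loss is an assumption; the exact form of Eq. (8) is in the paper, only the additive offset is illustrated here):

```python
import torch

# Hypothetical sketch: flood points are weighted by their return period,
# optionally plus a constant offset (1.0 in the experiment above);
# no-flood points keep unit weight.
def weighted_loss(pred, target, return_period, is_flood, offset=0.0):
    w = torch.where(is_flood, return_period + offset,
                    torch.ones_like(target))
    return (w * (pred - target).abs()).mean()
```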
Could you provide computational time comparisons to better position the model in the landscape of operational forecasting alongside Google's LSTM and GloFAS?
Neither Google (Nearing et al. 2024) nor GloFAS (Harrigan et al. 2023) provided the compute time for the operational forecast. The inference time for RiverMamba is reported in suppl. Fig. 6. In the table below, we report the inference time (seconds) w.r.t. the number of input points for our model with 4 days as hindcast (first row), a trained version of Google's LSTM with 4 days as hindcast (second row), and a Google's LSTM version as in Nearing et al. (2024) with a one-year hindcast (third row). We use one A100 GPU for all runs. All machine learning approaches are very fast. We expect that GloFAS is several orders of magnitude slower, which is a practical advantage of machine learning approaches for this task.
| Model | 10K | 20K | 40K | 80K | 160K | 300K | 600K | 1500K |
|---|---|---|---|---|---|---|---|---|
| RiverMamba | 0.026 | 0.044 | 0.086 | 0.190 | 0.423 | 0.874 | 1.914 | 4.739 |
| LSTM | 0.005 | 0.009 | 0.015 | 0.027 | 0.053 | 0.098 | - | - |
| LSTM (one-year hindcast) | 0.069 | - | - | - | - | - | - | - |
Is there an architectural reason for not training on quantile regression, thereby providing a way of estimating the uncertainty of the model's predictions?
There is no architectural reason that prevents training with a quantile loss. An ensemble of different models can be trained with different quantiles. We agree that an ensemble and estimating the uncertainty of the model are an interesting direction for future work, but we focused on deterministic forecasts to thoroughly analyze and evaluate the design of the network architecture. We think that the results demonstrate the strong capabilities of RiverMamba for large-scale river discharge forecasting. Using an ensemble not only increases the computational cost for running all ablation studies, but it also requires a different evaluation framework, e.g., adding the Continuous Ranked Probability Score (CRPS) and spread/skill ratio metrics. Given that the suppl. material already has 54 pages, adding an ensemble and uncertainty estimation is out of scope, but an interesting future research direction.
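For context, quantile regression would only change the objective, not the architecture; the standard pinball loss (a general fact about the method, not taken from the paper) is sketched below:

```python
import torch

# Standard pinball (quantile) loss: tau in (0, 1) selects the target quantile.
def pinball_loss(pred: torch.Tensor, target: torch.Tensor, tau: float) -> torch.Tensor:
    diff = target - pred
    # Over-prediction is penalized with (1 - tau), under-prediction with tau.
    return torch.maximum(tau * diff, (tau - 1.0) * diff).mean()
```

Training heads (or separate models) for, e.g., tau = 0.1, 0.5, 0.9 would yield a simple predictive interval around the median forecast.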
Could you evaluate the model's performance with larger temporal gaps between training and test data (e.g., 5-10 years) to assess robustness to climate change impacts on flood drivers?
In the table below, we report the evaluation for each year in the test dataset separately. The model is trained over Europe for the years 1979-2018; shown are averaged scores (KGE / F1) for the return periods 1.5-20 years for both reanalysis and GRDC data.

| Year | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|---|---|
| Reanalysis | 90.80 / 30.09 | 91.75 / 20.57 | 90.32 / 24.42 | 90.76 / 20.54 | 91.05 / 34.57 | 90.41 / 38.66 |
| GRDC (obs) | 76.63 / 10.77 | 76.31 / 12.74 | 69.33 / 11.79 | 71.41 / 08.94 | 71.73 / 19.88 | 52.32 / 14.53 |
Note that the evaluation is highly dependent on the ratio of flood events that happened during a specific year and on the dataset. On the observational data, the modeling of the discharge time series (KGE) decreases over time, while the model stays robust regarding flood detection (F1). This is probably due to anthropogenic intervention, as mentioned in the manuscript. Reanalysis data provides more data points for training, and the model shows a good generalization ability with a larger gap between training and testing.
I would like to thank the authors for addressing all noted points in my review, clarifying how the model hyperparameters regarding the number of layers were chosen and for running the additional tests. The experiments on the proposed offset were insightful.
I would like to clarify my question regarding the ablation studies in Table 2, as my initial phrasing may have been unclear. My concern is that for both Table 2a and 2b, a specific experiment is missing that would allow for a complete assessment of individual versus synergistic effects. To fully understand the source of the performance gains and rule out potential synergistic effects, I believe two specific ablations are needed:
- In Table 2a, a result for using only the recency factor in the objective function.
- In Table 2b, a result for using the LOAN layer in the forecast block only.
These additions would make it possible to isolate the individual contribution of each component.
I would like to thank the authors for addressing all noted points in my review, clarifying how the model hyperparameters regarding the number of layers were chosen and for running the additional tests. The experiments on the proposed offset were insightful.
We appreciate your insightful feedback, which has significantly improved the clarity and completeness of the paper. We are very glad that all noted points were addressed.
I would like to clarify my question regarding the ablation studies in Table 2, as my initial phrasing may have been unclear. My concern is that for both Table 2a and 2b, a specific experiment is missing that would allow for a complete assessment of individual versus synergistic effects. To fully understand the source of the performance gains and rule out potential synergistic effects, I believe two specific ablations are needed:
- In Table 2a, a result for using only the recency factor in the objective function.
- In Table 2b, a result for using the LOAN layer in the forecast block only.
These additions would make it possible to isolate the individual contribution of each component.
Thank you for this further clarification. We indeed misunderstood your question. As suggested, we conducted the additional experiments as shown in the tables:
Table (a): Objective function
| Return-period weight | Recency weight | KGE / F1 |
|---|---|---|
| ✗ | ✗ | 90.86 / 22.36 |
| ✓ | ✗ | 91.27 / 28.59 |
| ✗ | ✓ | 91.36 / 25.93 |
| ✓ | ✓ | 92.05 / 28.75 |
Table (b): Location embedding
| LOAN (hindcast) | LOAN (forecast) | KGE / F1 |
|---|---|---|
| ✗ | ✗ | 91.83 / 27.90 |
| ✓ | ✗ | 91.60 / 28.27 |
| ✗ | ✓ | 91.66 / 29.31 |
| ✓ | ✓ | 92.05 / 28.75 |
Using only the recency factor (Table (a), third row) improves the results on both KGE and F1 metrics. In Table (b), third row, using the LOAN layer only in the forecast blocks increases the F1 score while decreasing KGE. The combination of layers in both hindcast and forecast blocks balances the metrics.
We will include these additional ablations in the final version for completeness.
This paper introduces a genuinely interesting forecasting approach that builds upon a global hydrological model, called GloFAS, by integrating information over an (extremely) wide space. Traditional approaches always orient themselves on the physical conception of catchments and hence rely on a physical delineation thereof. This means that a lot of spatial information is not accounted for here. This is where RiverMamba comes in. It leverages state space models to integrate the spatial data pixelwise (and avoids paying the quadratic cost per additional pixel) and builds a new forecast out of the existing reanalysis.
Strengths and Weaknesses
Albeit I mainly focus on the weaknesses in the following, I would like to emphasize that the overall contribution is — in my view — quite strong. The idea is good, the evaluation is sufficient for what the authors want to show, and the ablations are thorough. Still, I would like to start my critique with a meta-point: the current form of the paper is rather hard to read. At some points it is even confusing. Many parts are not explained well enough and readers need to make inferences about what actually happens. For example, it is unclear how the LSTM baseline is trained and used. Is the LSTM baseline also trained on sequences of multiple hundred thousand timesteps (if yes, this is far beyond what LSTMs usually can work with)? Similarly, it took me unnecessarily long to understand that the model is still trained and used with only a small subset of the available pixels. In short, I think with some love the readability of the manuscript could be greatly improved. As a consequence, most of my critique consists of what I would normally consider minor comments. As a matter of fact, even my major point is related to language:
My main point is that with the current formulation readers will get the impression that the LSTM baseline is the one from Nearing et al. (2024). But, as far as the provided description lets me infer, the one used as baseline has nothing in common with it besides the general architecture (encoder-decoder LSTM). The task of the LSTM in the original publication is to integrate meteorological information over longer time-horizons and provide a streamflow simulation. In contrast, the task from RiverMamba is to provide a post-processor for such a streamflow simulation model — namely GloFAS. As a matter of fact, the LSTM from Nearing et al. (2024) could very well be used as a stand-in backbone for the hydrological model, which would provide a more apt comparison. As an example of how the used language can be misleading, take Table 3. First, an LSTM as proposed in Nearing et al. (2024) in a gauged setting commonly achieves a considerably higher median score in the test period. Second, most results and simulations provided by Nearing et al. (2024) are for an ungauged setting (i.e., the GRDC stations were not included in the training), which, if compared to gauged, is obviously worse. Similarly, different sequence lengths can be used if you feed in the states (snow, soil moisture) and current runoff to the model compared to a model that has to simulate everything (e.g., snow accumulation during the winter to predict snow melt in the spring/summer) just from weather inputs. This is because the states and the streamflow integrate the meteorological signal of the past already. Hence, sentences like the following example can easily give the wrong impression:
“In [6], an input sequence length of 365 days was used. In our experiments, we also trained the LSTM model using a range of input sequence lengths from 4 to 90 days. We observed only marginal performance gains beyond a certain point, and identified 14 days as an optimal input sequence length.”
Questions
I happily increase my score if the rebuttal addresses my questions and improves on the readability aspect.
- I do not find the motivation of a global receptive field to account for uncertainty in the meteorological variables convincing. Why would ERA5-Land runoff, snow, and soil moisture in a river of Siberia be a useful feature for a short-term (7-day) forecast in a reach of the Amazon River? While there is uncertainty in the precise location (and occurrence/magnitude) of precipitation forecasts, there is certainly no real-world meaning in such a receptive field. Additionally, why would a pixel in their sequence of p in P somewhere in Siberia be more informative for river discharge in a reach of the Amazon than the direct neighbors and upstream pixels in that basin (many of which are not included due to the random sampling)? P denotes not even 5% of all landmass pixels. This seems particularly off since the localized ablation suggests to me that mostly local information is used to get high predictive capacity. Hence, I would recommend to the authors to instead frame the whole approach in a more traditional machine learning fashion. Something along the lines of: traditional models, like GloFAS, do not account for wide-ranging spatial correlation. We can exploit that with machine learning. To do so, we use a lot of data and let the model learn the spatial correlations rather than setting them a priori (in this regard, you might want to refer to the bitter lesson or something similar).
- The paper emphasizes in multiple passages that RiverMamba can be used in an operational setting (as a matter of fact, the comparison is made against two operational models). However, the model relies on ERA5-Land which is not available in real-time but has a ~30 day lag.
- L44. Not entirely true. a) The Google Model runs globally at 300k locations (but only a few thousand are verified to be good). b) Yang et al. already have a global gridded model.
- L89. There is quite some literature missing (e.g., 1, 2, 3, 4, 5 to name a few)
- L121. This must be a misunderstanding from my side but why is RiverMamba not including any information from the nowcast timestep but only up to t-1 and then starting on t+1?
- L123. I struggle to understand the sentence about the “additional” shift to GloFAS, ERA5, and CPC. According to your notation, you only consider data until t-1, with t being the date that the forecast is issued. Why another shift by 2 days for CPC and 1 day for GloFAS and ERA5?
- Figure 4 appears before Fig 3 (which is interesting, I did not even know that Latex allows for that)
- Eq 2. It does seem to me that the dimensions listed in the text below do not work with this equation?
- Eq 2. Why GELU? Have you tried different activations functions or is there any motivation to it? The LOAN reference does not (yet) seem to use GELU.
- In general, but I noted it especially around L236ff: The terminology is a bit off. There is a difference between a flood and an event of a given return period. A 1.5 year return period does not necessarily mean any flooding. In most places in North America/Europe, not even a 10/20 year return period should necessarily mean flood.
- L238. Where do these ranges come from? Is it something you chose?
- L247. I couldn’t find the results on the ablation study of their loss weighting.
- Suppl. L190. In many operational settings you compute return periods separately in the observed space (i.e., long-term records of streamflow measurements) and in model climatology (i.e., long-term reanalysis/reforecasts of the model). The important point to measure here is whether the model predicts the correct return period and not how well the predictions match the observations. For example, imagine your model gives perfect predictions but shifted down by 100 cms. For flood forecasting, to know when to trigger an alert, it is only important to know if your model thinks the upcoming event will be larger than some threshold (usually in return-period space). If the return period thresholds are computed separately for observations and model climatology, you would get 100% accuracy with such a model, even though your absolute predictions are constantly off by 100 (which is not important if your task is operational flood warnings/forecasts).
- Suppl. Tab 5. The table is still showing P=245954 for GRDC finetuning. Is the loss only computed for those p in P that are associated with a GRDC station or also on all other points (here then on dis24 from GloFAS)?
- Suppl. F2. How can GloFAS be dropped here?
Lastly, I am mainly mentioning the following because it came up in some other reviews that I was involved in and because of Fig 13 in the Supplementary. This figure shows F1 scores over lead time for different return periods. If I understand the explanations correctly, this should essentially be the same information as Fig 3 in Nearing et al. (2024). However, the figures show very different values. There might be a misalignment in the data when computing the metrics, since Nearing et al. (2024) provided the data right-labeled, while GRDC data is left-labeled. E.g., note from Zenodo:
"Model outputs are daily and timestamps are right-labeled, meaning that model outputs labeled, e.g., 01/01/2020 correspond to streamflow predictions for the day of 12/31/2019."
Limitations
yes
Justification for Final Rating
This contribution is technically excellent and proposes an approach that does not only have the potential to improve hydrological predictions but land-surface models in general (and is perhaps even of interest for weather forecasting). My concerns about the paper were from the start about language and positioning --- and the authors did a very good job at explaining how they will improve the paper in this regard.
Formatting Issues
Addressed in my questions
Thank you for the detailed review and the thoughtful feedback. We appreciate your time and we are glad that you found our approach interesting and the contribution strong.
It is unclear how the LSTM baseline is trained and used.
We describe the details of the LSTM baseline in Section G.3 of the suppl. As in Nearing et al. (2024), we do not use spatial connections for the LSTM. The sequence length for the LSTM is 14 days, which performed better than using a longer sequence length. We will state this clearly in the revised version.
The LSTM baseline from Nearing et al. (2024).
The original model from Nearing et al. (2024) is not publicly available. We thus built the encoder-decoder LSTM from the neuralhydrology repository and followed the model structure as described in Nearing et al. (2024). The LSTM model is trained similarly to Nearing et al. (2024) but in a gauged setting (Section G.3 in Suppl.). For a fair comparison of the methods, we also use the same input to the LSTM model and RiverMamba, which is different from Nearing et al. (2024). We will clarify this in the revised manuscript and in the tables, in particular in Table 3. We do not think that a model should be expected to achieve a specific median score. The metric is affected by many factors like the length of the evaluation set (length of the extrapolation), lead time, input data, number of gauged points, resolution, region of study, etc.
Different sequence lengths can be used if you feed in the states and current runoff to the model compared to a model that has to simulate everything just from weather inputs.
Thank you for this clarification. We will integrate this into the manuscript to make it clearer.
The motivation for a global receptive field.
We will revise the formulation since the term "global" might be misleading. As mentioned in the suppl. material on page 16, we split the Earth into smaller domains with 311K nearest points for each. We thus do not connect points from the Amazon to points in Siberia. However, we give the network the possibility to consider a very large spatio-temporal context. For instance, river networks like Amazon can be very large and providing a model that can handle a very large spatio-temporal context is thus very intuitive.
Finally, we present a general backbone model that is not limited to medium-range river discharge forecasting and specific input variables or tasks. For instance, the model can be used for seasonal discharge forecasting. Furthermore, weather forecasting and climate modeling generally require a large receptive field to preserve the dynamics over lead time, and RiverMamba is a promising backbone for such tasks.
The model relies on ERA5-Land which is not available in real-time but has a ~30 day lag.
Please kindly note that ECMWF states: "The ERA5-Land dataset is available for public use for the period from 1950 to 5 days before the current date." ... "The ERA5-Land-T version delivers non-checked close to Near-Real-Time (NRT) daily updates. ERA5-Land-T is synchronized with the close to NRT daily updates provided by the ERA5 climate reanalysis (ERA5T)." There is thus no risk of having a 30-day lag. In the RiverMamba framework, ERA5-Land can also be replaced by an analysis or ERA5, similar to GloFAS. We will revise the manuscript to make this issue clear.
L44. Not entirely true. a) The Google Model runs globally at 300k locations. b) Yang et al. already have a global gridded model.
a) We will rephrase the sentence. We wanted to highlight here the limited spatial modeling of the LSTM.
b) Please see L91-L93 and ref [60], we already discuss the preprint version by Yang et al. We will update the reference and text based on the published version. Thank you for the note.
L89. There is quite some literature missing.
Thank you for this note. We will briefly discuss the mentioned methods in the related works. Kindly note that the first reference [1] is already mentioned in the manuscript.
L121. Why is RiverMamba not including any information from the nowcast timestep?
RiverMamba intentionally excludes the analysis data (t=0) from the input, and instead starts forecasting from t+1 using only past data up to t–1. The rationale behind this design is to ensure broader applicability, since many weather forecast systems, especially ML models, provide only forecast lead times, and we wanted to make the model more generic so it can work without nowcasting or a specific type of forecast (IFS-HRES). However, adding nowcasting to the model is straightforward and can be done optionally. We will mention this explicitly in the manuscript.
L123. Why another shift by 2 days for CPC and 1 day for GloFAS and ERA5?
We will revise this sentence and remove the word "shift" since we do not shift data. To our knowledge, CPC data are available with 2 days lag. Consequently, we only consider data until t-2.
Figure 4 appears before Fig 3.
We will correct it.
The dimensions in Eq 2.
Thank you for this note. We will correct it. PyTorch broadcasts the dimensions for the different operations, i.e., the smaller tensors are duplicated along the last dimension to match the hidden representation.
Eq 2. Why GELU?
We were motivated by two things: first, GELU avoids the dying ReLU problem and improves optimization. Second, it has been shown that a learned activation function initialized as ReLU ends up in a smooth variant shape similar to GELU (Teney et al., "Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild", CVPR'25). We conducted an additional experiment (see the table below). The model is trained over Europe for the years 1979-2018 and tested on the validation set (2019-2020). Shown are averaged scores (KGE / F1) for the return periods 1.5-20 years on reanalysis data.

| ReLU | GELU |
|---|---|
| 91.43 / 28.33 | 92.05 / 28.75 |
Using GELU performs slightly better.
L236ff: The terminology is a bit off.
Thank you for pointing out the imprecise usage of terminology. We agree that a return period e.g., 1.5 years does not necessarily imply actual flooding in all regions, particularly in highly regulated or flood-resilient areas. A high return period event simply reflects statistical rarity in streamflow magnitude, and should not be equated with a flood event without additional context (e.g., thresholds, inundation). In the revision, we will clarify it and emphasize that the return period is used as a proxy indicator of hydrological extremity, rather than a direct definition of flood severity.
L238. where do these ranges come from?
These ranges are the return periods used in the operational GloFAS system run by ECMWF.
L247. I couldn’t find the results on the ablation study of their loss weighting.
The ablation study regarding the weighting is provided in Table 2 (a).
Suppl. L190. In many operational settings you compute return periods separately in the observed space and in model climatology.
Thank you for this insightful comment. You are correct that in many operational settings, thresholds are computed from the model's own climatology (e.g., long-term reforecasts), and what matters is whether the model correctly identifies exceedance relative to its own statistical distribution, rather than matching the exact values of the reference. As described in Suppl. L190–L194, we calculated return period thresholds separately from both the GRDC observations and the GloFAS reanalysis. This allows us to evaluate each dataset relative to its own climatology. However, we did not compute return periods from the trained ML models' reforecasts, as this would require generating a long reforecast climatology to fit a statistical extreme value distribution. This was beyond the scope of the current study. We acknowledge that this approach penalizes a model that has a systematic bias but still captures the correct return period, but a systematic bias is also not desirable. We will discuss this issue.
Suppl. Tab 5. Is the loss only computed for those p in P that are associated with a GRDC station or also on all other points (here then on dis24 from GloFAS)?
For GRDC finetuning, P=245954 is the number of input points per sample. The loss is computed only on 3366 points where GRDC observations are available. We do not use the dis24 GloFAS here to compute the loss. We will make this clearer in the revision.
Suppl. F2. How can GloFAS be dropped here?
We only drop GloFAS from the input and keep the data as a target to train the model. During inference, we still use the last step of the GloFAS-Reanalysis and add it to the predicted change to generate the discharge. This ensures consistency within the table and isolates the impact of the input GloFAS-Reanalysis on the model. RiverMamba still works without taking GloFAS as input. This highlights that the model is more than a post-processor of discharge data and can in fact be used as a backbone for hydrological modeling. However, if we want to drop GloFAS completely, we need to change the objective function, e.g., by predicting the absolute value of river discharge or the change of discharge w.r.t. climatology.
Fig 13 in the Supplementary.
The GRDC observational time series, which are originally left-labeled, were explicitly converted to right-labeled time series to ensure temporal consistency with the right-labeled predictions from the GloFAS simulations, RiverMamba, the LSTM baseline, and the Google reforecast dataset. The F1 scores reported in Fig. 13 are based on synchronized detection windows between model predictions and observations. We will highlight the label alignment step in the revision to avoid potential confusion. Regarding the observed differences between our Fig. 13 and Fig. 3 in Nearing et al. (2024), we note that although both figures present F1 scores by lead time and return period, they differ in several aspects like thresholding strategy, data sampling, and station filtering (evaluation domain).
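For illustration, the left-to-right label conversion amounts to shifting the timestamp index by one day (a minimal pandas sketch with illustrative data, not the evaluation code):

```python
import pandas as pd

# Convert a left-labeled daily series (timestamp marks the start of the day)
# to right-labeled (timestamp marks the end of the day), matching the
# Zenodo convention quoted above.
obs = pd.Series([1.0, 2.0, 3.0],
                index=pd.date_range("2019-12-30", periods=3, freq="D"))
obs_right = obs.copy()
obs_right.index = obs_right.index + pd.Timedelta(days=1)
# The value for the day 2019-12-31 is now labeled 2020-01-01.
```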
I would like to thank the authors for their answers. All clarifications were on point, helpful, and clear. I do have to admit that I feel like my main issue will still only be partially addressed in the revised manuscript --- and I would like to see an explicit commitment from the authors to clean up the exposition. Either way, I think the manuscript will become much better with the changes the authors promised.
I do have some very minor points that came up:
We describe the details of the LSTM baseline in Section G.3 of the suppl...
I am aware of that and did read it before making my statement. The problem was and still is that the description in Section G.3 is not understandable. As an indication, take that 3 out of the 4 reviews asked for a clarification in this regard.
The original model from Nearing et al. (2024) is not publicly available.
This is in contrast to the statement that I found in the paper. There it is written that all models are available: "Fully functional trained models can be found at https://doi.org/10.5281/zenodo.10397664 (ref. 45)." What is going on here?
For a fair comparison of the methods, we also use the same input to the LSTM model and RiverMamba, which is different from Nearing et al. (2024). [...] We do not think that a model should be expected to achieve a specific median score. ... [emphasis mine]
I think the input is the main factor why the performances are so different. In your setup, both RiverMamba and the LSTM are post-processors for GloFAS. Hence, the results can be considerably different from models that directly process meteorological inputs. I must have overlooked this in the manuscript (and/or it goes back to my comment on Section G.3), but your explanation here was very insightful!
We will revise the formulation since the term "global" might be misleading. As mentioned in the suppl. material on page 16, we split the Earth into smaller domains with 311K nearest points for each. We thus do not connect points from the Amazon to points in Siberia. However, we give the network the possibility to consider a very large spatio-temporal context. For instance, river networks like Amazon can be very large and providing a model that can handle a very large spatio-temporal context is thus very intuitive.
Please make this part of the main paper and reformulate. The actual method that you are using is very different (and actually still very nice!) from the impression that readers get (or at least that I got) when reading the current manuscript.
Please kindly note that ECMWF states: ...
I am not sure why I should note this. Isn’t the statement exactly what I mentioned? Did I miss something?
I would like to thank the authors for their answers. All clarifications were on point, helpful, and clear. I do have to admit that I feel like my main issue will still only be partially addressed in the revised manuscript --- and I would like to see an explicit commitment from the authors to clean up the exposition. Either way, I think the work is very good!
Thank you for your review and for your constructive feedback in improving the quality of this work. It is highly appreciated. We will make it clear in the main paper that the LSTM trained in our work is not exactly the same as Nearing et al. (2024).
I am aware of that and did read it before making my statement. The problem was and still is that the description in Section G.3 is not understandable. As an indication, take that 3 out of the 4 reviews asked for a clarification in this regard.
Thank you for this important comment and we understand your concern. In the revised manuscript, we will substantially revise Section G.3 with a focus on:

- Clearly listing the input variables used in our LSTM setup (including GloFAS reanalysis and the exclusion of IMERG).
- Explaining why the pretrained models from Nearing et al. (2024) could not be reused, due to incompatible inputs and the lack of released training code (for more details, please see our response to the next question).
- Highlighting that we adopted the same encoder-decoder LSTM design as Nearing et al. (2024) and used the NeuralHydrology repository.

We will also clarify this point in the main paper to avoid any confusion about model comparison.
This is in contrast to the statement that I found in the paper. There it is written that all models are available: "Fully functional trained models can be found at ... (ref. 45)." What is going on here?
The link contains the weights of the trained network for inference, but not the source code for training. Although the saved models can be loaded for inference using the original inputs, it is not possible to retrain or adapt these models to a different input setup, which was required for our experiments. Therefore, the pretrained models could not be used directly. Following Nearing et al. (2024) ("a research version of the machine-learning model used for this study is available as part of the open-source NeuralHydrology repository on GitHub"), we re-implemented and re-trained the LSTM model following the general design in Nearing et al. (2024) and using the NeuralHydrology repository. We will revise the corresponding section in the manuscript to clarify this point.
For reproducibility, we will provide the trained model and the source code for training and inference. This will make it easy for researchers to compare to, or reuse and adapt RiverMamba for their needs.
I think the input is the main factor why the performances are so different. In your setup, both RiverMamba and the LSTM are post-processors for GloFAS. Hence, the results can be considerably different from models that directly process meteorological inputs. I must have overlooked this in the manuscript (and/or it goes back to my comment on Section G.3), but your explanation here was very insightful!
We appreciate that this point has been resolved. We will make this more clear in the manuscript. Thanks.
Please make this part of the main paper and reformulate. The actual method that you are using is very different (and actually still very nice!) from the impression that readers get (or at least that I got) when reading the current manuscript.
Thank you for emphasizing on this point. We also think this part is important for the reader and we will make it clearer in the revision of the main paper.
I am not sure why I should note this. Isn’t the statement exactly what I mentioned? Did I miss something?
This is in agreement with what you mentioned. We just wanted to make it clear that ERA5-Land has a lag of 5 days instead of 30 days. In the revision, we will mention that ERA5-Land should be replaced by its near real-time version, namely ERA5-Land-T (or ERA5T), in an operational setting.
In this paper, the authors propose RiverMamba, a novel deep learning model that leverages Mamba blocks—a type of bidirectional state space model—to efficiently model hydrological dynamics at a global scale. RiverMamba integrates diverse data sources, including reanalysis datasets (ERA5-Land, GloFAS), meteorological forecasts (ECMWF HRES), static river attributes, and sparse in-situ gauge observations. The model produces global river discharge forecasts at a fine 0.05° resolution, with lead times of up to 7 days, and is capable of predicting extreme flood events across a wide range of return periods.
Strengths and Weaknesses
The paper addresses a critical gap in global-scale, high-resolution river discharge and flood forecasting—an area where most existing deep learning models fall short due to their localized focus and limited ability to capture spatial dependencies across river networks. One of the key innovations of this work is its multi-source data fusion architecture combined with an efficient spatio-temporal modeling strategy. RiverMamba stands out as one of the first deep learning frameworks capable of generating high-resolution (0.05°) global river discharge maps. It moves beyond localized predictions by explicitly modeling spatial connectivity across river systems. The use of space-filling curves (such as Gilbert and Sweep) to serialize river points into 1D sequences is particularly creative, enabling scalable sequence modeling using Mamba blocks. Additionally, the model introduces Location-Aware Adaptive Normalization (LOAN) to condition dynamic inputs on static river attributes, further enhancing its ability to generalize across diverse hydrological settings. For operational forecasting, RiverMamba integrates ECMWF HRES meteorological forecasts and accounts for their uncertainties through robust spatio-temporal modeling. Experimental results demonstrate that it outperforms both physics-based models (like GloFAS) and AI baselines (like LSTM), especially in predicting extreme flood events across a wide range of return periods—from 1.5 to 500 years.
Questions
The written quality of the paper would benefit from further refinement.

- When evaluation metrics such as R², KGE, and F1-score are first introduced in the main text, it would be helpful to briefly define them (see the note after this list) or, at the very least, direct readers to where the definitions can be found (e.g., in the appendix).
- The presentation of results in Table 1 is confusing and potentially misleading. The reported values for R², KGE, and F1-score all exceed 1, which is unconventional and suggests a transformation has been applied. Upon checking the appendix, it appears that KGE has been multiplied by 100, and it is reasonable to assume the same was done for R² and F1-score—though this is not clearly stated. Rather than burying this detail in the appendix, I strongly recommend that the authors explicitly mention this scaling in the main text, ideally in the caption or title of Table 1. This kind of transformation is non-standard, and without proper explanation, it risks confusing readers and misrepresenting the model's performance.
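For reference, the standard definition of KGE (Gupta et al., 2009), which such a brief in-text definition could state, is

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2},$$

where $r$ is the linear correlation between simulated and observed discharge, $\alpha = \sigma_{\mathrm{sim}} / \sigma_{\mathrm{obs}}$ is the variability ratio, and $\beta = \mu_{\mathrm{sim}} / \mu_{\mathrm{obs}}$ is the bias ratio.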
Limitations
yes
Justification for Final Rating
Thank you to the authors for their time and detailed responses during the rebuttal. I’ve reviewed all comments and appreciate the effort made to address the concerns. I would like to maintain my original score.
Formatting Issues
No
Thank you for your time and the recognition of the novelty of this work. We are glad that you found our work creative and that it addresses a critical gap for future research.
When evaluation metrics such as R², KGE, and F1-score are first introduced in the main text, it would be helpful to briefly define them or, at the very least, direct readers to where the definitions can be found (e.g., in the appendix).
Thank you for the suggestion. We described the metrics in the Supplementary due to the limited space for the main text. We will follow your suggestion and direct the readers to where the definitions can be found.
The presentation of results in Table 1 is confusing and potentially misleading. The reported values for R², KGE, and F1-score all exceed 1, which is unconventional and suggests a transformation has been applied. Upon checking the appendix, it appears that KGE has been multiplied by 100, and it is reasonable to assume the same was done for R² and F1-score—though this is not clearly stated. Rather than burying this detail in the appendix, I strongly recommend that the authors explicitly mention this scaling in the main text, ideally in the caption or title of Table 1. This kind of transformation is non-standard, and without proper explanation, it risks confusing readers and misrepresenting the model’s performance.
Thank you for this comment. We agree with you about the metrics definition. Following your comment, we will remove the transformation and report the metrics in their conventional values, to be consistent with the values reported in the figures. Note that we still use the transformation in the rebuttal for consistency with the paper.
Dear Reviewer yDqB,
Thank you again for your time and reviewing. We hope that the response has resolved your concerns. Please let us know if there are still any open issues.
The paper studies the river discharge and flood forecasting problem and proposes a Mamba-based model to achieve high-resolution (0.05°) forecasting. Reviewers all appreciate the novelty from the application side but have concerns about the baselines, uncertainty quantification, and reproducibility. After reading the paper, the initial reviews, and the author-reviewer discussions, the AC agrees that the paper studies an important problem and comes up with a solid method. While I suggest acceptance of the work, I strongly suggest that the authors open all datasets and code, and provide more baselines from the spatio-temporal forecasting domain (e.g., refer to PredBench).