Identifying Spatio-Temporal Drivers of Extreme Events
We present a deep learning model designed to leverage climate data to identify the drivers of extreme-event impacts.
Abstract
Review and Discussion
In this paper, the author investigates a novel, significant, and practical problem: how to efficiently identify extreme anomaly events from climate data. To address the temporal delays between anomalies and extremes and the spatially uneven response of anomaly events, the author first innovatively constructs three comprehensive datasets, including synthetic and real-world datasets. Next, the author proposes an end-to-end spatio-temporal anomaly detection network model, with key concepts including independent spatio-temporal encoders and a compact quantization layer adapted for anomaly detection. Finally, the author conducts detailed experiments on extreme anomaly event detection tasks using two types of datasets to demonstrate the effectiveness of the proposed datasets and methods.
Strengths
- This paper addresses an important but overlooked issue in practical scenarios: identifying extreme anomaly events from climate data. The motivation for this research is intuitive and substantial, as the task of recognizing extreme anomaly events is crucial for understanding climate patterns and can be applied in significant areas such as agricultural production and social activities.
- Unlike traditional anomaly detection, anomaly events typically exhibit temporal delay characteristics and spatially uneven responses. Therefore, the author first constructs an anomaly event detection dataset approximating the historical Earth system based on reanalysis and remote sensing datasets. However, there are reliability issues with extreme drought events obtained from remote sensing data as ground truth. Consequently, the author further develops multiple synthetic datasets to facilitate more convenient and reasonable experimental analysis and guidance. I greatly admire and appreciate the immense effort behind this work.
- Furthermore, the author presents an end-to-end neural network solution. Despite the limited technical innovations, the performance achieved is still impressive. In particular, the thorough experimental comparisons and analyses are commendable.
Weaknesses
- This paper involves extensive preprocessing and adaptation of datasets. Although the author has not released the related code and data details, I believe a thorough introduction to the datasets or code is necessary for future use, which is crucial for subsequent work. This raises another issue: I find the contributions from the problem and dataset introduction to be overwhelming compared to those from the method itself. Therefore, it seems more suitable for the datasets and benchmarks track rather than the research track.
- This paper appears to remain limited to regional-level anomaly event detection. However, we know that the entire Earth system is dynamically interconnected and evolving; changes in one corner of the Earth can significantly impact distant regions (e.g., the butterfly effect). To model the spatiotemporal dynamics at the global level, the current backbone network seems inadequate for such scalability.
- Although the paper is structurally clear, there is still much room for improvement in writing and presentation. Particularly for interdisciplinary papers, the author should not assume that readers have backgrounds in multiple disciplines. Necessary background knowledge and related work should be supplemented as thoroughly as possible.
- I have also listed several confusions and suggestions in the paper. If the author can address my concerns, I would be willing to support its publication here and recommend an increase in its score, even though it is more suitable for the DB track.
Questions
- The second contribution point in Section 1 should be swapped with the first to reflect the paper's primary contribution more logically. The main contribution appears to be the introduction of a novel anomaly detection problem in Earth sciences, supported by detailed benchmark experiments, while the methodological innovation is limited and should be de-emphasized.
- The anomaly detection part in Section 2 is overly redundant. It should focus on the works most relevant to the new anomaly detection approach proposed in this paper, emphasizing the connections and main differences with existing work, as well as the motivations.
- In line 113, the phrase "but in spatio-temporal configurations of variables that potentially cause an extreme event with some time delay and at a potentially different location" lacks clarity about how these variables differ from standard anomaly variables. I suggest the author provide a simple illustrative example figure to help readers understand.
- Considering the large number of parameters of the Video Swin Transformer, is it suitable as a backbone network for global-level spatiotemporal anomaly detection? There are many lightweight alternative spatiotemporal prediction backbone networks. The author could consider adding experiments with different backbone networks. Although this is not mandatory, it would be commendable.
- Regarding the dataset, what is the spatial resolution? Specifically, what is the actual area of one pixel? How do different resolutions impact the final detection results? Are there any experiments addressing this?
- In Table 2(a), the comparison between the first and second rows shows that while the loss term improves extreme detection, it harms anomaly prediction. Ideally, it should benefit both. Why is this the case? Can you provide an explanation? Similarly, adding the loss term (comparing the third and first rows) shows it helps both types of detection, but the comparison between the fourth and second rows indicates it harms extreme detection, which seems contradictory. I hope the author can provide reasonable supplementary experiments and explanations.
- The text in the figures is too small and not reader-friendly. It should be just one size smaller than the main text for better readability.
Limitations
I have already listed them in the question section.
Necessary background knowledge and related work should be supplemented as thoroughly as possible.
Thank you for mentioning this issue. In the revised version, we will extend the introduction and related works sections to provide as much background knowledge as possible given the page limit.
The anomaly detection part in Section 2 is redundant
In the paragraph "Anomaly detection algorithms" in Section 2, we wanted to give a broader overview of why current methods for anomaly detection cannot be applied to the problem addressed in this work. We will revise this part of Section 2 to focus only on the most relevant anomaly detection approaches and emphasize the differences to our approach. In fact, none of these approaches is similar to our proposed approach.
Font size in figures
We will increase the font size in the figures.
Thank you for the amazing work and detailed feedback in such a short time! Most of my confusion has been cleared up. I only have two questions left:
It seems like the paper only mentions inference time. What about the training time and GPU resources used?
Is it possible to provide a brief overview of the dataset and anonymized code repository for review now?
Although the author did not reply to me, I still give them a point for their previous efforts.
We appreciate and thank the reviewer for recognizing and appreciating the efforts behind this work. Please see our responses to the questions below.
Code and datasets
Please note that we will release the code and datasets along with the documentation upon publication. We describe the datasets and benchmarks in the Appendix in Sections A, H, and I (pages 18-21 and 32-35). Please let us know if more information is required. We are happy to clarify any open questions and provide more details.
The contribution of the benchmark in comparison to the method
We appreciate that the addressed problem and proposed benchmarks are considered an important contribution. Since the proposed problem has not been addressed before, we also present a novel approach that is not a simple adaptation of existing works to the problem. As you have acknowledged, the proposed approach outperforms adaptations of one-class unsupervised [25; 107; 37], reconstruction-based [44; 49], and multiple instance learning [92; 93; 94] approaches to the problem. We also included additional comparisons in the global response. We thus argue that both the benchmarks and the proposed method are valuable and important contributions. However, we will change the order of the contributions in Section 1 as suggested.
The backbone and the global level analysis
We already compare 6 different backbones, which differ in the number of parameters, in the Appendix (Table 11, page 29). We agree that using the Video Swin Transformer as a backbone at the global level would be very expensive. However, the backbone can be replaced by more efficient backbones. We conducted an additional experiment where we replaced the attention block with a linear selective state space model (Gu et al., "Mamba: Linear-time sequence modeling with selective state spaces", 2023). The results below are F1 scores on the validation set for anomaly/extreme detection:
| Backbone | Hidden dimension | Params. | F1-score (anomalies / extremes) |
|---|---|---|---|
| 3D CNN | 8 | 63k | 57.15 / 91.21 |
| Video Swin Transformer | 8 | 19k | 81.22 / 91.16 |
| Mamba | 8 | 15k | 82.15 / 90.18 |
| 3D CNN | 16 | 250k | 70.93 / 93.75 |
| Video Swin Transformer | 16 | 62k | 82.78 / 92.45 |
| Mamba | 16 | 56k | 83.29 / 92.01 |
| 3D CNN | 32 | 998k | 84.95 / 93.43 |
| Video Swin Transformer | 32 | 230k | 84.14 / 93.12 |
| Mamba | 32 | 214k | 84.00 / 93.43 |
Please note that we did not yet fully train the network with Mamba as backbone, but the results already indicate that using Mamba instead of the Swin Transformer achieves similar or better results with fewer parameters. In contrast to the Swin Transformer, Mamba can be scaled to the global level. Note that we already cover large areas at the continental scale (Table 17, page 33), and we think that identifying spatio-temporal relations at the continental scale is already challenging enough at the moment. We also mention this limitation in lines 345-348.
I have also listed several confusions and suggestions in the paper. If the author can address my concerns, I would be willing to support its publication here and recommend an increase in its score, even though it is more suitable for the DB track.
We appreciate your careful reading of our work and your suggestions. Please find below our responses to your questions that have not been already answered.
Line 113 and a simple illustrative example figure
We will revise the sentence. Please see the author rebuttal and the general response above. We have also included a figure in the PDF (Fig. 2).
The spatial resolution
We provide the spatial resolution in Section 4.2, page 5 and in Appendix Section I.I., page 34. Please note that, depending on the coordinate system, the area on the Earth's surface changes with the location (i.e., it decreases toward the poles). The spatial resolution for ERA5-Land is on the regular latitude-longitude grid. CERRA Reanalysis has a spatial resolution of km km on its Lambert conformal conical grid, while the remote sensing data has a high resolution of . We will make this clearer in the revision. We conducted an additional experiment regarding the impact of the spatial resolution:
| Dataset | Region | Spatial resolution | F1-score anomalies/extremes |
|---|---|---|---|
| ERA5-Land | Europe | | - / 31.87 |
| ERA5-Land | Europe | | - / 30.09 |
| Synthetic CERRA | - | km | 82.78 / 92.45 |
| Synthetic CERRA | - | km | 68.42 / 79.77 |
This shows that the spatial resolution matters as expected.
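For such a resolution comparison, a field can be coarsened by simple block averaging; below is a minimal sketch with a hypothetical `coarsen` helper (an illustration only, not the actual preprocessing used in the paper):

```python
import numpy as np

def coarsen(field, factor):
    """Block-average a 2D field onto a coarser grid.

    Hypothetical helper for illustration, not the authors' preprocessing.
    Assumes both dimensions are divisible by `factor`.
    """
    h, w = field.shape
    return field.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

fine = np.arange(16.0).reshape(4, 4)  # toy 4x4 "high-resolution" field
coarse = coarsen(fine, 2)             # 2x2 field of 2x2 block means
```

Each coarse pixel is the mean of a block of fine pixels, which mimics how small-scale anomalies get averaged out at lower resolution.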
Loss functions in Table 2(a)
Without the loss, the detection of anomalies is not reliable since pixels in regions and intervals where no extreme event occurred can be assigned to (anomaly) as well. The other loss functions do not prevent this from happening.
In the case of multi-heads, we observe that anomalies are identified in a small subset of variables because the network omits some variables if there is a correlation with other variables. Please see Fig. 4 in the rebuttal PDF. In the case of a single head, such flips occur less often, but they can occur. If the loss is used, such flips cannot occur, and the multi-head improves both F1-scores by a large margin. When comparing rows 2 and 4, there is a slight decrease in extreme prediction but a large improvement in anomaly detection. Note that there is always a trade-off between extreme and anomaly detection. The anomalies form an information bottleneck: the more information goes through the bottleneck, the better the extreme prediction gets. Without any anomaly detection, the extreme prediction is best, as shown in Table 12 (page 29), but the increase in F1 score is only moderate. This is also visible in rows 2 and 3 in Table 2(b). Cross-attention improves extreme prediction, but it hurts the detection of anomalies since information is propagated between the variables.
Thank you for the amazing work and detailed feedback in such a short time! Most of my confusion has been cleared up. I only have two questions left:
Thank you for your response. We are happy that we could answer your questions.
It seems like the paper only mentions inference time. What about the training time and GPU resources used?
The training was done on a cluster with NVIDIA A100 80GB and NVIDIA A40 48GB GPUs (line 927). The training on the real-world data for EUR-11 took about hours with a Swin model, , and NVIDIA A GPUs. In the following, we give a rough estimation for training on the synthetic CERRA for 1 epoch:
| Algorithm | time (min) | GPU |
|---|---|---|
| SimpleNet | | A100 |
| STEALNet | | A100 |
| UniAD | | 4 A100 |
| DeepMIL | | A40 |
| ARNet | | A40 |
| RTFM | | A40 |
| Ours | | A40 |
SimpleNet was trained with a pretrained backbone. The training time includes some postprocessing to compute metrics on the training set. The time might also differ depending on the I/O during training and the number of available workers.
Is it possible to provide a brief overview of the dataset and anonymized code repository for review now?
We have prepared two anonymized repositories: 1) a repository including the framework to generate synthetic data, and 2) a main repository which includes the main scripts for training/testing on the real and synthetic data. Please note that the dataset is very large (about 1.9 TB), and the uploaded data includes only a subset of it. We have not yet had the time to provide detailed documentation, which we will prepare when releasing the data and code. Following the review guidelines, we have sent the link to the AC, who can forward it to you.
Although the author did not reply to me, I still give them a point for their previous efforts.
Thank you. Please see our responses above.
Well done! My final suggestion is that the author could create some subsets of the data to make it easier for future researchers to innovate and follow up on the methods, considering that the current dataset is too large. Anyway, thanks to the author for the clarification, and I hope they address all the reviewers' comments in the final version.
Thank you for the note. We will make it possible to download a subset of the data.
This work aims to identify the atmospheric drivers of extreme droughts. For this, the authors assume that for every impact of extreme droughts measurable with remote sensing, there is a precursor signal in assimilated land surface and meteorological data. The work proposes to identify these precursor signals with inherently interpretable machine learning: a computer vision model is trained to map input data into a binary latent space. These binary encodings are subsequently used to predict the future occurrence of drought impacts. In that way, the binary encodings are assumed to be interpretable as "anomaly in atmospheric data" and "normal atmospheric data". Models trained in this way achieve good prediction skill (F1 score ~0.9) of drought impacts on synthetic data. Also, the binary encodings match anomalies in the synthetic input data reasonably well (F1 score ~0.8). The prediction skill on real-world data is rather low: F1 score ~0.2 for drought impacts.
Strengths
- This work introduces a potentially novel variant of anomaly detection: Detecting only those Anomalies that are predictive for correlated impacts. This variant is relevant for studying the drivers of extreme drought impacts.
- The work compares a wide array of baselines and performs many ablation studies.
- Synthetic experiments are conducted to study the proposed method before shifting to real-world data
- The main text of the paper is reasonably concisely written, with many additional details supplied in the appendix.
Weaknesses
Major points:
- Confusing terminology: The authors speak about "anomalies" and "extreme events" without properly defining what is meant by each term. Furthermore, I believe the terminology used is non-standard in the field, and I propose the authors instead use:
- Land surface impacts of extreme events: These are what you call "extreme events", i.e., the VHI below a certain threshold. I would argue what you mean is the impact of extreme events on state variables representing the land surface state (the ecosystem health). In your case droughts, but this could be any type of extreme event.
- Atmospheric drivers of extreme events: These are what you call "anomalies", but both the VHI below a certain threshold and the surface temperature above a certain level could be considered anomalous. Hence I recommend you rather focus on drivers here; these could be atmospheric or hydrological state variables (e.g. temperature or soil moisture) or land-atmosphere fluxes (e.g. evaporation).
- Luckily, you should be able to resolve this issue through simply rewriting your article.
- Scientific validity of experiment design: I find a few choices of the authors a bit odd in the experiment design:
- Albedo "fal" / "al" & soil temperature "stl1" are state variables of the land surface that should be very related to reflectance. In fact, I believe it is not unlikely that remotely sensed brightness and brightness temperature have been assimilated to obtain these variables. This is not too different from VHI, which is created from similar remote sensing products. Thus I would say it should, if anything, be an output of your approach. And even if no satellite products have been used to assimilate these variables, this would not solve the issue but rather raise another one: then the variables would entirely depend on prescribed schemes in the land surface model of IFS, which means your whole approach is limited by how well IFS reproduces these variables, which I assume is pretty poor, so any "anomalies" you detect in these variables could be considered spurious.
- Soil moisture "swvl1" / "vsw" is a state variable of the hydrological cycle, and thus should be highly correlated with VHI. However, as far as I know, its representation in ECMWFs land surface reanalysis is relatively poor (e.g. https://ieeexplore.ieee.org/document/9957057)
- The anomalies in precipitation that would drive drought are no precipitation for many weeks. So if you produce a binary encoding for every single time step, this should not be very predictive. In fact, precipitation is somewhat exponentially distributed: in most regions, many days observe 0 precipitation, even under no drought conditions. One way you could potentially circumvent this issue is by using accumulated precipitation over many weeks, e.g. through an exponential moving average. Another way could be to actually implement a simple water balance model. Then again, if you use soil moisture as input, it is essentially coming from such a water balance model...
- In other words: I think scientifically most interesting would be if you could connect anomalies in atmospheric variables like temperature, humidity and precipitation to the anomalies in land surface states (VHI). Then you could find the primary drivers for the impacts on vegetation and their time lag, and their spatio-temporal variability, which could be super interesting to study.
- Low predictive skill: The performance on the real-world data is pretty bad if I understand correctly (F1 of 0.2 - 0.3, Table 14). In addition, the synthetic experiments revealed there is typically lower skill on the latent binary variables compared to the outputs, so this makes me wonder if the predicted anomalies for extreme events mean anything at all?
- Many baselines for anomaly detection, but none for interpretable forecasting. You compare with a lot of baselines, which is generally great. But all of these perform some sort of anomaly detection on the inputs, which you then assume to be predictive features for your VHI labels. To me, a more interesting baseline would be one that directly predicts the VHI label from the inputs and then uses some post-hoc method to try to explain the predictions (e.g. SHAP, integrated gradients, ...). Because then you are comparing predictions of the drivers directly, and not just general anomalies.
Minor Points:
- This work seems to focus a lot on the spatial aspect of things. However, arguably what matters most for drought at a particular pixel is the water balance at that pixel. And that is primarily driven by precipitation and evapotranspiration at that pixel, with only runoff introducing some sort of spatial component.
- VHI < 26 may be the result of not just drought. A heatwave could have a similar effect. Also, VHI is a general vegetation condition index, not just for agricultural areas, but also for natural land cover. You may wish to rephrase your framing of this work studying the "drivers of impacts of agricultural droughts" into the "drivers of impacts of extreme events on vegetation".
- An alternative approach could be to not just predict a binary label, but rather the exact value of VHI. This would be similar to vegetation forecasting (e.g. https://www.sciencedirect.com/science/article/abs/pii/S003442572030256X , https://openaccess.thecvf.com/content/CVPR2024/html/Benson_Multi-modal_Learning_for_Geospatial_Vegetation_Forecasting_CVPR_2024_paper.html , https://gmd.copernicus.org/articles/17/2987/2024/ ). Probably it would be relevant to mention this related stream of literature also in the related works section. Also you may want to consider adding a comment on why you directly predict the (VHI < 26) label instead of the raw VHI, and afterwards apply detection.
- Quite a few typos, e.g. l.22 "very" instead of "vary"
Questions
Climate data has extremely high spatio-temporal autocorrelation. How do you ensure your models are not overfitting?
Limitations
The authors mention a variety of limitations in section 6. However, I believe as is the work has more fundamental flaws that I mention in the Weaknesses section, which, unless fixed, should definitely be named as limitations.
Thank you for the detailed review and the thoughtful feedback. We are glad that you found our work interesting and important for future research.
Terminology
Thank you for this suggestion. We will follow your suggestion and define the terminology in the introduction section of the paper in the revised version. Please see the discussion in the global response above.
Albedo & Soil temperature
Please note that ERA5-Land does not use data assimilation directly. The evolution and the simulated land fields are controlled by the ERA5 atmospheric forcing. We conducted 4 more experiments on both CERRA and ERA5-Land where we trained models that take only one variable al/fal or stl as input and predict the extreme events directly without the anomaly detection step. In all of these experiments, the F1-score was very low. In the next experiment, we increased the threshold for VHI and trained new models to predict extremes directly. The results for the validation set are shown below:
| Dataset | Region | Variable | VHI<26 | VHI<40 | VHI<50 |
|---|---|---|---|---|---|
| ERA5-Land | EUR-11 | stl1 | 05.67 | 31.53 | 58.36 |
| ERA5-Land | EUR-11 | t2m, fal, e, tp, stl1, swvl1 | 33.80 | 46.72 | 68.71 |
The first potential reason to consider is that some land surface variables might deviate from reality. Another reason might be that when training only on extremes (VHI < 26), there are not enough samples to learn the relations. Please note that VHI is a combination of both TCI and VCI. Most extremes (VHI < 26) might result from a deficiency in both stl/t2m and vsw. This might also explain why stl and albedo are not that informative for predicting very extreme events. We will discuss this issue.
Soil moisture in ERA5-Land
It is true that the volumetric soil water variable in ERA5-Land has some biases. One solution is to use satellite observations for the top layer. However, our experiments showed that the model relates vsw anomalies to the extremes in VHI and provides reasonable predictions. Although we do not consider this a major issue, we will mention it in the revised version.
Anomalies in precipitation
Thank you for pointing this out. At the moment, we treat each input variable in the same way and do not apply any pre-processing that is specific to a single variable. The proposed approach to pre-process precipitation is indeed an interesting direction. We will mention this in the discussion in the revised version.
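As a concrete illustration of the suggested pre-processing (a sketch only, not part of our pipeline), an exponential moving average of daily precipitation with a small smoothing factor would accumulate information over many weeks:

```python
import numpy as np

def ema_accumulate(precip, alpha=0.05):
    """Exponentially weighted accumulation of daily precipitation.

    A small alpha makes the signal reflect deficits over many weeks,
    as suggested by the reviewer. Illustrative sketch only.
    """
    out = np.empty(len(precip), dtype=float)
    acc = float(precip[0])
    for t, p in enumerate(precip):
        acc = alpha * p + (1.0 - alpha) * acc
        out[t] = acc
    return out

daily = np.zeros(60)  # 60 daily totals
daily[:10] = 5.0      # a wet spell on the first 10 days, then 50 dry days
smoothed = ema_accumulate(daily)
# the smoothed series decays slowly after the wet spell, exposing the persistent deficit
```

In contrast to per-timestep binary encodings, the smoothed series stays informative even when most daily values are exactly zero.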
Connecting anomalies in atmospheric variables to the anomalies in land surface states (VHI)
We agree that this is the mid-term goal, but it is beyond the scope of the paper. The purpose of the paper is to present a novel approach that addresses this very important problem and a benchmark that allows systematic evaluation of approaches for this new task. The impact will be two-fold. First, methods can be further developed to improve the detection of drivers on the synthetic datasets. Second, the anomalies that are detected by the method in atmospheric variables can be further investigated by statistical approaches.
The performance on the real-world data
We respectfully disagree regarding this point. The performance depends on the type and ratio of extremes, the spatio-temporal resolution, and the quality and consistency between the remote sensing and the reanalysis data. The performance is consistent with other recent works on predicting extremes from real-world data (Nearing et al. "Global prediction of extreme floods in ungauged watersheds", Nature, 2024). Note that the F1 scores increase substantially when the threshold on VHI is increased (see the previous answer regarding stl1). We include a figure in the rebuttal PDF (Fig. 1), which shows the predictions. Given the quantitative and qualitative results, we think that the model provides reasonable predictions. Note that it is not required to predict all extremes in order to learn some relations from the predicted events.
Baseline as an interpretable forecasting
Please see the general author rebuttal and response above for the requested experiment.
The spatial aspect
We present a general approach that is not limited to droughts and specific input variables. We used droughts only as a real-world example. For instance, variables over sea regions can impact variables over land regions.
VHI is a general vegetation condition index
Thank you for this suggestion. We mention this issue in lines 987-989. We followed the general association of VCI and VHI with agricultural droughts, see, e.g., (Hao et al. "Seasonal drought prediction: Advances, challenges, and future prospects.", Reviews of Geophysics, 2018). We will discuss this in the revision.
Predicting the exact values of VHI
Since we will include the new baselines based on forecasting and integrated gradients (see the previous answer), we will briefly discuss the mentioned methods on vegetation forecasting. The work by Shams Eddin et al. ("Focal-TSMP: deep learning for vegetation health prediction and agricultural drought assessment from a regional climate simulation", GMD 2024) has shown that it is hard to predict VHI directly. Instead, they predict NDVI and BT and then normalize the predicted values to estimate VHI. Another reason is that some extremes cannot be derived from satellite products but are stored as binary or discrete variables in databases. With such scenarios in mind, we decided to represent extremes as binary variables. Extending the approach to continuous variables is a potential future direction.
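For readers unfamiliar with the index, VHI is commonly constructed from VCI and TCI as follows (standard definitions from the literature; a sketch, not our exact pipeline):

```python
def vhi(ndvi, ndvi_min, ndvi_max, bt, bt_min, bt_max, alpha=0.5):
    """Vegetation Health Index from the standard VCI/TCI definitions.

    VCI rescales NDVI, TCI rescales brightness temperature (BT) against
    their climatological min/max; alpha = 0.5 is the common weighting.
    """
    vci = 100.0 * (ndvi - ndvi_min) / (ndvi_max - ndvi_min)
    tci = 100.0 * (bt_max - bt) / (bt_max - bt_min)
    return alpha * vci + (1.0 - alpha) * tci

# low greenness and high temperature -> stressed vegetation (low VHI)
v = vhi(ndvi=0.25, ndvi_min=0.2, ndvi_max=0.7, bt=305.0, bt_min=280.0, bt_max=310.0)
```

Thresholding such a value (e.g. VHI < 26) then yields the binary extreme label used in the paper.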
Typos
Thank you for the careful reading. We will fix this typo and check the paper for any other typos.
How do you ensure your models are not overfitting?
We follow the common practice in climate science of defining different time periods for the training/validation/test sets (Table 17 in the Appendix, page 33). As also shown in the Appendix (Table 11, page 29), increasing the number of model parameters still does not show a sign of overfitting.
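This splitting practice can be sketched as follows (an illustrative helper with made-up cut-off years; the actual periods are given in Table 17):

```python
import numpy as np

def temporal_split(timestamps, train_end, val_end):
    """Index split by disjoint time periods (illustrative helper).

    Keeping the splits temporally disjoint prevents leakage through the
    strong spatio-temporal autocorrelation of climate data.
    """
    t = np.asarray(timestamps)
    train = np.where(t < train_end)[0]
    val = np.where((t >= train_end) & (t < val_end))[0]
    test = np.where(t >= val_end)[0]
    return train, val, test

years = np.arange(1990, 2021)  # one sample per year, 1990-2020
tr, va, te = temporal_split(years, 2011, 2016)
```

Because no time step appears in more than one split, a model cannot score well on validation or test merely by memorizing temporally adjacent training samples.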
Dear Authors,
thank you for taking the time to address my comments.
The additional results on IG are convincing, the IG models achieve similar performance on "extreme" detection, but are much worse on identifying the drivers (both quantitatively and qualitatively: artifacts in t2m and missing soil moisture influence).
Re: the chosen variables. Thanks for presenting further results. I still believe this work would be much more impressive if it would not use albedo and soil temp as inputs and instead focus on indicators of atmospheric and hydrological conditions. Along this line, SPEI could also be interesting to look at, as it is often used to define drought, but does not always reflect impacts on vegetation.
Re: performance. I read your argument as: other works have similar "poor" performance. While a stronger performance would certainly be more impressive, I'd argue it is not essential for this paper's merit (which is the creative methodology). Still, it would be important to elaborate in the paper that drawing conclusions on drivers from weaker predictive models may render those interpretations invalid.
For now I will raise the score mildly, and will consider raising further at the end of the rebuttal period. Thanks!
Dear Authors, thank you for taking the time to address my comments.
Thank you for your review and your suggestions in improving the quality of this work. It is highly appreciated.
The additional results on IG are convincing, the IG models achieve similar performance on "extreme" detection, but are much worse on identifying the drivers (both quantitatively and qualitatively: artifacts in t2m and missing soil moisture influence).
Thanks.
Re: the chosen variables. Thanks for presenting further results. I still believe this work would be much more impressive if it would not use albedo and soil temp as inputs and instead focus on indicators of atmospheric and hydrological conditions. Along this line, SPEI could also be interesting to look at, as it is often used to define drought, but does not always reflect impacts on vegetation.
In this work, we have chosen VHI from remote sensing data because it cannot be directly derived from the input reanalysis, which makes the task very challenging. If we remove albedo and soil temperature from the input, the results on the real data would not change much. We will include such an experiment. We agree that it is very interesting to apply the method to other combinations of input variables and other indicators like SPEI, SPI, PDSI, or SMA in the future. We will release and document the code such that it will be simple to select any subset of the input variables and apply it to other indicators if data is available.
Re: performance. I read your argument as: other works have similar "poor" performance. While a stronger performance would certainly be more impressive, I'd argue it is not essential for this paper's merit (which is the creative methodology). Still, it would be important to elaborate in the paper that drawing conclusions on drivers from weaker predictive models may render those interpretations invalid.
Thank you for your suggestion. We will discuss this limitation of weaker predictive models in Section 6.
For now I will raise the score mildly, and will consider raising further at the end of the rebuttal period. Thanks!
Thank you.
Actually, one more thing. Given you are going to do a major rewrite regarding the terminology of anomalies and extreme events, how are you going to change the paper title to reflect this?
Currently, we would change the title to: Identifying spatio-temporal drivers for extreme events.
The paper proposes a novel approach to identifying spatio-temporal anomalies correlated with extremes such as drought. Using a neural network, extreme events are predicted by learning spatio-temporal binary masks of anomalies identified in climate data. The network is trained end-to-end to predict both anomalies and extremes from physical input variables, focusing on the spatio-temporal relations between them.
Strengths
Introduces a new method for identifying spatio-temporal anomalies that are correlated with extreme events.
Weaknesses
The model is dependent on temporal resolution, which might not be well documented in all parts of the world. The method only shows results on droughts. Binary masks tend to oversimplify real-world events. The method seems to bluntly connect anomalies with extremes without specific theoretical reasoning.
Questions
Have you tried the method on other extreme events other than droughts? And what if there is discontinuity in terms of temporal data?
Limitations
If possible, please add the reasoning behind feature representations and extremes.
We appreciate the reviewer’s feedback for this work and that the reviewer recognized the novelty of this work. We respond to the questions below.
The method only shows results on droughts. Have you tried the method on other extreme events other than droughts?
We tested the algorithm on 9 types of extreme events: real-world agricultural droughts (Table 14 in Appendix, page 30), synthetic CERRA extreme events (Tables 1 and 3, pages 6 and 20), 5 variations of the synthetic CERRA extreme events (please see Fig. 3, where we changed the coupling between the climate variables and consequently changed the type of the events), synthetic NOAA extreme events (Tables 4 and 6, pages 20 and 27), and synthetic artificial extreme events (Tables 5 and 7, pages 21 and 27). The synthetic datasets show that the approach is not limited to droughts and can be applied to other extremes. The main difficulty is that suitable datasets for extreme events are rare, i.e., the dataset should have a high resolution and a long-term and large-scale coverage. Building a dataset from existing sources is thus a major effort (see Sections H and I in the Appendix, pages 32-35). In the future, we aim to test the model on more types of events like floods, but this requires preparing the data first.
The model is dependent on temporal resolution, which might not be well documented in all parts of the world. What if there is discontinuity in terms of temporal data?
We assume that you are referring to the temporal gaps in the reanalysis and remote sensing data. In fact, this is an issue for real-world data. To tackle this issue for the remote sensing data, a temporal decomposition was conducted to remove some discontinuities and aggregate the data into a weekly product. However, some pixels will still be empty. For these, we first check if the pixel was covered by another satellite. If not, we flag the pixel as invalid and discard it from training and evaluation. Regarding the input reanalysis data, we first normalize the data using pre-computed statistics and then replace the invalid pixels with zero values. We will add more details to the Appendix.
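The gap handling described above can be illustrated with a minimal numpy sketch. This is a hypothetical illustration, not the authors' implementation; the function `preprocess`, the toy field, and the statistics are assumptions for the example.

```python
import numpy as np

def preprocess(x, mean, std, valid_mask):
    """Sketch of the described preprocessing: normalize the input
    with pre-computed statistics, then zero out invalid pixels."""
    z = (x - mean) / std               # normalize using pre-computed mean/std
    z = np.where(valid_mask, z, 0.0)   # replace invalid pixels with zeros
    return z

# toy example: one 2x2 temperature field with one missing pixel
x = np.array([[280.0, 285.0], [290.0, np.nan]])
valid = ~np.isnan(x)                   # invalid pixels are flagged and ignored
z = preprocess(np.nan_to_num(x), mean=282.0, std=4.0, valid_mask=valid)
```

The validity mask would also be used to exclude the flagged pixels from the loss and from evaluation.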
Binary masks tend to oversimplify real-world events.
Please note that we only use the binary masks as flags where anomalies or extremes are detected. The binary approach also allows applying the method in the future to other extremes that cannot be derived from satellite products but are stored in a binary format in databases. We agree that an analysis of the physical implications requires continuous values, whereas binary masks only indicate the existence of anomalies or extremes; see paragraph "Physical consistency" in Section 5.2, page 9.
The method seems to bluntly connect anomalies with extremes without specific theoretical reasoning.
Our aim is to investigate the relations between extreme events and their drivers from a data-driven perspective. The synthetic examples demonstrate that our proposed approach is able to achieve this. In contrast to statistical methods, our method does not require a prior hypothesis about drivers of extremes; instead, it generates hypotheses that can be verified by statistical methods in a second step. We believe that this is an important direction since climate reanalyses provide huge amounts of data and it is infeasible to test all combinations. Data-driven approaches are therefore needed to generate potential candidates. Please also see the related discussion in the first paragraph, "Anomalies and extreme events detection in climate data", of Section 2.
If possible, please add the reasoning behind feature representations and extremes.
We are not sure if the previous answers already answered the question.
This paper proposes an approach to learning the spatio-temporal relationships between events with spatial differences and temporal delays. Specifically, they propose a method that identifies spatial-temporal anomalies in multivariate climate data that are correlated with extremes. The authors conduct experiments on both synthetic data and climate reanalysis data.
Strengths
- The problem of anomaly detection and learning their relations is crucial.
- The summary of the relevant literature is relatively complete.
- The authors conduct experiments on both synthetic data and real-world data.
Weaknesses
- The motivation of the model design is not clear. For example, why do you need to detect anomalies and then classify the extreme events instead of detecting extreme events directly? Such a pipeline design will lead to more accumulated errors.
- The writing can be improved. For example, there are typos, such as 'MIL Is a weakly ...'. And the difference between anomaly and extreme in this paper is not clear.
- The title is somehow misleading; what the paper actually does is more about extreme event prediction than learning spatio-temporal relations.
Questions
- Anomaly detection is a classification problem with a severe class imbalance problem; how did the authors tackle this? What is the ratio of extreme events to ordinary events?
- What is the difference between anomaly and extreme in this paper? Could you provide some examples to illustrate this?
- When I read spatio-temporal relations, I thought this paper would build spatio-temporal graphs to describe the relations. Have you considered using graphs to tackle this problem?
Limitations
The authors adequately addressed the limitations.
Thank you for your time and for reviewing our work. We are glad that you found the task and the problem we address in this work important. In the following, we answer your questions.
Difference between anomaly and extreme in this paper is not clear. What is the difference between anomaly and extreme in this paper? Could you provide some examples to illustrate this?
We will define the terms more precisely in the revision. Please see the author rebuttal and the global response above for clarification.
The title is somehow misleading, actually what the paper does is more about extreme event prediction instead of learning spatial-temporal relations. The motivation of model design is not clear. For example, why do you need to detect the anomaly and then classify the extreme events instead of detecting extreme events directly? Such a designed pipeline will lead to more accumulated errors.
There seems to be some misunderstanding about the objective of this work, probably caused by the terms "anomaly" and "extreme". Our work is not about extreme event prediction in the first place but about identifying the anomalous drivers of extreme events like droughts. Note that droughts can be observed, but it is unclear which anomalies in the atmospheric or hydrological state variables are spatio-temporally connected with a drought. These anomalies can occur earlier than the drought and at a different spatial location.
We aim to identify these anomalies that are spatio-temporally connected with an observed extreme event. Since we only observe droughts and are only interested in atmospheric or hydrological state variables that are spatio-temporally connected with droughts, we design a network (Figure 1) that spatio-temporally identifies anomalies in the atmospheric or hydrological state variables and predicts the droughts from them in an end-to-end fashion. In other words, we force the network to reduce the input variables to spatio-temporal anomalies (quantization) that are sufficient to predict the drought. Due to the quantization, the accuracy of predicting a drought is lower compared to predicting droughts without identifying anomalies in the input, as reported in lines 314-319 and in Table 12 (Appendix C.5, page 29). The drop in the F1-score on drought detection, however, is relatively small. Nevertheless, we are interested in detecting the driving anomalies that are spatio-temporally connected to a drought and not the drought itself. We will clarify this and revise the title if necessary.
Anomaly detection is a classification problem with severe class imbalance problems, how did the authors tackle this problem?
To address the class imbalance issue, we utilized a weighted binary cross entropy loss. Please see Appendix C.3 (page 28) and Table 10 (page 29). We will make this clearer in the method section of the revised version.
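For illustration, a weighted binary cross entropy can be sketched as follows. This is a generic sketch, not the authors' exact loss; the function name and the choice of weight are assumptions for the example.

```python
import numpy as np

def weighted_bce(p, y, w_pos):
    """Weighted binary cross entropy: the rare positive class
    (extremes/anomalies) is up-weighted by w_pos; a common choice
    is the inverse class frequency."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(w_pos * y * np.log(p) + (1 - y) * np.log(1 - p))

# a missed positive is penalized far more with up-weighting
loss_weighted = weighted_bce(np.array([0.1]), np.array([1.0]), w_pos=99.0)
loss_uniform = weighted_bce(np.array([0.1]), np.array([1.0]), w_pos=1.0)
```

With a positive ratio of roughly 1%, an inverse-frequency weight of about 99 would balance the two classes' contributions to the loss.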
What is the ratio of extreme events to ordinary events?
The ratios of extremes are reported in Tables 3-5 (second last column, pages 20-21) in the Appendix. For the synthetic CERRA reanalysis (Table 3, page 20), the ratio of extremes is 1.16%, while the ratio of the anomalies correlated with these extremes is 1.69% and the ratio of the random anomalies uncorrelated with extremes is 1.32%. For convenience, we summarize the ratios below:
| Dataset | extremes (%) | correlated anomalies (%) | random anomalies (%) |
|---|---|---|---|
| Synthetic CERRA | 1.16 | 1.69 | 1.32 |
| Synthetic NOAA | 0.79 | 1.02 | 1.76 |
| Synthetic artificial | 1.24 | 1.81 | 2.93 |
Please note that there is no ground truth for anomalies in the real-world dataset. We only report the ratio of extreme events, which can be detected using remote sensing data:
| Dataset | Region | extremes (%) Val | extremes (%) Test |
|---|---|---|---|
| CERRA | Europe | 4.34 | 5.32 |
| ERA5-Land | Europe | 3.20 | 2.86 |
| ERA5-Land | Africa | 6.41 | 6.87 |
| ERA5-Land | North America | 3.68 | 6.61 |
| ERA5-Land | South America | 5.16 | 6.53 |
| ERA5-Land | Central Asia | 3.60 | 4.38 |
| ERA5-Land | East Asia | 3.16 | 3.05 |
When I read spatio-temporal relations, I thought this paper would build spatio-temporal graphs to describe the relations. Have you considered using graphs to tackle this problem?
Note that our real-world data, like the CERRA dataset, has a high spatial resolution (Table 17 in Appendix, page 33) and we consider 8 time steps. A spatio-temporal graph would thus consist of an enormous number of nodes, which would be computationally very expensive. Even for visualizing the results shown in Figures 16, 18, 20, 22, 24, 26, and 28 (pages 37-43), a spatio-temporal graph would not be suitable.
The writing can be improved. For example, there are typos, such as 'MIL Is a weakly ...'.
Thank you for the careful reading of our paper. We will fix this typo and check the paper for any other typos.
Thank you again for your time and reviewing. We hope that the responses have resolved your concerns. Please let us know if there are still any open questions.
We thank all reviewers for their efforts and the valuable comments. We appreciate the positive and encouraging comments by the reviewers that we briefly summarize:
- Reviewer fRrQ acknowledges that the proposed task is crucial for climate science and acknowledges the experiments on real and synthetic data.
- Reviewer 5qpR acknowledges the novelty of our method for identifying spatio-temporal anomalies that are correlated with extreme events.
- Reviewer TRJw acknowledges the novel variant of anomaly detection which is relevant for studying the drivers of extreme events, and appreciates the experimental evaluation, including the wide array of baselines, ablation studies, and the evaluation on both synthetic and real-world data.
- Reviewer wHsZ acknowledges that the work addresses a crucial, practical, and overlooked task. wHsZ also appreciates the immense effort behind this work, the thorough experimental analysis, and the impressive performance of the proposed method.
The reviewers fRrQ, TRJw, and wHsZ raised issues regarding the presentation. The main presentation issue is the use of the terms "anomalies" and "extreme events". We agree that these terms need to be rephrased and more clearly defined, since an extreme event is an anomaly as well. We believe that this also resulted in a misunderstanding of our task and contribution by reviewer fRrQ. In the following, we give a brief definition of the terms "anomalies" and "extreme events" as they were used in the paper:
Extreme events: Examples of extreme events are extreme droughts, floods, or heatwaves. We represent these events by the impact on state variables. For instance, we use extremely low values in vegetation health index (VHI) as an indicator for extreme droughts. We assume that extreme events are reported or can be derived from state variables, i.e., they are observed.
Anomalies: We consider anomalies in atmospheric/hydrological state variables (e.g., temperature or soil moisture) or land-atmosphere fluxes (e.g., evaporation) that are the drivers of extreme events. In other words, we are looking for anomalies a) in variables other than the variable that is used to define a particular extreme event of interest, b) that might occur earlier in time and at a different location than the extreme event, and c) that are drivers of or directly related to the extreme event. This means that not all anomalies that might occur in the atmospheric/hydrological state variables are related to an extreme event. Figure 3 in the PDF of the rebuttal illustrates this.
We agree that the term "Anomalies" is confusing and we will rephrase it as suggested by Reviewer TRJw, who also points out that this issue can be simply addressed. The other presentation issues are minor and we explain in the comments to the individual reviewer how we will address them.
We hope that this response also resolves the misunderstanding of Reviewer fRrQ, who struggled to understand the difference between "anomaly" and "extreme" and consequently misunderstood our work as an approach for extreme event prediction (and thus the motivation of the model design), although our focus is on identifying the drivers of extreme events.
Reviewer 5qpR has some concern about the dependency of the model on the temporal resolution and the applicability of the model to other types of extremes, which we address in our response to 5qpR.
Reviewer TRJw asks for a comparison to interpretable forecasting approaches using integrated gradients. As suggested, we conducted this additional comparison. To this end, we trained two models that predict extreme events directly from the input variables and then applied post-hoc integrated gradients. Both models use the same backbone as our model but without the anomaly detection step. For Integrated Gradients V2, we added a cross attention. For this experiment, we compute the gradient only with respect to predicted extremes and computed a different threshold for each variable separately. The models achieved F1-scores for detecting extremes of 93.09 (Integrated Gradients V1) and 93.97 (Integrated Gradients V2). The F1-scores for "anomalies" (we use the term here for consistency with the submission) on the synthetic data are:
| Model | Val | Test |
|---|---|---|
| Integrated Gradients V1 | 38.14 | 33.11 |
| Integrated Gradients V2 | 35.39 | 34.87 |
| Ours | 82.78 | 80.44 |
Qualitative results are provided in Figure 2 in the PDF of the rebuttal. When we add more interactions between the variables (Integrated Gradients V2), the gradients tend to omit some variables (soil moisture). Both models also have difficulties with the synthetic t2m, which includes red noise by design. These results demonstrate that networks predicting the extremes directly from the input variables utilize much more information, even if it is not correlated with an extreme. It is thus beneficial to introduce a bottleneck into the network that forces it to explicitly identify drivers for extremes. We will include these additional baselines.
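For readers unfamiliar with the baseline, integrated gradients averages the model's gradient along a straight path from a baseline input to the actual input. The following is a minimal numpy sketch on a toy differentiable function, not the models used in the rebuttal; `grad_fn` and the quadratic toy "model" are assumptions for the example.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Post-hoc integrated gradients: average dF/dx along the
    straight path from baseline to x, scaled by the input difference."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint Riemann rule
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# toy "model": F(x) = sum(x**2), so dF/dx = 2*x
x = np.array([1.0, -2.0, 3.0])
attr = integrated_gradients(lambda v: 2.0 * v, x, np.zeros_like(x))
# completeness: attributions sum to F(x) - F(baseline)
```

In the rebuttal's setting, `grad_fn` would be the backpropagated gradient of the extreme prediction with respect to the input fields, and the per-variable thresholds would then binarize the attributions for comparison with the anomaly ground truth.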
The other concerns of reviewer TRJw are addressed in our response to TRJw.
Reviewer wHsZ has a concern about the backbone and its applicability to the global scale. We already compare 6 different backbones, which differ in the number of parameters, in the Appendix (Table 11, page 29) and provide results for additional backbones in the response to wHsZ. Reviewer wHsZ also rates the contribution of the benchmark higher than that of the proposed method, but we do not think that this is a major issue, and we address this point as well as the other raised questions in the response to wHsZ.
The PDF in the rebuttal contains additional figures. Figure 1 shows examples of predicted extreme events on real data and Figure 4 shows examples for the ablation study in Table 2.
We hope that the responses to the reviewers resolve their concerns. We would appreciate if the reviewers reply and give us feedback if the questions have been answered. We are happy to answer further questions.
This paper proposes a novel approach for detecting spatio-temporal anomalies associated with extreme weather events which gives a solid technical contribution particularly in identifying the drivers of extreme events in climate science. The authors have adequately addressed the reviewers' concerns by clarifying the terminology and conducting additional experiments, which improved the overall quality of this paper.