CarbonSense: A Multimodal Dataset and Baseline for Carbon Flux Modelling
We provide a dataset and benchmark for data-driven carbon flux modelling, a vital component of monitoring ecosystem health.
Abstract
Reviews and Discussion
This paper presents a new benchmark dataset, CarbonSense, for predicting carbon flux from geospatial and meteorological data. Two models are evaluated on the new benchmark: 1) an XGBoost model that emulates the SOTA method from prior work, and 2) the EcoPerceiver model proposed in this paper. Domain-relevant evaluation metrics, RMSE and NSE (Nash-Sutcliffe Modelling Efficiency), are reported for each model and broken down by ecosystem type, with statistical tests of the significance of the differences in performance. EcoPerceiver significantly outperforms the XGBoost model, indicating there is room for improvement on the benchmark relative to the previous SOTA with more sophisticated/tailored ML approaches.
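For readers unfamiliar with the two metrics used throughout the thread, a minimal sketch of RMSE and NSE (the data here is purely illustrative, not from CarbonSense):

```python
import numpy as np

def rmse(obs, pred):
    """Root mean squared error between observations and predictions."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def nse(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches the
    mean-of-observations baseline, negative is worse than that baseline."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

# toy example
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 1.9, 3.2, 3.8])
```

The baseline-relative nature of NSE is why strongly negative values (e.g. the WAT rows later in the thread) can appear: the model is doing worse than simply predicting the observed mean.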
Strengths
- The benchmark has multimodal inputs, which helps fill a notable gap in multimodal benchmarks for remote sensing/geospatial ML.
- The paper presents useful and approachable background about the carbon flux modeling problem.
- The benchmark code is designed to allow flexibility in reproducing and extending or modifying the dataset based on user needs.
- The benchmark has a permissive CC-BY license.
- The EcoPerceiver method is well motivated based on the domain-specific carbon flux modeling problem.
- Domain relevant metrics are used for evaluation.
Weaknesses
- The train/test splits are divided by station location, which avoids spatial autocorrelation issues. It seems there could be significant temporal autocorrelation within each split since there are many measurements from the same location. Is temporal autocorrelation a concern?
- The experiments only compare two models, XGBoost and EcoPerceiver. It would be useful to see additional models benchmarked (especially deep learning models) to get a sense of the variation in performance of existing solutions on this benchmark.
- What does the purple vector in Figure 4 represent?
Questions
- The train/test splits are divided by station location, which avoids spatial autocorrelation issues. It seems there could be significant temporal autocorrelation within each split since there are many measurements from the same location. Is temporal autocorrelation a concern?
- What does the purple vector in Figure 4 represent?
- How do other deep learning models (existing models besides the new EcoPerceiver) perform on CarbonSense?
We would like to thank the reviewer for their detailed and insightful comments. We've addressed them pointwise below.
1. Temporal Autocorrelation:
The reviewer raises a very good point regarding temporal autocorrelation. If we understand correctly, the reviewer is referring to how different towers in the training set experience similar climatic conditions over time. If so, this is a valid observation; the model would have difficulty generalizing to a timeframe, say, 20 years in the future, where the climate is quite different.
However, DDCFM (data-driven carbon flux modelling) is typically used to upscale carbon fluxes within the temporal period of the training data, as opposed to projecting future fluxes (which is more in the realm of process-based modelling). This is a very good discussion point, though, and we may add to the limitations section that the model should not be used to predict fluxes using inputs taken long after the training data ends.
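The station-level split under discussion can be sketched as follows (the station IDs and record counts here are synthetic, purely to illustrate why the split avoids spatial leakage while leaving temporal overlap across towers):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy records: each EC tower contributes many consecutive measurements,
# so splitting by row would leak near-duplicate conditions across splits
stations = np.repeat(np.arange(10), 100)  # 10 hypothetical towers x 100 records

# hold out whole towers instead of individual rows
ids = np.unique(stations)
rng.shuffle(ids)
test_stations = ids[:2]

test_mask = np.isin(stations, test_stations)
train_idx = np.where(~test_mask)[0]
test_idx = np.where(test_mask)[0]

# no tower appears in both splits, so spatial leakage is avoided;
# temporal overlap of climatic conditions across towers can still remain
```

This grouped split is what the reviewer's question targets: it removes spatial leakage, but records in train and test can still come from the same years and thus share climate-driven temporal structure.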
2. Comparison to Additional Models:
We acknowledge the reviewer’s suggestion to include additional models for comparison. Several reviewers have made similar requests, and we have addressed them by running experiments with additional models. For deep learning models, we already provide a comparison to a vanilla transformer model in our ablation studies in Appendix B.6.
For tabular models, we have taken this week to run additional experiments using a random forest as well as a simple linear regression model to give a better idea of the comparative gains in performance. The tables for NSE and RMSE can be found below, and we will include them in the final version of the manuscript.
3. Interpretation of the Purple Vectors in Figure 4:
We apologize for the confusion regarding the purple vectors in Figure 4. These vectors represent the latent arrays of the EcoPerceiver model. A more detailed view of these latent arrays can be seen in Figure 6, but we agree with the reviewer that the purple vectors should be labeled for clarity in Figure 4. This will be corrected in the final version of the manuscript.
NSE Table
| IGBP | Linear Model | Random Forest | XGBoost | Transformer | EcoPerceiver |
|---|---|---|---|---|---|
| CRO | 0.6315 | 0.7292 | 0.8066 | 0.8126 | 0.8482 |
| CSH | 0.5072 | 0.7107 | 0.7510 | 0.7381 | 0.7670 |
| CVM | 0.4282 | 0.5179 | 0.5277 | 0.4809 | 0.5763 |
| DBF | 0.5333 | 0.6875 | 0.7250 | 0.7318 | 0.7547 |
| DNF | 0.2178 | 0.2975 | 0.2803 | 0.2745 | 0.4336 |
| EBF | 0.6381 | 0.7938 | 0.7966 | 0.7464 | 0.8220 |
| ENF | 0.5934 | 0.7375 | 0.7765 | 0.7154 | 0.7694 |
| GRA | 0.6264 | 0.7258 | 0.7461 | 0.6803 | 0.7967 |
| MF | 0.6043 | 0.7250 | 0.7559 | 0.7316 | 0.7717 |
| OSH | 0.0585 | 0.4113 | 0.5451 | 0.5050 | 0.6060 |
| SAV | 0.1632 | 0.4174 | 0.5802 | 0.5288 | 0.7368 |
| SNO | -0.6223 | -0.0130 | -0.0370 | -0.1229 | 0.2898 |
| WAT | -32.6151 | -27.8940 | -11.0524 | -9.6845 | -14.4010 |
| WET | 0.0976 | 0.2508 | 0.4530 | 0.4138 | 0.4137 |
| WSA | 0.4946 | 0.5575 | 0.6132 | 0.5560 | 0.6267 |
RMSE Table
| IGBP | Linear Model | Random Forest | XGBoost | Transformer | EcoPerceiver |
|---|---|---|---|---|---|
| CRO | 4.4698 | 3.8319 | 3.2381 | 3.1873 | 2.8677 |
| CSH | 2.1417 | 1.6411 | 1.5224 | 1.5613 | 1.4709 |
| CVM | 6.0688 | 5.5726 | 5.5157 | 5.7824 | 5.2236 |
| DBF | 5.3360 | 4.3661 | 4.0959 | 4.0451 | 3.8678 |
| DNF | 4.2721 | 4.0485 | 4.0974 | 4.1143 | 3.6322 |
| EBF | 6.1426 | 4.6365 | 4.6050 | 5.1420 | 4.3070 |
| ENF | 3.7959 | 3.0497 | 2.8141 | 3.1755 | 2.8579 |
| GRA | 3.9406 | 3.3759 | 3.2487 | 3.6451 | 2.9059 |
| MF | 4.9190 | 4.1002 | 3.8633 | 4.0511 | 3.7361 |
| OSH | 2.7043 | 2.1384 | 1.8796 | 1.9609 | 1.7475 |
| SAV | 2.3315 | 1.9455 | 1.6514 | 1.7497 | 1.3070 |
| SNO | 1.7876 | 1.4126 | 1.4291 | 1.4873 | 1.1816 |
| WAT | 5.3247 | 4.9366 | 3.1838 | 3.0019 | 3.5802 |
| WET | 2.8352 | 2.5834 | 2.2073 | 2.2851 | 2.2830 |
| WSA | 2.8752 | 2.6903 | 2.5153 | 2.6952 | 2.4706 |
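One way to probe whether the per-IGBP differences between two models are systematic is a paired test over ecosystem types, using the NSE values from the table above. This is only a sketch with a paired Wilcoxon signed-rank test, which is not necessarily the paper's own significance procedure, and treating IGBP classes as equally weighted pairs ignores their very different data volumes:

```python
import numpy as np
from scipy.stats import wilcoxon

# per-IGBP NSE values copied from the table above (CRO through WSA)
xgb = np.array([0.8066, 0.7510, 0.5277, 0.7250, 0.2803, 0.7966, 0.7765,
                0.7461, 0.7559, 0.5451, 0.5802, -0.0370, -11.0524, 0.4530, 0.6132])
eco = np.array([0.8482, 0.7670, 0.5763, 0.7547, 0.4336, 0.8220, 0.7694,
                0.7967, 0.7717, 0.6060, 0.7368, 0.2898, -14.4010, 0.4137, 0.6267])

# two-sided paired Wilcoxon signed-rank test across the 15 ecosystem types
stat, p = wilcoxon(eco - xgb)
```

EcoPerceiver improves NSE in 12 of the 15 classes, and the paired test over classes comes out significant at the 5% level despite the large WAT outlier, since the rank-based statistic is insensitive to that difference's magnitude.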
I appreciate the addition of more model comparisons, but these are still quite simple models - my question was about other deep learning models. I'm assuming that the expectation is that future methods developed to improve on this benchmark will be deep learning variants like EcoPerceiver and the vanilla Transformer. What other deep learning models would be sensible to compare here, e.g. an MLP or CNN?
We completely agree that testing more deep learning models on this problem would be really beneficial for the community. This is in fact exactly why we have formulated our work as consisting of two key contributions: (1) creating a benchmark, and (2) creating an initial deep-learning-based model which outperforms both the current state-of-the-art techniques and a simpler deep learning baseline (a vanilla transformer).
The multimodality and sparsity of the data make it challenging to model quickly and easily with an off-the-shelf deep learning solution such as a standard CNN, so innovative new architectures are required and we truly hope that the existence of this benchmark will encourage many more deep learning researchers to develop solutions to this important problem.
Our motivation with this work was to spark more interest in this problem domain by making it easier for the deep learning community to use this kind of data and by showing that a deep learning approach with non-trivial technical innovation is indeed capable of pushing the state of the art in modelling this problem. If you agree that it would be good to have more people in the ICLR community attacking this problem with more models using our benchmark, we hope that you will consider increasing your score so that this work is more likely to be accepted and gain more visibility by being presented at the conference.
The authors present CarbonSense, a machine-learning-ready multimodal dataset for carbon flux modelling. The dataset includes meteorological variables, 81-pixel MODIS satellite observations, and eddy covariance measurements of carbon fluxes from hundreds of locations globally. In addition to the proposed dataset, the authors present a baseline machine learning model to predict carbon fluxes. The model shows the power of the multimodality of the dataset to improve carbon flux modelling performance.
Strengths
- Paper is well written and cohesive. It is easy to follow and understand even for non-experts in the field
- Adds a clear contribution to existing datasets in terms of scale and modalities added. This will definitely help progress in the field.
- Dataset will be open for anyone to use. This is important for it to make any impact.
Weaknesses
- The satellite imagery added is very low resolution, which limits its potential usefulness.
- The dataset does not include many observations outside developed countries. There is nothing the authors can do about this, since they leverage the existing EC stations available.
Questions
- Consider adding satellite data from Sentinel-2 or Landsat, which are higher resolution, for the days it is available. These datasets are open for anyone to use.
- Is it possible to keep expanding the dataset in a systematic way by ingesting EC and MODIS data periodically?
We thank the reviewer for their feedback and suggestions. We have addressed the concerns pointwise below.
1. Satellite Imagery Resolution:
We acknowledge the reviewer’s concern about the low spatial resolution of MODIS imagery. This is the tradeoff we made in order to have more frequent images; since the model ingests 36 hours of data in a forward pass, it generally receives 1-2 MODIS images per pass, allowing the model to track changes more consistently.
Sentinel 2 has a temporal resolution of ~5 days, and Landsat ~16 days, so we excluded them for now as most samples would not contain this imagery, and it would severely increase the dataset footprint. This also does not account for cloud cover which further restricts usable imagery.
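The tradeoff can be made concrete with a back-of-the-envelope calculation using the revisit intervals mentioned above (the ~daily MODIS cadence is an assumption consistent with the 1-2 images per 36-hour window stated here; cloud cover is ignored):

```python
# expected number of images falling inside one 36-hour context window,
# given an average revisit interval per sensor (hours)
window_h = 36
revisit_h = {"MODIS": 24, "Sentinel-2": 5 * 24, "Landsat": 16 * 24}

expected = {sat: window_h / h for sat, h in revisit_h.items()}
# MODIS gives ~1.5 images per window; Sentinel-2 ~0.3; Landsat ~0.1,
# so most Sentinel-2/Landsat samples would contain no image at all
```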
However, as part of a follow-up study, we are planning to expand the dataset by adding Phenocam data which is collected every 30 minutes at ~100 of the sites included in CarbonSense. This is a large undertaking, so we anticipate releasing this feature in a future version (V2) of CarbonSense.
2. Periodic Expansion of the Dataset:
We also appreciate the suggestion to periodically expand the dataset by ingesting new EC and MODIS data. This is an idea we are considering, as it could turn CarbonSense into a continually improving resource. However, implementing this would also require buy-in from the underlying networks (Ameriflux / FLUXNET / ICOS), and a large amount of funding to ensure the continued support of necessary infrastructure. We have recently been approved for grants to extend this project, so continual integration of new data is being considered as a possible direction.
Thank you for your response! PhenoCam would be great add and I understand periodic expansion should be something for the future.
Our goal is ultimately to provide as much data as we can so that future researchers will be able to pick subsets of data appropriate to their application / model architecture. Given this, we were wondering if you might consider revising your score to reflect the positive aspects of our contributions; we feel that getting CarbonSense accepted will go a long way toward exposing the deep learning research community to more climate-focused domain applications.
This paper proposes a standardized dataset called CarbonSense, which is a dataset compiled from various sources. The compilation steps include fusing the data together on the same hourly scale, extracting relevant features, and doing min max normalization. Along with the dataset, the paper also proposes a data driven model based on Perceiver called EcoPerceiver. This proposed model uses transformer architecture to cross attend all input features. The paper also implemented a SOTA approach baseline and compared its performance on the proposed dataset with EcoPerceiver.
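The min-max normalization step mentioned in this summary can be sketched as follows (feature values are illustrative; the hedge that statistics must come from the training split only is standard practice, not a detail confirmed by the thread):

```python
import numpy as np

def minmax_fit(x):
    """Per-feature min/max, computed on the training split only
    so that normalization statistics do not leak test information."""
    return x.min(axis=0), x.max(axis=0)

def minmax_apply(x, lo, hi):
    """Scale each feature to [0, 1]; constant features are left at 0."""
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

# toy feature matrix: rows are samples, columns are features
train = np.array([[0.0, 10.0],
                  [5.0, 30.0],
                  [10.0, 20.0]])
lo, hi = minmax_fit(train)
scaled = minmax_apply(train, lo, hi)  # each column now spans [0, 1]
```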
Strengths
CarbonSense integrates diverse data modalities—measured carbon fluxes, meteorological predictors, and satellite imagery—across a wide array of ecosystems. Researchers can use this dataset as a standardized benchmark.
Weaknesses
1. The dataset was compiled from multiple sources with various modalities, which may introduce inconsistency or OOD samples during model training. Careful data analysis could be helpful.
2. The experiments show the proposed EcoPerceiver outperformed the current SOTA approach for most IGBP types, especially WET, WAT, and ENF. However, the paper did not include an ablation study to show why the proposed model achieved this performance.
3. There is only one baseline compared, and there is no model with single modalities.
Questions
How does the Perceiver model compare to other advanced machine learning models in terms of performance? In addition, which component of EcoPerceiver affects performance the most? I think adding more machine learning model baselines and conducting ablation study would help determine the model's relative strengths and weaknesses.
We would like to thank the reviewer for their valuable feedback. Below, we provide clarifications and additional information to address the points raised.
1. Ablation Study:
We would like to clarify that we do provide extensive ablation studies in Appendix B.6 of the manuscript, where we compare model performance with various architectural components removed, and experiment with different context window lengths.
2. Comparison to Additional Baselines:
We agree with the reviewer that comparing EcoPerceiver to additional baseline models could further strengthen the study. In the ablation section, we compare EcoPerceiver to a vanilla transformer model, as we believe this comparison is the most relevant given the data sparsity and multimodality.
For tabular models, we have taken this week to run additional experiments using a random forest as well as a simple linear regression model to give a better idea of the comparative gains in performance. The tables for NSE and RMSE can be found below, and we will include them in the final version of the manuscript.
NSE
| IGBP | Linear Model | Random Forest | XGBoost | Transformer | EcoPerceiver |
|---|---|---|---|---|---|
| CRO | 0.6315 | 0.7292 | 0.8066 | 0.8126 | 0.8482 |
| CSH | 0.5072 | 0.7107 | 0.7510 | 0.7381 | 0.7670 |
| CVM | 0.4282 | 0.5179 | 0.5277 | 0.4809 | 0.5763 |
| DBF | 0.5333 | 0.6875 | 0.7250 | 0.7318 | 0.7547 |
| DNF | 0.2178 | 0.2975 | 0.2803 | 0.2745 | 0.4336 |
| EBF | 0.6381 | 0.7938 | 0.7966 | 0.7464 | 0.8220 |
| ENF | 0.5934 | 0.7375 | 0.7765 | 0.7154 | 0.7694 |
| GRA | 0.6264 | 0.7258 | 0.7461 | 0.6803 | 0.7967 |
| MF | 0.6043 | 0.7250 | 0.7559 | 0.7316 | 0.7717 |
| OSH | 0.0585 | 0.4113 | 0.5451 | 0.5050 | 0.6060 |
| SAV | 0.1632 | 0.4174 | 0.5802 | 0.5288 | 0.7368 |
| SNO | -0.6223 | -0.0130 | -0.0370 | -0.1229 | 0.2898 |
| WAT | -32.6151 | -27.8940 | -11.0524 | -9.6845 | -14.4010 |
| WET | 0.0976 | 0.2508 | 0.4530 | 0.4138 | 0.4137 |
| WSA | 0.4946 | 0.5575 | 0.6132 | 0.5560 | 0.6267 |
RMSE
| IGBP | Linear Model | Random Forest | XGBoost | Transformer | EcoPerceiver |
|---|---|---|---|---|---|
| CRO | 4.4698 | 3.8319 | 3.2381 | 3.1873 | 2.8677 |
| CSH | 2.1417 | 1.6411 | 1.5224 | 1.5613 | 1.4709 |
| CVM | 6.0688 | 5.5726 | 5.5157 | 5.7824 | 5.2236 |
| DBF | 5.3360 | 4.3661 | 4.0959 | 4.0451 | 3.8678 |
| DNF | 4.2721 | 4.0485 | 4.0974 | 4.1143 | 3.6322 |
| EBF | 6.1426 | 4.6365 | 4.6050 | 5.1420 | 4.3070 |
| ENF | 3.7959 | 3.0497 | 2.8141 | 3.1755 | 2.8579 |
| GRA | 3.9406 | 3.3759 | 3.2487 | 3.6451 | 2.9059 |
| MF | 4.9190 | 4.1002 | 3.8633 | 4.0511 | 3.7361 |
| OSH | 2.7043 | 2.1384 | 1.8796 | 1.9609 | 1.7475 |
| SAV | 2.3315 | 1.9455 | 1.6514 | 1.7497 | 1.3070 |
| SNO | 1.7876 | 1.4126 | 1.4291 | 1.4873 | 1.1816 |
| WAT | 5.3247 | 4.9366 | 3.1838 | 3.0019 | 3.5802 |
| WET | 2.8352 | 2.5834 | 2.2073 | 2.2851 | 2.2830 |
| WSA | 2.8752 | 2.6903 | 2.5153 | 2.6952 | 2.4706 |
As we approach the end of the review period, we hope the additional information and experiments we've provided have addressed your concerns. If so, we kindly ask that you consider adjusting your score to reflect these updates. We hope that having this work accepted will go a long way toward sparking interest in this problem domain for the deep learning community, leading to better models and a better understanding of our biosphere.
The paper presents an ML-ready dataset for carbon flux modelling that includes multiple modalities (measured fluxes, satellite images, other predictors), making it a nice, comprehensive dataset for use by the community. Further, they present a novel transformer model (based on the Perceiver architecture) to act as a baseline for this dataset, and also show ablations of this model as well as comparisons to simple ML methods (linear regression, XGBoost, etc.).
The main strength is the comprehensive dataset, and all reviewers are in agreement regarding its utility for the community. The Perceiver architecture is also well motivated as a strong baseline for handling multiple modalities and missing data. The main weakness is the limited set of deep learning models used for evaluation of the dataset, and a related limited analysis of the data itself (due to multiple modalities, missing data, data imbalance). While the second weakness is partly acknowledged by the authors in their limitations, an in-depth analysis of these effects could be useful: for example, different DL models (conv-based, MLP-based, etc.) may be more or less susceptible to this heterogeneity or be better choices, and having them as part of the baselines may prove beneficial, as would an analysis of why they are better or worse.
Additional Comments from Reviewer Discussion
The reviewers mainly raised the limited evaluation of the models. All reviewers agreed on the value of the dataset. Reviewers raised two main concerns: limited baselines/evaluations and limited ablations. For the limited evaluation, the authors added linear regression, random forest, XGBoost, and a vanilla transformer. The first three are very simplistic models. More deep learning models as baselines would be useful, and relating the choice of these baselines to the heterogeneity of the dataset (multiple modalities, missing data, imbalanced data) would make the contributions stronger. For ablations, the authors have ablations on components of the Perceiver in the appendix.
Reviewer ZVzD also raised the concern that the satellite images are of low resolution, which could hinder conclusions, but the authors' response on the trade-off between more temporal snapshots and higher resolution to keep the dataset size manageable seems fair.
Accept (Poster)