Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery
We introduce a large benchmark dataset for pinpointing abandoned oil and gas wells, which are a significant contributor to climate change and pollution.
Abstract
Reviews and Discussion
The paper presents Alberta Wells, a large-scale benchmark dataset for pinpointing oil and gas wells, comprising over 210,000 wells across three classes (abandoned, suspended, and active), and frames well identification as an object detection and binary segmentation problem. To create a well-distributed dataset, the paper introduces a clustering-based splitting algorithm that maintains an equal distribution of well and non-well images and yields more diverse validation and test sets. In the experiments, the paper evaluates well-known baseline models for binary segmentation and object detection on the Alberta Wells Dataset, establishing a new benchmark and validating the value of NIR imagery and of including multiple well types.
Questions for Authors
- I wonder if it is possible to address the influence of false negatives in the dataset due to missing well locations in the Alberta Energy Regulator's data.
- In Table 7, why does U-Net show a notable drop in Precision from 0.998 to 0.913 when trained on all three well types in the dataset?
Claims and Evidence
The authors present a comprehensive procedure for data collection, dataset splitting, and label creation. In addition, the detailed description of the experiments and the quantitative results for both segmentation and detection tasks support the claims.
Methods and Evaluation Criteria
The authors select a series of well-known baseline models for both segmentation and detection tasks, and they provide a thorough evaluation using standard metrics such as IoU, mAP, precision, recall, and F1-score. The dataset splitting algorithm, which is explained in detail as pseudo-code, ensures a diverse and representative distribution of wells across different geographical regions. The value provided by the inclusion of NIR imagery and of two additional well types is demonstrated through comparative experiments.
Theoretical Claims
The paper makes few theoretical claims, focusing instead on the practical application of deep learning models to a real-world problem and a new dataset. The theoretical foundations of the models used are well-established in the literature, and the paper does not attempt to extend these theories. Instead, it focuses on the evaluation of these models on a novel dataset.
Experimental Design and Analysis
This paper tests a series of well-known models for semantic segmentation and object detection on the Alberta Wells Dataset and analyzes their performance differences in terms of architectural features to set a benchmark. The evaluation of models is comprehensive with respect to IoU, Precision, Recall, and F1-Score. The authors also consider the value provided by the inclusion of NIR imagery and of all three well types. However, the paper does not provide a detailed discussion of the potential impact of label noise on the results.
Supplementary Material
The supplementary material includes additional experiments and visualizations, which further introduce the dataset and illustrate the diverse distribution of wells. The analysis of the impact of NIR imagery is also valuable.
Relation to Prior Literature
The paper sits well within the broader remote sensing and machine learning for environmental monitoring literature. The authors reference previous work on oil and gas infrastructure detection and they highlight the novelty of their dataset in terms of its scale and focus on abandoned wells, which contributes to the work on using remote sensing and machine learning for climate change mitigation.
Essential References Not Discussed
The references in this article are adequate and specific. The paper fully describes the shortcomings of existing datasets for oil and gas infrastructure detection and introduces the role that existing models can play on remote sensing imagery.
Other Strengths and Weaknesses
Strengths:
- The introduction of a large-scale, high-quality dataset for an impactful problem related to climate change and greenhouse gas emissions.
- Thorough evaluation of well-known models, with clear results and insights, to set a benchmark.
- Inclusion of NIR imagery and comparison of models trained on different well types to validate the value of the dataset.
Weaknesses:
- Limited discussion of the potential impact of label noise on the results.
- The dataset is limited to Alberta, which may limit its generalizability to other regions.
Other Comments or Suggestions
The paper is well-written and provides a valuable contribution to the field. However, adding transfer learning experiments from this dataset to other regions would better validate its generalizability.
Thank you very much for the thoughtful and constructive review. We appreciate your recognition of the dataset’s scale, quality, and importance for climate-relevant applications, as well as your positive assessment of our methodological rigor and experimental design.
Below, we address your questions in further detail. Please let us know if there are other points you would like us to clarify.
Generalization beyond Alberta
Thank you for raising this point. While we do not claim that models trained on Alberta data will generalize zero-shot to other geographies, we believe Alberta provides a valuable and representative testbed for developing and benchmarking detection models. It is also an especially impactful region in its own right, given its position as one of the largest oil-producing regions globally. That said, we fully agree on the importance of broader applicability, and future iterations of the benchmark will include other regions to test transfer learning, as you suggest.
Impact of AER Record Incompleteness / False Negatives
Thank you for raising this important point. Like many regulatory sources, the Alberta Energy Regulator (AER) dataset is known to have occasional omissions—particularly for older, undocumented, or improperly decommissioned wells. While our dataset reflects the most complete and authoritative ground truth publicly available (from the AER ST37 database), we acknowledge the presence of potential false negatives stemming from unrecorded sites.
That said, one of our core motivations is to address exactly this gap: enabling ML models trained on known wells to help identify undocumented or potentially missing ones. In that sense, false negatives in the label set may represent real-world deployment opportunities for models rather than just noise in the training data. We are currently exploring whether it is possible to examine by hand a subset of “false” well detections from our models to see whether these in fact represent omissions in the dataset. We can endeavor to have this ready for a camera-ready version.
Precision Drop in U-Net (Table 7)
Thank you for noticing this. The drop in precision from 0.998 to 0.913 when training on all well types (compared to only active wells) is due to a tradeoff between generality and specificity. Including all three types (active, suspended, abandoned) broadens the model’s learning target, introducing more subtle and ambiguous visual patterns—especially for older, overgrown, or decommissioned sites. This leads to slightly more false positives in some regions, but substantially improves recall, F1-score, and class-wise generalization, as demonstrated in our class-specific breakdown.
This paper introduces the Alberta Wells Dataset, the first large-scale benchmark dataset for detecting oil and gas wells from satellite imagery. The dataset contains over 213,000 wells (abandoned, suspended, and active) across Alberta, Canada, represented in high-resolution (3m/pixel) multi-spectral satellite imagery from Planet Labs.
The authors frame well detection as both binary segmentation and object detection tasks, providing appropriate annotations for each approach. They evaluate several deep learning architectures as baselines, including U-Net, UperNet, FCOS, and SSD Lite. Their experiments demonstrate that including near-infrared spectral bands significantly improves detection performance compared to RGB-only data, and that training on all well types outperforms training on active wells alone.
The paper introduces a novel dataset splitting algorithm that ensures geographical diversity across training, validation, and test sets. Quality control was performed with domain experts to refine the regulatory data from the Alberta Energy Regulator.
The dataset addresses a significant environmental challenge - abandoned wells that leak methane into the atmosphere and toxic compounds into groundwater - by providing a resource for developing algorithms to detect wells that may not appear in official records.
Update after rebuttal
Thank you for your thoughtful rebuttal. I maintain my original recommendation of a weak accept. The Alberta Wells Dataset itself represents a valuable contribution to the field, particularly in addressing an important environmental challenge through the collection and annotation of high-resolution multi-spectral imagery. The dataset's creation and quality control represent significant work that will benefit the research community, which justifies acceptance despite the experimental limitations. The additional DETR results showing strong localization performance are encouraging, and I appreciate the authors' commitment to include failure case analysis in the camera-ready version. While I still believe that multi-class modeling and transfer learning to publicly available imagery would greatly strengthen the practical impact, these could be addressed in future work as the authors suggest. The core contribution of the dataset itself remains valuable and worthy of publication.
Questions for Authors
None
Claims and Evidence
The key claims in the paper are adequately supported by evidence: The dataset's scale (213,000+ wells) and composition are well-documented with detailed statistics. The performance benefits of multi-spectral imagery over RGB-only data are clearly demonstrated in Table 6, showing improved IoU and F1 scores. Similarly, Table 7 provides convincing evidence that training on all well types outperforms training on active wells alone.
However, several claims lack sufficient supporting evidence:
- The claim about Alberta's geographical diversity being sufficient for generalization is not validated with any cross-region testing.
- The comparative performance of different architectures is inconclusive since hyperparameter tuning is not covered.
- The authors don't provide evidence about the detection performance specifically for abandoned wells.
Methods and Evaluation Criteria
Appropriate aspects:
- Using both segmentation and object detection approaches makes sense given the nature of the task.
- Standard evaluation metrics (IoU, F1-score, mAP) are appropriate for these computer vision tasks
- The dataset splitting algorithm ensuring geographical diversity is well-designed
- Comparing RGB vs. RGB+NIR performance directly addresses the value of multi-spectral data
Limitations:
- The lack of hyperparameter optimization makes the architecture comparisons inconclusive.
- Limited data augmentation techniques (only resizing and basic flipping) don't address the full range of appearance variations.
- No evaluation of performance specifically on abandoned wells, despite their environmental significance.
- Absence of error analysis or visualization of failure cases to understand model limitations
Theoretical Claims
The paper does not make formal theoretical claims requiring mathematical proofs. This is primarily an empirical contribution focused on dataset creation and baseline benchmarking rather than theoretical advancement.
Experimental Design and Analysis
The experiments are generally sound. I list the following issues:
- Multi-class analysis absence: Despite creating multi-class annotations (active/suspended/abandoned), the authors only performed binary detection, limiting insights about performance on environmentally critical abandoned wells specifically.
- Limited data augmentation: Only basic resizing and flipping were used, neglecting more sophisticated augmentations.
- No transfer learning evaluation: The experiments don't assess if models trained on high-resolution commercial imagery can generalize to publicly available imagery, limiting practical applicability.
- Missing error analysis: No breakdown of performance by well type or geographical context is provided, offering limited understanding of where models succeed or fail.
Supplementary Material
No
Relation to Prior Literature
Oil and gas infrastructure detection datasets:
- The paper advances prior work by creating a dataset orders of magnitude larger than existing ones. Previous datasets like NEPU (Wang et al., 2021) contained just 1,192 wells, and even larger collections like the Well Pad Dataset (Ramachandran et al., 2024) only included 12,490 wells, compared to this paper's 213,447 wells.
- Benchmark datasets in remote sensing: The paper follows the model of other domain-specific remote sensing benchmarks like BigEarthNet for land use classification (Sumbul et al., 2019) and CropHarvest for agriculture (Tseng et al., 2021).
- Methane emissions detection: This work complements recent efforts to identify methane sources, such as the METER-ML dataset (Zhu et al., 2022), by focusing on the infrastructure that may be emitting methane.
- Computer vision for climate action: The work extends the growing body of research applying computer vision to climate challenges, particularly those focused on monitoring fossil fuel infrastructure.
Essential References Not Discussed
None
Other Strengths and Weaknesses
The biggest contribution of the paper is the acquisition & processing of (commercial) Planet Labs imagery and pairing it with quality-controlled labels. Details:
- Access to premium data: Planet Labs' high-resolution (3m/pixel) multi-spectral imagery is commercial and not freely available like Landsat or Sentinel.
- Weak supervision at scale: The authors essentially created a weak supervision pipeline by combining the Alberta Energy Regulator's records with the satellite imagery, then having domain experts refine and validate the dataset.
- Processing and standardization: They've done the heavy lifting of processing commercial imagery (standardizing well site diameter annotations at 90 meters, managing cloud cover, ensuring temporal alignment, etc.) which saves other researchers substantial effort.
The limitations of the paper:
- Problem framing: avoiding multi-class modeling conflicts with the different purposes of such data. The value of abandoned-well detection lies in immediate regulatory action, environmental emergency management, and public health protection, while the value of active-well detection lies in transparency and public awareness, energy production monitoring, and regulatory compliance verification.
- Modeling: notable misses are class balancing, consistency regularization, experimenting with more data augmentation techniques (like crop-zoom, noising, color jitter, rotation, etc.), and hyperparameter tuning for each architecture.
- Evaluation: lack of breakdown by well type & error analysis (failure modes).
- Impact: Environmental agencies and researchers working across multiple regions can't realistically purchase Planet imagery for large-scale monitoring; creating models that only work on premium data means the approach can't be maintained long-term without significant ongoing funding. Without demonstrating performance on public imagery (Landsat, Sentinel, etc.), there's no clear path from this research to practical deployment.
Other Comments or Suggestions
Suggestions:
- Use high-resolution Planet data to train teacher multi-class segmentation models.
- Generate pseudo-labels on the full dataset.
- Train student models that work with freely available data (e.g., Sentinel-1/2).
- Evaluate the performance trade-offs and scale predictions.
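A minimal outline of this suggested teacher-student pipeline is sketched below (purely illustrative; the encoder choices, Sentinel band count, and the assumption of co-registered Planet/Sentinel patches are placeholders, not details from the paper):

```python
import torch
import segmentation_models_pytorch as smp

# Teacher: trained on 4-band (RGB+NIR) 3 m Planet patches using the dataset's labels.
teacher = smp.Unet(encoder_name="resnet50", in_channels=4, classes=1)
teacher.eval()
# Student: consumes coarser, freely available Sentinel-2 bands (band count assumed here).
student = smp.Unet(encoder_name="resnet34", in_channels=4, classes=1)

def distill_step(planet_patch, sentinel_patch, optimizer, threshold=0.5):
    """One pseudo-labelling step: the teacher predicts on a Planet patch, and its
    binarized prediction supervises the student on the co-located Sentinel patch.
    `optimizer` is assumed to be built over student.parameters()."""
    with torch.no_grad():
        pseudo = (torch.sigmoid(teacher(planet_patch)) > threshold).float()
        # Resample pseudo-labels to the student's spatial resolution.
        pseudo = torch.nn.functional.interpolate(pseudo, size=sentinel_patch.shape[-2:])
    logits = student(sentinel_patch)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```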
Thank you for your extremely thorough and thoughtful review. We very much appreciate the helpful and constructive feedback. We respond to specific comments and questions below:
Architecture comparisons and hyperparameter tuning:
Thank you for raising this point. We agree that fully tuning hyperparameters in all experiments could be valuable, but believe our use of standard performant settings during training nonetheless provides meaningful benchmarks, especially given that we did not observe significant sensitivity in our models.
We have obtained additional experimental results on the object detection task for the transformer-based model DETR, using a ResNet50 backbone. As shown below, DETR achieves strong localization performance, particularly at higher IoU thresholds and mAP.
Despite these performance gains, our more lightweight models (e.g. SSD Lite and FCOS) may be more usable in practice, especially for remote sensing practitioners outside of ML.
Table R1 : Object detection results on the test set. We report Intersection over Union (IoU) at thresholds 0.1, 0.3, and 0.5, as well as mean Average Precision (mAP) at IoU = 0.5 and IoU ∈ [0.5, 0.95].
| Architecture | Backbone | IoU_0.1 | IoU_0.3 | IoU_0.5 | mAP_50 | mAP_50:95 |
|---|---|---|---|---|---|---|
| FCOS | ResNet50 | 34.79 ± 0.99 | 48.51 ± 0.59 | 62.66 ± 0.43 | 9.67 ± 1.47 | 30.46 ± 3.11 |
| DETR | ResNet50 | 41.78 ± 0.11 | 51.15 ± 0.14 | 63.17 ± 0.11 | 15.22 ± 0.28 | 38.45 ± 0.31 |
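For reference, a comparable single-class DETR baseline can be set up along the following lines using the Hugging Face implementation (a sketch under our assumptions, e.g. the checkpoint name and head replacement; not necessarily our exact training configuration):

```python
from transformers import DetrForObjectDetection, DetrImageProcessor

# Single "well" class; the pretrained 91-class COCO head is reinitialized.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=1,
    ignore_mismatched_sizes=True,
)

# inputs = processor(images=..., annotations=..., return_tensors="pt")
# outputs = model(**inputs)  # outputs.loss during training, predicted boxes at inference
```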
Multiclass vs binary problem
Thank you for bringing up this point. We agree that there are multiple different use cases (and relevant stakeholders) associated with detection of abandoned vs active wells. However, binary detection is nonetheless very useful, as (i) determination of well types can often be done by hand, as detection is the more time-intensive step, and (ii) where centralized records for wells exist, active wells are more likely to be documented than abandoned wells, meaning that in many cases newly discovered wells are likely to be abandoned. There is also considerable noise in well status labels, and the visual distinctions between the different classes are blurry (e.g. a recently abandoned well may be difficult to distinguish from a suspended or active well).
Data augmentation
We agree that further data augmentation experiments could be helpful. We will aim to evaluate several other simple techniques in the camera-ready version, in addition to resizing and flipping.
Accessibility of Planet data
Thank you for raising this excellent point. We agree that for certain users and application settings, Planet data will be less accessible than Landsat / Sentinel imagery. However, a large number of research institutions already have subscriptions to Planet imagery. Furthermore, for many use cases, a user may wish to target a relatively narrow area (e.g. abandoned wells within a regional jurisdiction that can be targeted for plugging). For some stakeholders, it also seems likely that applying algorithms to very large areas will be limited by computational constraints (given the relatively high resolution of the images), independent of data access.
We very much appreciate your suggestion to use Planet data to train teacher models, then derive student models for lower resolution publicly available data. This sounds like a fruitful follow-up paper, which we would be happy to mention in the conclusion.
Failure analysis:
Thank you for raising this point. In the camera-ready version, we will include representative failure cases from visually complex scenarios. Most patches contain only 1–5 wells, and we observe performance degradation in rare, high-density regions. Abandoned and suspended wells—due to their subtle visual signatures and lack of surface infrastructure—can be especially challenging to detect. Enhancing model performance in these edge cases is a priority for future work.
This work proposes a large-scale multispectral remote sensing dataset for pinpointing oil and gas wells. The data comes from real scenes, and the authors carefully designed a reasonable data filtering method and data split scheme to ensure the quality of the data. This work proposes binary segmentation and object detection tasks for oil and gas well pinpointing, and trains a variety of mainstream deep learning models to establish a benchmark. This large-scale benchmark makes a significant contribution to the field of climate change mitigation.
Update after rebuttal
Thanks for the rebuttal, which has addressed most of my concerns. I would like to maintain my original rating.
Questions for Authors
None
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
The amount of benchmark data proposed in this work is much larger than that of previous work, and it covers a wider geographical area and a broader range of well characteristics, which means the data distribution is richer.
Theoretical Claims
The 2-step clustering algorithm for dataset split.
This clustering method is based on the consistency between the distribution of oil and gas wells and geographical features. I hope the authors can provide further research demonstrating the effectiveness of this clustering algorithm, or provide a basis for the method.
Experimental Design and Analysis
Binary detection:
Why didn't the authors add a Transformer-based model to the binary detection task for comparison, as was done in the segmentation experiments? Without such a comparison, the statement "performance in the object detection task is overall lower than for segmentation" in the analysis of the detection results on Page 7, Lines 361-363 is not completely rigorous and reliable.
Supplementary Material
The additional experiments and the qualitative results of the sample distribution.
Relation to Prior Literature
Previous works have used Google Earth and Sentinel-2 satellite images to construct remote sensing image datasets for oil and gas well detection. However, these datasets are small in scale, cover a small geographical area, and lack extensive data distribution. This work makes efforts to bridge the gap.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths:
- The paper is well organized and easy to read. The proposed benchmark is targeted at the needs of the field of climate change and makes important contributions to society.
- The dataset fully considers the diversity of data distribution in the real-world scenario, and the proposed dataset has rich distribution in both geographical features and oil well status.
Weaknesses:
- The labeling of the dataset is too coarse. The oil and gas wells are labeled directly with a fixed size, which lacks quality control and is not conducive to the training of deep learning models.
- The images in the paper need to be optimized. For example, the layout of Figure 3 is messy and difficult to read.
Other Comments or Suggestions
If the license allows, I suggest further utilizing the rich gas well information in the metadata when constructing the data, and proposing multimodal oil and gas well pinpointing tasks, such as referring segmentation, which may further contribute to the community.
Ethics Review Issues
None
Thank you for your thoughtful and constructive review. We're grateful for your recognition of the dataset's potential impact on climate change mitigation, as well as for your helpful suggestions. Below are our responses to the points you raised:
Transformer-based models for well detection
Thank you for bringing this up. In our initial detection benchmarks, we focused on lightweight models (like SSD Lite and FCOS) because of their practical deployment relevance. However, we agree it is valuable to include transformer-based detection models for a more comprehensive comparison. We have performed experiments with the transformer-based model DETR, using a ResNet50 backbone. As shown below, DETR achieves strong localization performance, further validating the relevance of transformer-based models for this task.
Table R1 : Object detection results on the test set. We report Intersection over Union (IoU) at thresholds 0.1, 0.3, and 0.5, as well as mean Average Precision (mAP) at IoU = 0.5 and IoU ∈ [0.5, 0.95].
| Architecture | Backbone | IoU_0.1 | IoU_0.3 | IoU_0.5 | mAP_50 | mAP_50:95 |
|---|---|---|---|---|---|---|
| FCOS | ResNet50 | 34.79 ± 0.99 | 48.51 ± 0.59 | 62.66 ± 0.43 | 9.67 ± 1.47 | 30.46 ± 3.11 |
| DETR | ResNet50 | 41.78 ± 0.11 | 51.15 ± 0.14 | 63.17 ± 0.11 | 15.22 ± 0.28 | 38.45 ± 0.31 |
We also appreciate the reviewer’s point regarding the statement on Page 7, Lines 361–363 comparing object detection and segmentation performance. While earlier detection models such as RetinaNet and Faster R-CNN underperformed compared to segmentation counterparts, our updated DETR results demonstrate that transformer-based detectors can help close this gap. We will revise the manuscript to present this comparison more carefully and to avoid overgeneralizing across architectures.
Clustering algorithm
The goal of our clustering algorithm is to allow for a dataset split that avoids spatial autocorrelation across splits while still including many different locations in each split. By construction, our clusters demonstrate spatial coherence, and as shown in Figure 2 they are geographically distributed. (Note that since the wells are unevenly distributed across Alberta, a simple gridding approach would lead to significant imbalances between splits.) While we would be interested in analyzing our algorithm’s efficacy across similar geospatial datasets, we feel this would distract from the principal focus of the present paper and is not necessary to establish that the dataset splits are reasonable.
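For illustration, a minimal sketch of a cluster-based split in this spirit is shown below (not our exact algorithm; the cluster count, split ratios, and greedy assignment rule are placeholders). It clusters well coordinates, then assigns whole clusters to train/val/test so that splits remain spatially disjoint while receiving roughly proportional shares of wells:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_split(coords, n_clusters=50, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole spatial clusters to train/val/test so splits stay spatially disjoint.

    coords: (N, 2) array of well coordinates (e.g., projected x/y in metres).
    Returns an array of split names ("train"/"val"/"test"), one per well.
    """
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(coords)
    cluster_ids, counts = np.unique(labels, return_counts=True)

    # Greedily assign the largest remaining cluster to the split that is
    # furthest below its target share of wells.
    targets = np.array(ratios) * len(coords)
    assigned = np.zeros(3)
    split_of_cluster = {}
    for cid in cluster_ids[np.argsort(-counts)]:
        deficit = targets - assigned
        split = int(np.argmax(deficit))
        split_of_cluster[cid] = split
        assigned[split] += counts[cluster_ids == cid][0]

    names = np.array(["train", "val", "test"])
    return names[[split_of_cluster[c] for c in labels]]

# Example: splits = cluster_based_split(np.random.rand(1000, 2) * 1e5)
```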
Annotation granularity and quality control
Thank you for raising this important point. We use standardized 90m circular masks and square bounding boxes for several reasons. Firstly, well pads are relatively standardized in size and shape; the 90m circular shape is typical, not merely a placeholder. Secondly, the very large size of our dataset would make the creation of human-annotated masks challenging, especially given the expert knowledge required for annotation in many cases. Finally, the imagery we use includes a near-infrared channel (and certain well pads are not visible in RGB alone), which makes human interpretation of the images harder.
Despite the coarse annotations, our results show that modern models achieve strong performance. This aligns with prior work (e.g., Rolnick et al., 2017) showing that deep learning models are robust to modest label noise. We will be exploring partial human annotations in future versions of the dataset, in collaboration with domain experts.
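For concreteness, the sketch below shows how a standardized 90 m circular mask can be rasterized around a well location at 3 m/pixel (illustrative only; the patch size and well coordinates are placeholder assumptions, not released code):

```python
import numpy as np

def well_mask(patch_size=256, wells_px=((120, 80), (40, 200)),
              diameter_m=90.0, resolution_m=3.0):
    """Rasterize fixed-size circular well masks.

    wells_px: iterable of (row, col) pixel coordinates of well heads.
    Returns a binary mask of shape (patch_size, patch_size).
    """
    radius_px = (diameter_m / resolution_m) / 2.0  # 90 m at 3 m/px -> 15 px radius
    rows, cols = np.mgrid[0:patch_size, 0:patch_size]
    mask = np.zeros((patch_size, patch_size), dtype=np.uint8)
    for r, c in wells_px:
        mask |= ((rows - r) ** 2 + (cols - c) ** 2 <= radius_px ** 2).astype(np.uint8)
    return mask
```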
Figure readability
Thank you for the suggestion. We will revise Figure 3 and other figures by reorganizing subpanels, standardizing annotations, and increasing resolution to better highlight qualitative comparisons between the ground truth and predictions.
Use of metadata for multimodal extensions
Thank you for the suggestion. However, due to licensing, we can release imagery but not metadata. We’re actively exploring similar tasks using open data sources (e.g., Sentinel) to support future multimodal benchmarks.
After the discussion phase all reviewers recommended acceptance, noting that the paper is well written and that the proposed dataset/benchmark is a valuable contribution. As a result, the AC decided to accept the paper. Please take the reviewer feedback into account when preparing the camera-ready version.