PaperHub
Score: 6.6/10 · Decision: Poster · 4 reviewers
Ratings: 4, 2, 3, 5 (min 2, max 5, std. dev. 1.1)
ICML 2025

DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications

Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
earth observation, multi-modality, self-supervised learning, cross-modal retrieval

Reviews and Discussion

Official Review (Rating: 4)

The paper proposes DUNIA (Dense Unsupervised Nature Interpretation Algorithm), a method that generates pixel-level embeddings by aligning forest vertical structure information (obtained from space-borne full-waveform LiDAR) with satellite imagery using contrastive learning. The pixel-level nature is what distinguishes DUNIA from related works, which more typically produce patch-sized embeddings. Due to the contrastive learning approach, the resulting embeddings can be used directly for various EO tasks in a zero-shot fashion. Experiments are conducted, comparing e.g. with the recent AnySat approach, which show that DUNIA (in the fine-tuning setting) performs on par with or better than the state of the art on five out of six explored tasks. The embeddings of DUNIA can -- for the first time, to the best of my and the authors' knowledge -- be used to directly generate waveforms representing the forest's vertical structure from pixel inputs.

Updates after rebuttal

I thank the other reviewers and the authors for all their efforts. I have read the other reviews + associated rebuttals + the rebuttal to my review. I think overall that the authors have done a thorough job in addressing concerns (but I of course must leave it to the other reviewers to assess what they think of the responses to their respective reviews), including a significant amount of additional relevant experiments.

The original reviews were quite diverse: 1 strong accept, 2 weak reject, 1 weak accept, with my weak accept being kind of in the middle. I note that after the rebuttal, one of the weak reject reviewers (WDGb) has updated to weakly accepting the paper (instead of weakly rejecting), so it seems the majority of us believe the paper should now be accepted. I therefore score the work as "accept" now (before, "weak accept").

Questions For Authors

  • I could not see statement(s) about future code and/or model availability. Will these be made publicly available? When?

  • Why was the weighted F1 score used, instead of the "equal-weighted" (macro) F1 score? (Or why not both?) Also, can you provide some equal-weighted F1 scores to compare with?

  • What is the runtime for DUNIA vs AnySat? Does this depend a lot on the number of neighbors in the kNN?

  • Would the method provide roughly as good results when doing inference on lower-resolution images (e.g. 128x128) or higher-resolution images (e.g. 512x512), instead of the current default of 256x256?

Claims And Evidence

Yes, I would say that most of the claims, to the best of my understanding, are well backed up; an example of such a claim that is supported is:

  • Claim #1: "In the fine-tuning setting, we show strong low-shot capabilities with performance near or better than state-of-the-art on five out of six tasks."
    • Quality of evidence for claim:
      • I think this is backed up well in the experiments section, see in particular Table 2. There we see that DUNIA obtains best results in 4 tasks, is more or less on par (82.2% vs 82.3%) for one task, and obtains quite a bit worse results (than AnySat) on one task.

One claim that is not quite as well backed up is:

  • Claim #2: In the 2nd column of the first page, it says that related-work approaches "struggle with more complex output like the full vertical structure of vegetation". A similar statement appears in Sec. 2.3, where it is claimed that pixel-level alignment is necessary for dense predictions in EO applications.
    • Quality of evidence for claim:
      • I didn't quite feel that the statement was backed up, e.g. by citing works / results showing that these related works struggle with this. This could perhaps be alleviated by referring the reader, in the text, to the experiments section, where e.g. the comparisons to AnySat support some of this claim (?).
      • Also, I feel that the part about pixel-level alignment being necessary for dense predictions is wrong. Many approaches (including AnySat) are used for dense predictions, despite not having pixel-level alignment.

Methods And Evaluation Criteria

Yes, I would say so, e.g. for these reasons (see further positives also under "Experimental Designs or Analyses"):

  • Good that many tasks (7) are explored, and that the proposed DUNIA obtains strong zero-shot improvements on most tasks, including when comparing with specialized supervised models (and really strong improvements on some -- see Table 1).

  • Continuing on the previous note, also great fine-tuning results (Table 2).

Some comments on the negative side of things:

  • It was not quite clear to me why the standard F1 score was not (also) used as an evaluation criterion, rather than only the weighted F1 (wF1) score.

  • I could not see any reporting of inference runtime speeds (as in actual "wall-clock" time); this would have been good to have e.g. in the supplement (D.2.2). I assume it's roughly in the same order of compute efficiency as e.g. AnySat, but it would be good to compare them. It would be especially interesting to see whether the number of neighbors in the kNN part affects this a lot.

Theoretical Claims

No. I did not see proofs or theoretical claims in the paper.

Experimental Designs Or Analyses

Yes, I had a look at all experiment designs / analyses in the paper (and where needed checked whether some things that appeared missing in the main text were done in the supplement).

Positives:

  • The main results in Tables 1 and 2 seem well thought through. A lot of tasks (7) are explored, and many relevant methods are compared against. DUNIA is best in most cases, and where it is not best, relevant commentary is added (see below).

  • I found the analysis of why performance is worse on "PASTIS" to be insightful and relevant (see the beginning of the left column, p7, where the reasoning is that the variability of the phenological cycles of crops cannot be well captured by a single median composite).

  • I liked that in Supp. D.3.5 an analysis of the embedding sensitivity to horizontal and vertical structures was included, as it is such a core methodological contribution of the paper (and as shown in the corresponding Table 7, the proposed design of cross-modal alignment with vertical structure data is important).

Negatives:

  • I was missing some ablation or similar on the effect / importance of using both S-1 and S-2 data. In the supplement, for example, there could have been room for trying DUNIA while omitting one of the two.

  • While it's good that limitations are made transparent (see the latter part of Sec. 5), I find that the comment on the reliance on timeseries data could have been explored a bit empirically anyway. To the best of my understanding, the image timeseries were processed in such a way that they could in practice be replaced with a single image (albeit with an expectation of worse results). Thus it would have been interesting to see what happens if one uses a single image instead (and it would still be OK if results got much worse).

  • Around L291 in the main paper, it is mentioned that fixed-resolution imagery (256x256 pix) is used during inference. I think it would have been good to provide some insight in the supplement into how results may be affected by using another resolution, as a comparison.

Supplementary Material

I skimmed all of it, and put some extra attention on these parts:

  • D.3.5 (see my commentary on it in previous question's box).

  • D.2.2 (I was looking for inference runtime speeds but did not find them).

  • In general, I looked through all ablations when trying to find results that I found missing in the main paper (see also the previous question's box).

Relation To Broader Scientific Literature

To the best of my knowledge, the related work mentioned looks good and covers the necessary literature. In particular, I think lots of relevant related work is covered at the beginning of page 2 (and it is particularly good that the paper mentions how those works remedy, in various ways, the challenges listed towards the end of the previous page (p.1)), followed also by Sec. 2.1 and 2.2. In addition, an extended set of related work is provided in the supplement.

As for the key contributions of this paper, I think Section 2.3 covers them well. In particular, this work builds on earlier works that develop cross-modal ML-EO methods (e.g. AnySat), but makes a strong contribution relative to prior works in that DUNIA, due to its pixel-level embeddings, can directly approximate forest vertical structure from pixel inputs (for the first time, at least to the best of my knowledge and based on the authors' claims).

Essential References Not Discussed

To the best of my knowledge, no essential references were missed; in particular, no references that made it hard for me to understand this paper were missed.

Other Strengths And Weaknesses

Strengths:

  • I re-iterate this as one of the main strengths of this work: the embeddings of DUNIA can -- for the first time, to the best of my and the authors' knowledge -- be used to directly generate waveforms representing the forest's vertical structure from pixel inputs.

  • I think in general that the design choices of the method have been carefully thought through. One example of many that illustrates this is in Sec. 3.4, where the reasoning about when and why a certain alignment loss is used makes sense (such things are also well ablated in the supplement).

  • It's good that limitations are raised in the discussion section (e.g. admitting the reliance on cases where timeseries are available).

  • The impact statement on p.9 is important and good. It really shows the importance of this line of work.

Weaknesses:

  • Not all parts were quite clear to me, e.g.
    • In Sec. 3.3, I got most of the explanation about the composite image I (from S1 and S2 timeseries). What I did not get was how the model is supposed to perform reasonably well in cases where it does not have composite images but just "simple / plain" images. I.e., if it's trained assuming access to such "privileged information" as composite images, how can it be expected not to suffer a performance drop when that input changes to something less "rich"? EDIT: I keep this weakness, even though it was later made clear (in the listing of limitations in Sec. 5) that this is one of the weaknesses of the approach. But I think this should be more clearly stated earlier as well, as is apparent from the fact that I got confused about it.
    • In the abstract it reads (L33) "... outperform specialized supervised models, even in low-labeled data regimes". This formulation was a bit surprising to me, given my understanding that this is the most expected case and thus not surprising at all (since supervised models often require high-labeled data regimes to work) (?)

Other Comments Or Suggestions

Some minor things such as typos:

  • Be consistent in "Earth observation" vs "Earth Observation". Pick one. Both work.

  • Be consistent in the way Sentinel-1 and -2 are abbreviated; e.g. in the Fig. 2 caption it is written as "S-1 & S2" <-- write it in one of the two ways all the time.

  • I suggest that "words" in math environments not be in italics; e.g. in eq. (2), mse could be written in non-italics (and perhaps capital letters are more common for that) -- see the LaTeX snippet after this list.

  • When referring to figures in the SM (e.g. Fig 5 to 8 right before Sec. 4.2.2), please state that they are in the SM.
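For the math-font suggestion above, a minimal LaTeX illustration (the loss symbol is a hypothetical stand-in for the paper's eq. (2); `\operatorname` requires amsmath):

```latex
% Default math italics reads multi-letter names as a product of variables:
$\mathcal{L}_{mse}$               % renders roughly as L_{m*s*e}
% Upright alternatives:
$\mathcal{L}_{\mathrm{MSE}}$      % \mathrm gives an upright, capitalized label
$\operatorname{MSE}(\hat{y}, y)$  % amsmath operator with correct spacing
```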

Author Response

First, we wish to thank reviewer zT7y for their complete & thorough review of our submission.

1. Unsupported Claims

1. Concern (C): The struggle to directly estimate the full vertical structure

Response (A): Reconstructing the full vertical structure (W) requires modeling a complex distribution P(W|Pix). A pixel embedding (Pix) aggregates spectral information over a 2D footprint but lacks the explicit depth-wise cues needed to resolve W directly (e.g. using a decoder head). In contrast, LDMs learn to iteratively denoise samples toward a plausible W, conditioned on aligned pixel embeddings. Through this process, LDMs implicitly capture the conditional distribution P(W|Pix) in a more expressive manner. We have revised accordingly.
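To make the denoising argument concrete, below is a minimal sketch of DDPM-style ancestral sampling conditioned on a pixel embedding. It illustrates the general mechanism the response describes, not the paper's actual architecture; `denoiser`, its signature, and the noise schedule are hypothetical.

```python
import torch

@torch.no_grad()
def sample_waveform(denoiser, pix_embed, T=1000, shape=(1, 64)):
    """Draw a waveform W ~ P(W | Pix) by iterative denoising (DDPM-style)."""
    betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    w = torch.randn(shape)                     # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(w, t, cond=pix_embed)   # noise prediction, conditioned on Pix
        w = (w - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                              # add stochasticity except at the final step
            w = w + torch.sqrt(betas[t]) * torch.randn_like(w)
    return w                                   # a plausible vertical-structure waveform
```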

2. C: Pixel-level alignment necessary for dense predictions

A: We agree with the reviewer that in the fine-tuning case, pixel-level alignment is not necessary for dense prediction tasks, as usually a decoder is trained on the generated patch-sized embeddings. We revised accordingly.

2. Methods And Evaluation Criteria

1. C: The use of the weighted F1 score

A: We relied on this metric to remain consistent with the performance scores reported in the literature. The authors of the PF dataset used the weighted F1 score due to high class imbalance. For the PASTIS and CLC+ datasets, overall accuracy (OA) has been reported. For these reasons, we opted for the weighted F1 score. Due to lack of time, we only managed to run partial fine-tuning tests. The results indicate that the micro and weighted F1 scores are of the same order of magnitude. The macro F1 score is several percentage points lower (e.g., ~7% lower for the PF dataset). However, model ranking remains the same regardless of the score.
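For reference, the three F1 variants discussed here differ only in how per-class scores are aggregated; with scikit-learn this is just the `average` argument (the toy labels below are made up to show the effect of class imbalance):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]  # imbalanced: class 0 dominates
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

print(f1_score(y_true, y_pred, average="weighted"))  # per-class F1, weighted by class support
print(f1_score(y_true, y_pred, average="macro"))     # plain mean of per-class F1 ("equal-weighted")
print(f1_score(y_true, y_pred, average="micro"))     # computed from global TP/FP/FN counts
```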

2. C: Inference runtime speeds of DUNIA and AnySat

A: Due to design choices and data requirements, AnySat is orders of magnitude slower than DUNIA zero-shot. Below are the wall clock times (in seconds) to generate a ~20x20 km area (~4.19M pixels), assuming a retrieval database containing 256K keys/values for DUNIA and 100 NNs. The test excludes data loading times.

| Model | Forward pass | Retrieval | KNN | Total |
|---|---|---|---|---|
| DUNIA | 2.52 | 0.36 | 1.34 | 4.22 |
| AnySat | 177.37 | - | - | 177.37 |

For a database with 512K keys/values, retrieval increased to 0.4 s. For NN = 200, KNN increased to 1.88 s.
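For context, the three timed stages map onto a retrieval-based zero-shot pipeline roughly as in the sketch below (all names are hypothetical; DUNIA's exact retrieval and aggregation scheme may differ):

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(pix_emb, keys, values, k=100):
    """pix_emb: (N, d) query pixel embeddings ("forward pass" output);
    keys: (M, d) database embeddings; values: (M,) associated targets."""
    q = F.normalize(pix_emb, dim=1)
    kdb = F.normalize(keys, dim=1)
    sim = q @ kdb.T                      # (N, M) cosine similarities -- "retrieval"
    topk = sim.topk(k, dim=1)            # k nearest neighbors         -- "KNN"
    w = torch.softmax(topk.values, dim=1)
    # Similarity-weighted average of the retrieved values (regression case;
    # a classification task would use a weighted vote instead).
    return (w * values[topk.indices]).sum(dim=1)
```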

3. Experimental Designs Or Analyses

1. C: Effect / importance of using both S-1 and S-2 data

A: In the limited time that we had for the rebuttal, we only managed to test on heights and land cover classification (CLC+). Below are the results:

| Dataset | Metric | S-1 only | S-2 only | S-1 & S-2 |
|---|---|---|---|---|
| Heights | rmse | 3.8 | 2.8 | 1.34 |
| CLC+ | wF1 | 90.2 | 90.0 | 90.3 |

The results show that DUNIA leverages both modalities for height estimation and that, in general, using S-2 yields more accurate results than S-1. For land cover classification, using either modality leads to similar performance. In the revised version, we will include the remaining datasets.

2. C: Reliance on timeseries data could anyway have been explored a bit

A: We have revised Sections 3 (Approach) and 4.1.1 (first paragraph) to better reflect this point. More specifically, while image composites such as a median composite may be richer than single-date images, they are still less informative than a full time series, as they only provide a median reflectance value over a given period. On the other hand, even this simple form of aggregation preserves significantly more information than a single-date image. Below are the fine-tuning results for two products (Heights and PASTIS) using S-1 & S-2 imagery acquired at random dates.

| Dataset | Metric | Single-date image | Median composite |
|---|---|---|---|
| Heights | rmse | 1.9 | 1.34 |
| PASTIS | wF1 | 42.3 | 77.0 |

3. C: Issue raised regarding the effect of image resolution

A: We agree with the raised concern. Our tests show that image size has no effect on the quality of the resulting product. Below are performance results for different resolution images. The tests were performed in the zero-shot setting:

| Dataset | Metric | 128x128 | 256x256 | 512x512 |
|---|---|---|---|---|
| Heights | rmse | 2.2 | 2.0 | 2.1 |
| Cover | rmse | 11.6 | 11.7 | 11.7 |
| CLC+ | wF1 | 80.2 | 80.1 | 80.2 |

4. Weaknesses

1. C: How the model is supposed to perform reasonably well in cases where it does not have composite images

A: Please refer to 3.2.

2. C: The statement "outperform specialized supervised models, even in low-labeled data regimes" is wrong.

A: Yes, we agree with the raised concern. We have revised accordingly.

5. Questions For Authors

1. C: statement(s) about future code and/or model availability

A: We thank the reviewer for raising this point. We are currently preparing the code for publication. We will link to the repository in the finalized version.

Other questions

A: For Qs 2,3, and 4, please refer to 2.1, 2.2, and 3.3 respectively.

Reviewer Comment

I thank the other reviewers and the authors for all their efforts. I have read the other reviews + associated rebuttals + the rebuttal to my review. I think overall that the authors have done a thorough job in addressing concerns (but I of course must leave it to the other reviewers to assess what they think of the responses to their respective reviews), including a significant amount of additional relevant experiments.

The original reviews were quite diverse: 1 strong accept, 2 weak reject, 1 weak accept, with my weak accept being kind of in the middle. I feel especially that one of the weak reject reviewers (WDGb) was open to accepting the paper if the weaknesses were well-addressed, so I feel there is a high possibility that there could become a consensus on accepting the paper. But we will see what the others think, too.

Author Comment

Thank you for your comments. Please allow us to clarify a few points:

Regarding reviewer WDGb, although they have not officially commented on our rebuttal, they raised their score from 2 to 3. We interpret this as a sign that they were generally satisfied by our responses to their concerns.

As you noted, reviewer hPct requested several additional experiments, which we performed. Today, reviewer hPct acknowledged our rebuttal, although without providing an answer yet. To summarize the responses to reviewer hPct’s key requests:

  • We compared our model against four specialist canopy height models, added an additional dataset, and demonstrated that we outperform specialists even in the zero-shot setting and using <1% of the labels.

  • We included three additional competing models, despite earlier findings that they perform poorly on forest monitoring tasks. One model wasn't included as it is closed source and not publicly accessible.

  • We evaluated geographical robustness on the suggested dataset and outperformed all baselines.

  • We assessed temporal robustness via ablations and datasets with labels available across years. The results demonstrate the model's stability over time.

These new experiments reinforced our original claims and did not alter our conclusions.

Finally, a central contribution of our work, as highlighted by all reviewers, is the native integration of LiDAR waveform data at the pixel scale. This enables accurate zero-shot classification and full vertical structure estimation—an important extension to the capabilities of remote sensing foundation models. This extension helps fill a critical gap in forest monitoring and supports broader ecological conservation efforts. We also show that integrating LiDAR is crucial for these tasks: our model outperforms five recent and prominent foundation models on vertical structure estimation and demonstrates strong performance on other land cover and land use tasks as well.

We also emphasize that this is a general framework, not a product, and that it is efficient to pretrain with modest compute requirements.

We hope this response clarifies the key points and addresses any remaining concerns. We would be grateful if you would consider this in your final evaluation.

Official Review (Rating: 2)

The paper presents DUNIA (Dense Unsupervised Nature Interpretation Algorithm), an approach for learning pixel-sized embeddings for Earth observation applications through cross-modal alignment between satellite imagery and LiDAR data. The main contributions include a framework that learns dense pixel-level embeddings by aligning forest vertical structure information from LiDAR waveforms with satellite imagery using contrastive learning. This enables both vertical and horizontal structure understanding at the pixel level. The proposed method can perform zero-shot predictions for multiple forest/land monitoring tasks.

Questions For Authors

Please refer to the above responses

Claims And Evidence

The paper's claim that it "often outperforms specialized supervised models" is partially supported by the evidence presented, though the scope of comparison could be broader. While the results demonstrate superior performance in several cases, the limited set of supervised models used for comparison somewhat weakens the generality of this claim. A more comprehensive benchmarking against recent specialized models would provide stronger support for this assertion. A few examples include Lang et al., 2023; Fayad et al., 2023; and Tolan et al., 2024:

  • Lang, Nico, Walter Jetz, Konrad Schindler, and Jan Dirk Wegner. "A high-resolution canopy height model of the Earth." Nature Ecology & Evolution 7, no. 11 (2023): 1778-1789.
  • Fayad, Ibrahim, Philippe Ciais, Martin Schwartz, Jean-Pierre Wigneron, Nicolas Baghdadi, Aurélien de Truchis, Alexandre d'Aspremont et al. "Hy-TeC: a hybrid vision transformer model for high-resolution and large-scale mapping of canopy height." Remote Sensing of Environment 302 (2024): 113945.
  • Tolan, Jamie, Hung-I. Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V. Vo, John Brandt, Justine Spore et al. "Very high resolution canopy height maps from RGB imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar." Remote Sensing of Environment 300 (2024): 113888.

Similarly, while the paper demonstrates promising low-shot learning capabilities by showing good performance with reduced training data, this aspect of the research could be more thoroughly explored. The analysis would benefit from more extensive comparisons across different data regime sizes to better understand the model's behavior with varying amounts of training data. Additionally, there is limited discussion or analysis explaining why the model performs well in low-data scenarios.

Methods And Evaluation Criteria

The paper's robustness testing could be enhanced, particularly regarding the model's behavior across different years. While the current evaluation demonstrates effectiveness within a specific temporal window, there is limited discussion about how the model performs when applied to data from different years, which is crucial for understanding its long-term applicability in Earth observation tasks.

Additionally, concerning the datasets used for land cover classification, while they serve their purpose for the current evaluation, expanding the validation to include more diverse land cover datasets would strengthen the model's claimed generalization capabilities. In addition, I strongly recommend that the authors compare the proposed method with well-established remote sensing foundational models, such as SkySense (CVPR 2024), SatMAE++ (CVPR 2024), and DeCUR (ECCV 2024), using recognized benchmarks such as BigEarthNet, fMoW, and DIOR.

Theoretical Claims

Not Applicable

Experimental Designs Or Analyses

Yes

Supplementary Material

Yes

Relation To Broader Scientific Literature

The paper's contributions advance Earth observation research by building on developments in the field. On the foundation model front, it expands upon work like Scale-MAE (Reed et al., 2023) and DOFA (Xiong et al., 2024) in handling multi-resolution data, while also advancing the multi-source data integration approaches seen in OmniSat (Astruc et al., 2025) and AnySat (Astruc et al., 2024).

Essential References Not Discussed

The paper overlooks several contributions in the field of tree canopy height mapping. Notably absent is a discussion of recent deep learning-based methods developed by Liu et al. (2023) and Lang et al. (2023), as well as important advances in vision transformer applications by Fayad et al. (2023) and Tolan et al. (2024).

  • Liu, Siyu, Martin Brandt, Thomas Nord-Larsen, Jerome Chave, Florian Reiner, Nico Lang, Xiaoye Tong et al. "The overlooked contribution of trees outside forests to tree cover and woody biomass across Europe." Science Advances 9, no. 37 (2023): eadh4097.
  • Lang, Nico, Walter Jetz, Konrad Schindler, and Jan Dirk Wegner. "A high-resolution canopy height model of the Earth." Nature Ecology & Evolution 7, no. 11 (2023): 1778-1789.
  • Fayad, Ibrahim, Philippe Ciais, Martin Schwartz, Jean-Pierre Wigneron, Nicolas Baghdadi, Aurélien de Truchis, Alexandre d'Aspremont et al. "Hy-TeC: a hybrid vision transformer model for high-resolution and large-scale mapping of canopy height." Remote Sensing of Environment 302 (2024): 113945.
  • Tolan, Jamie, Hung-I. Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V. Vo, John Brandt, Justine Spore et al. "Very high resolution canopy height maps from RGB imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar." Remote Sensing of Environment 300 (2024): 113888.

Furthermore, while the paper focuses on single-year height estimation, it fails to acknowledge important work on multi-year time series analysis by Dixon et al. (2025), Kacic et al. (2023), and Turubanova et al. (2023). Including these references would provide a more comprehensive understanding of how the proposed approach advances or differs from existing temporal analysis methods in forest structure mapping.

  • Dixon, Dan J., Yunzhe Zhu, and Yufang Jin. "Canopy height estimation from PlanetScope time series with spatio-temporal deep learning." Remote Sensing of Environment 318 (2025): 114518.
  • Kacic, Patrick, Frank Thonfeld, Ursula Gessner, and Claudia Kuenzer. "Forest structure characterization in Germany: novel products and analysis based on GEDI, Sentinel-1 and Sentinel-2 data." Remote Sensing 15, no. 8 (2023): 1969.
  • Turubanova, Svetlana, Peter Potapov, Matthew C. Hansen, Xinyuan Li, Alexandra Tyukavina, Amy H. Pickens, Andres Hernandez-Serna et al. "Tree canopy extent and height change in Europe, 2001–2021, quantified using Landsat data archive." Remote Sensing of Environment 298 (2023): 113797.

Other Strengths And Weaknesses

STRENGTHS:

  • Novel integration of horizontal and vertical structure understanding at pixel level
  • Creative combination of different contrastive learning approaches for different alignment tasks
  • Practical Utility: Addresses real-world challenges in Earth observation

WEAKNESSES:

  • Limited evaluation scope: restricted comparison with recent specialized models, limited geographical coverage (mainly French territory), and insufficient robustness testing across different years
  • Methodological gaps: limited analysis of low-data regime behavior, insufficient discussion of failure cases, and lack of cross-validation across different environmental conditions
  • Insufficient support for generalization claims: needs stronger evidence for outperforming specialized models, limited validation across diverse land cover datasets, and insufficient analysis of performance in different geographical regions

Other Comments Or Suggestions

No

Author Response

We thank reviewer hPct for the thorough review, comments, and suggestions.

1. Claims and Evidence

1. Concern (C): The paper's claim that it "often outperforms specialized supervised models" is partially supported by the evidence presented

Answer (A): We agree and have included performance results in comparison to Lang et al., Fayad et al., Liu et al., and Tolan et al. We have also added an additional dataset for this task (i.e., ALS). The results below clearly show that our model outperforms all current specialist canopy height (CH) models, even in the zero-shot (ZS) setting.

| Height Dataset | Metric | DUNIA (ZS) | Tolan | Liu | Lang | Fayad |
|---|---|---|---|---|---|---|
| ALS | rmse | 3.1 | 9.2 | 3.9 | 6.5 | 3.9 |
| GEDI | rmse | 2.0 | 8.5 | 5.2 | 5.6 | 2.8 |

2. C: The analysis would benefit from more extensive comparisons across different data regime sizes to better understand the model's behavior

A: We chose the sizes based on empirical findings: we presented the lowest data size below which the products were unusable for a given task. In the revised version, we have added an ablation on lower data regimes. All models showed a degradation in performance; however, the model ranking remained the same as in the original submission. Failure cases include overfitting or convergence towards a worse solution.

3. C: There is limited discussion or analysis explaining why the model performs well in low-data scenarios

A: This is mainly due to the self-supervised training with a contrastive objective. In the main text (line 69), we have included two references that discuss it, and our results corroborate their findings.

2. Methods And Evaluation Criteria

1. C: I strongly recommend that the authors compare the proposed method with well-established remote sensing foundational models

A: We had pre-trained DOFA, SatMAE, and DeCUR. Initially, they were not included due to their performance gaps on vertical-structure-related tasks. Due to time constraints, we couldn't pretrain SatMAE++, and we couldn't find any implementation of SkySense. Below are the non-linear probing (MLP) results for our model (DUNIA), DOFA, SatMAE, and DeCUR. All models were probed concurrently on the same dataset. The results show that these models severely underperform compared to our model (except on the PF dataset).

| Dataset | Metric | DUNIA | SatMAE | DeCUR | DOFA |
|---|---|---|---|---|---|
| Height | rmse | 1.34 | 10.5 | 11.0 | 11.0 |
| Cover | rmse | 9.8 | 30.2 | 28.5 | 29.2 |
| CLC+ | wF1 | 90.3 | 75.0 | 75.1 | 72.0 |
| PF | wF1 | 82.2 | 79.8 | 78.9 | 78.8 |

2. C: Benchmarking against recognized benchmarks (e.g., BigEarthNet) to assess robustness across different years and in different geographical regions.

A: We thank the reviewer for this suggestion. First, we would like to point out that BigEarthNet is a scene understanding task (i.e., an image embedding, not a pixel embedding, is used to predict a multi-label). As such, patch-based models are expected to perform better than pixel-based ones. The results below show that even though DUNIA's encoder was not pre-trained on the BigEarthNet dataset, and scene understanding is not DUNIA's intended use case, it still compares favorably with the other FMs.

| Dataset | Metric | DUNIA | CROMA | SatMAE | DeCUR | DOFA |
|---|---|---|---|---|---|---|
| BigEarthNet | mAP | 84.9 | 84.3 | 82.1 | 82.8 | 83.5 |

3. Essential References Not Discussed

A: Thank you for the suggestion. We have updated our manuscript accordingly. Please see 1.1.

4. Weaknesses

1. C: Restricted comparison with recent specialized models

A: Please see 1.1.

2. C: Limited geographical coverage (mainly French territory)

A: Please see 2.2.

3. C: Insufficient robustness testing across different years

A: We thank the reviewer for this comment. We have performed two test cases:

  1. Use the pre-trained model for year 2020 and fine-tune it (MLP) for 2019 and 2021
  2. Pretrain the model using data from three years: 2019, 2020 and 2021.

For both cases, performance results on canopy height mapping (below) show no significant differences year-to-year in either scenario.

Test case 1: Below is the fine-tuning performance on height estimation:

| Dataset | Metric | 2019 | 2020 | 2021 |
|---|---|---|---|---|
| Height | rmse | 1.35 | 1.34 | 1.40 |

Test case 2: Below is the zero-shot performance on height retrieval:

| Dataset | Metric | 2019 | 2020 | 2021 |
|---|---|---|---|---|
| Height | rmse | 2.4 | 2.0 | 2.1 |

4. C: Limited analysis of low-data regime behavior

A: Please see 1.2.

5. C: Cross-validation across different environmental conditions

A: We compared DUNIA in the zero-shot setting (ZS) across France's 10 major ecological regions (GRECOs A to J). These regions span 12 of the 18 Köppen-Geiger climate types found across Europe and range geographically from lowland plains and coastal plateaus to mountainous terrain. The standard deviations of the height, cover, and CLC+ metrics across the 10 GRECO regions were 0.28 m, 0.26%, and 0.68%, respectively, indicating similar performance across different environmental conditions.

Official Review (Rating: 3)

The paper introduces DUNIA, a new approach to learning pixel-level embeddings of Earth observation images. DUNIA is trained contrastively, aligning satellite images with full-waveform LiDAR data to enable understanding of both "vertical" and "horizontal" structures. Experiments measure the effectiveness of the embeddings on seven environmental monitoring tasks. They find that the embeddings enable zero-shot classifiers to perform comparably to or outperform supervised specialists, and that fine-tuning performance with small amounts of data is strong compared to the state of the art.

Update after rebuttal

The authors have addressed most of my concerns, so I have adjusted my score accordingly.

I still think the paper has problems with methodological motivation and clarity, and with presentation more broadly, which is why I haven't increased my score further. For example, why are two autoencoders (encoder-decoder architectures) used rather than simply two encoders with the third alignment model? Is there intuition for why the RVQ module is needed/helpful beyond what the authors state as "regularization by enforcing discrete representations" (it isn't clear to me why that is needed for this task)? These are just a few examples of design decisions that I don't think are clearly motivated, and there are many design decisions made in the paper. For decisions the authors are newly making, intuition should be provided. If designs are well studied and motivated in prior work, that should be clearly stated and cited.

I also think the tables are still hard to read. For example, there is inconsistent use of decimal formatting, inconsistent use of entries with values in parentheses, and lots of acronyms/shorthands that I don't think are necessary or that could be made clearer.

Questions For Authors

  1. Table 1 suggests DUNIA underperforms on PASTIS significantly - why do the authors think that is?
  2. Could you provide more intuitive explanations or visual examples that highlight why specific methodological choices were made?

Strong responses to the weaknesses as well as my other comments and questions could result in an improvement in my evaluation of the paper.

Claims And Evidence

The claims are supported by clear and convincing evidence.

Methods And Evaluation Criteria

The proposed methods and evaluation criteria make sense for the task of learning pixel-level EO embeddings. Encoding vertical information in the pixel-level embeddings through the contrastive procedure is an interesting idea and the authors validate it on seven tasks spanning four datasets where strong pixel-level information is required for good performance. However, there may be at least one baseline that should be compared against for proper contextualization in the literature.
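For readers unfamiliar with the setup, pixel-level cross-modal alignment can be illustrated with a standard symmetric InfoNCE objective between co-located pixel and waveform embeddings. This is a generic sketch of the idea only; the paper's actual objectives differ, and all names below are hypothetical.

```python
import torch
import torch.nn.functional as F

def pixel_waveform_alignment(pix_emb, wave_emb, tau=0.07):
    """pix_emb, wave_emb: (B, d) embeddings of co-located pixels and
    LiDAR waveforms; row i of each tensor forms the positive pair."""
    p = F.normalize(pix_emb, dim=1)
    w = F.normalize(wave_emb, dim=1)
    logits = p @ w.T / tau                              # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)  # diagonal entries = positives
    # Symmetric cross-entropy: pixel -> waveform and waveform -> pixel.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```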

Theoretical Claims

No theoretical claims were made in this work.

Experimental Designs Or Analyses

The experimental designs and analyses presented in the main text are sound.

Supplementary Material

I skimmed through the entire Appendix. The supplementary material is extensive and provides additional details that support the main text, including more on experimental setups and additional results. The ablation studies contained within are particularly useful for understanding the impact of different design decisions, and a few qualitative examples are presented as well.

Relation To Broader Scientific Literature

The contributions of this paper have implications for how to design strong pixel-level embeddings, which have several applications in remote sensing / EO. The idea to contrast LIDAR data with satellite imagery is new as far as I am aware, and it may inspire future research on self-supervised learning for pixel-level (or more broad self-supervised learning) EO data.

Essential References Not Discussed

One notable self-supervised method [1] learns a pixel-level encoder for satellite imagery. This should be discussed, and potentially even compared to in the results for proper contextualization in the literature. Additionally, it does not seem like the paper has a “self-supervised learning for EO” related work section, which further takes away from its positioning in the literature and makes it difficult for readers to assess the key differences and contributions from prior work.

[1] Lightweight, Pre-trained Transformers for Remote Sensing Timeseries. Tseng et al. 2023.

Other Strengths And Weaknesses

Strengths:

  1. Introduces a novel technique for generating high-resolution, pixel-sized embeddings of EO data.
  2. The method demonstrates broad applicability across a range of environmental monitoring tasks, sometimes outperforming specialists.
  3. Provides extensive supplementary material that supports and extends the main findings.

Weaknesses:

  1. Little-to-no discussion of self-supervised EO related work.
  2. The presentation needs work overall. For example:
    1. There is a lack of intuitive explanations for methodological choices. This makes it difficult for readers to understand why the authors made certain decisions, putting them into question.
    2. Some figures and tables are unclear and difficult to interpret, which could hinder understanding.
  3. Minimal qualitative examples which makes it hard for readers to gain a sense of how their method performs on specific examples compared to other methods.

Other Comments Or Suggestions

  1. The methodology section would benefit from more intuitive descriptions and examples to aid understanding.
  2. Including a small figure to clearly demonstrate the advantages of pixel vs. patch-sized embeddings would be helpful. Perhaps revamping Figure 1 to do this could make sense.
  3. Tables 1 and 2 are really hard to read and should be restructured to improve readability.
  4. Figure 2 has a ton of details which makes it hard to follow. Are all of these details necessary, or can some be removed in favor of focusing on the important aspects and simplifying the whole figure?
Author Response

We would like to thank reviewer WDGb for the thorough review and the helpful comments.

1. Methods And Evaluation Criteria

1. Concern (C): There may be at least one baseline that should be compared against for proper contextualization in the literature.

Answer (A): Thank you for this comment, also raised by hPct. We have added four additional baselines for the canopy height mapping task and one additional dataset (canopy height from ALS). Regarding the foundation model baselines (FMs), we have benchmarked against three new recently proposed FMs. We invite reviewer WDGb to read our responses to reviewer hPct regarding the performance results.

2. Essential References Not Discussed

1. C: It does not seem like the paper has a “self-supervised learning for EO” related work section.

A: We agree and thank you for raising this point. Originally, related works on self-supervised models for EO were cited based on their pre-training objectives. In the revised version, we have included a dedicated section for these models. This section now discusses the previously mentioned models, the new models included as part of our response to reviewer hPct, and Presto, the model mentioned by the reviewer.

3. Weaknesses

2.1 C: There is a lack of intuitive explanations for methodological choices

A: Thank you. We would be grateful if the reviewer could guide us towards the passages where we failed to deliver a convincing argument for a given design choice. In the original submission, we discussed the following design choices:

  1. Tokenization layer.
  2. Encoder and decoders.
  3. Losses.
  4. The choices for the two AEs.
  5. Waveform generation.

We did realize that our reliance on image composites (i.e., the median reflectance of a time series) as input, instead of a full time series, was not entirely clear. This has been addressed.

2.2 C: Some figures and tables are unclear and difficult to interpret

A: Please see our responses to your concerns 4.3. and 4.4. In particular, we have:

  1. Modified Figure 2.
  2. Modified Tables 1 and 2.

3. C: Minimal qualitative examples.

A: We agree. In the revised version of our manuscript, we have split Figure 3 into three figures and included additional maps / products. We hope that this change addresses the concern.

4. Other Comments Or Suggestions

1. C: The methodology section would benefit from more intuitive descriptions

A: Please refer to our response in 3.2.1

2. C: Including a small figure to clearly demonstrate the advantages of pixel-sized vs. patch-sized embeddings

A: Thank you. We have included a new figure in the Appendix of the revised version, which shows the difference between the two variants and the loss of detail induced by patch-sized embedding models.

3. C: Tables 1 and 2 are really hard to read and should be restructured to improve readability

A: We appreciate your suggestion. In the revised version, Tables 1 and 2 are now four tables. Please refer to our response to reviewer 7y1C (Concern 1.2) on this subject.

4. C: Figure 2 has a ton of details which makes it hard to follow

A: We have simplified Figure 2, only keeping the main blocks that help to understand the methodology. The original Figure 2 has been moved to the Appendix as a reference.

5. Questions For Authors

1. C: Table 1 suggests DUNIA underperforms on PASTIS significantly - why do the authors think that is?

A: Our objective with DUNIA is to ensure accessibility and broad applicability, so several compromises had to be made. One of them was not to rely on time series (TS) data as input, but rather on a median composite that still retains some phenological information. The advantages are that this:

  1. Alleviates the need to store large volumes of data.
  2. Broadens applicability, especially over areas with persistent cloud cover, where acquiring a TS over a given area would not be possible.
  3. Makes the model more efficient, as it only processes a single image instead of a full TS.

The trade-off is that a simple aggregation function like the median inevitably captures less temporal information than a dedicated TS-aware module. However, on six downstream tasks, we show that TS input is not always necessary to achieve strong performance. We believe that our approach strikes a good balance between accuracy, efficiency, and accessibility, even if it results in lower performance on datasets like PASTIS, which would strongly benefit from richer temporal information.
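The median composite described above is simple to compute; a minimal sketch (the array shapes are assumptions for illustration):

```python
import numpy as np

# ts: one year of co-registered S-1/S-2 acquisitions, shape (T, C, H, W)
ts = np.random.rand(24, 10, 256, 256).astype(np.float32)  # dummy stack

# Per-pixel, per-band median over time yields a single (C, H, W) image.
# It retains a coarse phenological signal (unlike one arbitrary date)
# but discards the full temporal profile a TS-aware module could exploit.
composite = np.nanmedian(ts, axis=0)
assert composite.shape == (10, 256, 256)
```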

2. C: Could you provide more intuitive explanations or visual examples that highlight why specific methodological choices were made?

A: In the revised version, we have added an additional figure to illustrate the advantages of pixel-based versus patch-based embeddings. We have also clarified our reasoning regarding the use of composite imagery vs. TS. If the reviewer feels that further justifications are needed, we will be happy to provide additional clarifications.

Official Review (Rating: 5)
  • The paper introduces DUNIA, a novel framework that generates pixel-level embeddings through cross-modal alignment between Sentinel-1 & 2 imagery and LiDAR waveform data.
  • The model incorporates several components, including a multi-modal pre-training model, two autoencoders, dual decoders with neighborhood attention, contrastive losses (Zero-CL and VICReg; a generic sketch of VICReg follows this list), and a latent diffusion model for waveform generation.
  • Extensive experimental evaluations demonstrate the framework's effectiveness in multiple downstream Earth observation tasks, achieving state-of-the-art results in zero-shot and low-shot learning scenarios.
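Of the two contrastive losses named in the summary, VICReg (Bardes et al., 2022) is the more widely known; below is a self-contained sketch of its three terms, with default weights following the original paper. It is illustrative only, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """z1, z2: (B, d) embeddings of two views of the same batch."""
    inv = F.mse_loss(z1, z2)                             # invariance term
    std1 = torch.sqrt(z1.var(dim=0) + eps)               # variance term:
    std2 = torch.sqrt(z2.var(dim=0) + eps)               # hinge each feature's std at 1
    var = F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean()
    B, d = z1.shape
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)  # center features
    cov1 = (z1c.T @ z1c) / (B - 1)                       # covariance term:
    cov2 = (z2c.T @ z2c) / (B - 1)                       # penalize off-diagonal entries
    off = lambda m: (m - torch.diag(torch.diag(m))).pow(2).sum() / d
    cov = off(cov1) + off(cov2)
    return sim_w * inv + var_w * var + cov_w * cov
```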

Questions For Authors

  • The methods section could be improved by adding a brief subsection that explains the overall flow of the proposed pipeline in a simplified manner. Currently, it relies on a complex figure and separate explanations of individual modules, leaving the actual flow of the method unclear.
  • Tables 2 and 3 could be refined, as they currently appear to present a large amount of information without clear organization. It is not clear why the authors selected the FMs AnySat and CROMA for comparison. Is it because these are the only two models that use Sentinel-1 and Sentinel-2 data in their pre-training? If other models also utilize this data, an explanation for choosing only these two would be helpful.

Claims And Evidence

  • The paper's claims about the efficiency and effectiveness of the proposed method are well-supported by extensive experiments. The authors provide detailed quantitative results across multiple datasets and tasks, clearly validating the benefits of their pixel-level embeddings and cross-modal alignment strategies.

Methods And Evaluation Criteria

The authors have carefully selected relevant datasets, performance metrics, and models to construct their framework, reinforcing the robustness and applicability of their approach.

Theoretical Claims

N/A

Experimental Designs Or Analyses

The experimental design is valid, and the analysis is discussed in comprehensive detail, reinforcing the credibility of the approach.

Supplementary Material

Yes, I have read the supplementary material whenever additional details were needed, as referenced in the main paper.

Relation To Broader Scientific Literature

The paper not only delivers significant performance enhancements but also rigorously benchmarks against established literature baselines, underscoring its substantial potential to advance satellite-based tasks.

Essential References Not Discussed

The authors provide a comprehensive discussion of related work, supported by an extensive list of references.

Other Strengths And Weaknesses

Strengths:

  • Comprehensive combination of diverse, advanced modeling techniques including multi-modal pre-training, dual decoders, and latent diffusion models.
  • The paper is well-written and structured, making it easy to follow the detailed analyses and experimental evaluations.

Other Comments Or Suggestions

N/A

Author Response

First, we thank the reviewer for their feedback and their appreciation of our work. The following are our responses to your comments.

1. Questions For Authors

1. Concern (C): The Methods section could be improved by adding a brief subsection that explains the overall flow of the proposed pipeline in a simplified manner. Currently, it relies on a complex figure and separate explanations of individual modules, leaving the actual flow of the method unclear.

Answer (A): This concern was also raised by WDGb.

  1. We simplified the Methods section and revised the first paragraph of Section 3 (Approach) to make it easier for the reader to follow.
  2. We have simplified Figure 2, only keeping the main blocks that aid in understanding the methodology. The original Figure 2 has been moved to the Appendix as a reference.

2. C: Tables 2 and 3 could be refined, as they currently appear to present a large amount of information without clear organization.

A: This comment was also raised by WDGb. Due to space limitations, we had to format them as they were so that they could be included in the main text. In the revised version, Table 1 has been split into two tables: the first table now includes only the default configuration for the zero-shot classifier (i.e., sample size S and different KNNs), while the second table includes the modifications to the zero-shot classifier. Table 2 has also been split in a similar fashion.

3. C: It is not clear why the authors selected the FMs AnySat and CROMA for comparison. Is it because these are the only two models that use Sentinel-1 and Sentinel-2 data in their pre-training? If other models also utilize this data, an explanation for choosing only these two would be helpful.

A: Although data requirements were a factor in selecting the models we compared against, the decision to include these two models was based on the type of EO applications they target, their performance compared to other EO foundation models, and their recency and novelty. However, following the comments of reviewer hPct, we added three more models as baselines, namely DOFA, SatMAE, and DeCUR. We invite reviewer 7y1C to see our response to reviewer hPct for the comparison results.

Again, thank you for your time reviewing our paper.

Final Decision

This paper received divergent scores after the rebuttal. Reviewers 7y1C, zT7y, and WDGb gave positive scores, while reviewer hPct remained at Weak Reject.

All reviewers praised the paper for the significance of its framework, which generates pixel-level embeddings by effectively integrating Sentinel-1/2 imagery with LiDAR waveform data. They liked that this work not only advances cross-modal alignment methods but also enables strong zero-shot and low-shot performance on multiple Earth observation tasks. In addition, the authors successfully resolved some of the issues regarding methodological clarity. However, concerns remain about presentation, particularly the readability of figures and tables, and about the theoretical motivation behind some architectural decisions.

After carefully reviewing the paper, the reviews, and the rebuttal, the AC acknowledges that the concerns raised by reviewers WDGb and hPct are valid. Nonetheless, the paper shows sufficient technical novelty and strong performance in diverse environmental monitoring applications. Based on these considerations, the AC recommends a Weak Accept, with the expectation that the authors will carefully address the comments in the final version of the paper.