Combining Observational Data and Language for Species Range Estimation
TL;DR: We developed a method for creating species range maps by integrating citizen science data with Wikipedia text, enabling accurate zero-shot and few-shot range estimation.
Abstract
Reviews and Discussion
The paper introduces LE-SINR, a novel approach for estimating species range maps (SRMs) by combining citizen science observations with textual descriptions of species from Wikipedia. The proposed framework, an extension of the earlier SINR model, uses two branches, one for location and one for text, to predict the likelihood of a species being present at a given location and for a given textual description, respectively. The location branch additionally incorporates learnable species embeddings to enable training in a fully-supervised setting. The work demonstrates that LE-SINR outperforms baseline models in zero-shot range estimation, indicating its ability to generate plausible range maps from textual descriptions. Furthermore, LE-SINR shows strong performance in few-shot range estimation, highlighting its capacity to leverage textual information for enhanced accuracy when observational data is limited. The study underscores the potential of incorporating textual data for improving the accuracy and efficiency of species range estimation, with implications for ecology research, conservation, and planning.
Strengths
- The proposed framework is one of the earliest works to incorporate textual descriptions for species distribution modeling, allowing zero-shot species mapping. The framework combines two training objectives: weakly-supervised learning (through text) and fully-supervised learning (through learnable species embeddings).
- The authors compiled a novel dataset using Wikipedia articles containing a total of 37,889 species articles. In the future, the dataset can not only be used for other ecological tasks but also as a training/benchmarking dataset for large language models in the ecology domain.
- The model has good few-shot and zero-shot performance compared to the previous baseline, SINR.
- The flow of the paper in general is good and easy to understand.
Weaknesses
- My main concern is that the authors present zero-shot results of LE-SINR using range text (which includes region names) or habitat text as input. The entire idea of species distribution modeling is to predict the range of a given species. By providing range/habitat as text, the model can just "cheat". Further, as an end-user application, this kind of text may not be provided as input by a user.
- In my opinion, some key results are missing from the paper:
- L145-L147: "training both species representations jointly, we are able to achieve improved zero-shot performance". By how much? Is it significant?
- Fully-supervised/oracle results for LE-SINR are not reported. It would be nice to compare SINR and LE-SINR using the same experimental setting, where species categories in the test set are also used during training.
- Limited Technical Novelty: The proposed framework builds upon the previous work, SINR, which uses a ResNet MLP for location encoding. Such networks have been shown to lack the ability to capture high frequency spatial information. Recently, several location encoding frameworks have been proposed such as GeoCLIP [1], SH [2], etc. The paper does not compare the performance of different location encoding backbones.
- Discussion and evaluation of what the text encoder branch has learned are missing from the paper. Does it learn the hierarchy present in the species taxonomy? Do the textual embeddings have some spatial correlation or patterns? Do similar species have similar textual embeddings?
[1] Cepeda, Vicente Vivanco, Gaurav Kumar Nayak, and Mubarak Shah. "GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization." In Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[2] Rußwurm, Marc, Konstantin Klemmer, Esther Rolf, Robin Zbinden, and Devis Tuia. "Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks." In The Twelfth International Conference on Learning Representations. 2023.
Questions
- Section 1 (Introduction) and Section 3.1 (Problem Setup), use the notation ‘location’ for representing (lat, lon). However, the rest of the paper uses ‘position’ for the same.
- Section 3.3 is confusing to read. Writing should be improved.
- L139-L143, the text refers to species branch. What exactly is that referring to? From Figure 2, it looks like the location encoder has two branches of computation. However, from the text it seems that the text encoder has two branches of computation.
- L148-L152, please clearly explain here that the final loss is formulated as a matching loss which maps as R^256->R^1. The model needs to perform S forward passes to compute the loss for each species category. That is why computing the original An-full loss is computationally expensive in this framework and a modified version is used as described in section 3.4.
- I understand the formulation of loss in Section 3.4, however it is confusing to read. Please include an equation to highlight how the proposed loss is different from the loss used in SINR.
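A minimal sketch of the matching formulation raised in the L148-L152 comment above: a location feature vector and a species embedding of the same length are mapped to a single presence score. The inner product followed by a sigmoid is an assumption made here for illustration (the thread only states that the mapping is R^256 -> R^1), and `match_score` and `score_all_species` are hypothetical names.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def match_score(loc_feat, species_emb):
    """Map a (location feature, species embedding) pair to one probability."""
    if len(loc_feat) != len(species_emb):
        raise ValueError("feature and embedding must have equal length")
    # Assumed scoring rule: inner product followed by a sigmoid.
    return sigmoid(sum(a * b for a, b in zip(loc_feat, species_emb)))


def score_all_species(loc_feat, species_embs):
    # One score per species: with S species this loop runs S times, and each
    # text-derived embedding would itself need a text-encoder forward pass.
    return [match_score(loc_feat, emb) for emb in species_embs]


# Toy example with 4-d vectors standing in for the 256-d ones.
loc = [0.5, -1.0, 0.0, 2.0]
embs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]
scores = score_all_species(loc, embs)
```

The loop over species is what makes a loss over every species costly when each embedding comes from a text encoder, which is the point the comment asks the authors to state explicitly.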
Limitations
The limitations are adequately addressed in the paper.
[CYPb-1] The model is given range/habitat information as input.
It is true that the model is given this information as free-form input text. However, these range descriptions do not provide enough detail to draw a precise range map. For instance, knowing that the Gray Kingbird breeds in the "extreme southeast of the United States, mainly in Florida" (Appendix B, Example #1) is not enough information to draw a map of its range within the United States. What parts of Florida? And where in the southeastern U.S. outside Florida?
We demonstrate that it is possible for a model to make use of this relatively coarse text information to produce strong zero-shot and few-shot performance (see Fig. 3). This type of range information could indeed be readily available for a species that was historically recorded to science with no spatial observations or simply for species with very few observations (i.e. the vast majority of species on platforms such as iNaturalist only have a few observations and thus would benefit from better few-shot range estimation models).
[CYPb-2] Results when training with text only L145-L147.
We present the results for our model without the learned species tokens (“LE-SINR no species tokens”) for the zero-shot setting in the table below. Its performance is similar to our full model where these tokens are learned (“LE-SINR”). The advantage of including the learned tokens is that we can evaluate on those species directly if they are included in the training set. We will update the text on L145 to more precisely quantify their performance similarities in the zero-shot setting.
| Model | IUCN - habitat | SNT - habitat | IUCN - range | SNT - range |
|---|---|---|---|---|
| LE-SINR no species tokens | 0.319 | 0.525 | 0.534 | 0.629 |
| LE-SINR | 0.320 | 0.525 | 0.533 | 0.636 |
[CYPb-3] Adding evaluation species to the training set.
As requested, we performed an additional experiment where we added the evaluation species to the training set (“LE-SINR w. eval species”) and compared these results to our model from the paper where they are excluded (“LE-SINR wo. eval species”). Perhaps unsurprisingly, adding the evaluation species improves performance. We will add these results to Table 1 in the revised text. Note, our performance is lower than the SINR baseline in Table 1 as we are only using LLM summarized text as input at evaluation time (L206) which differs from the training text (L186), whereas SINR gets to use the learned species tokens.
| Model | IUCN - habitat | SNT - habitat | IUCN - range | SNT - range |
|---|---|---|---|---|
| LE-SINR wo. eval species | 0.320 | 0.525 | 0.533 | 0.636 |
| LE-SINR w. eval species | 0.384 | 0.610 | 0.598 | 0.685 |
[CYPb-4] Comparison to different location encoders.
Our approach is agnostic to the choice of location encoder used. As requested, we performed additional experiments where we compare to GeoCLIP (Cepeda et al. NeurIPS 2023) and Spherical Harmonics (Rußwurm et al. ICLR 2024). In both cases, these encoders also make use of our LLM encoder. In the table below we observe that our LE-SINR approach outperforms these approaches. These results are perhaps not too surprising because, as noted in Table 1 (c) in Rußwurm et al. ICLR 2024, their spherical harmonic encoding does not actually perform better at the geo prior task (the one most closely related to the range estimation task) compared to the standard “wrapped” encoding that we use. Additionally, for the GeoCLIP experiment, where we start with their pre-trained network instead of our location encoder and fine-tune it, we also outperform them. This is also not surprising, as their encoder is trained on web-sourced images that depict common everyday categories and are not necessarily specific to the natural world.
| Model | IUCN - habitat | SNT - habitat | IUCN - range | SNT - range |
|---|---|---|---|---|
| GeoCLIP | 0.229 | 0.489 | 0.414 | 0.579 |
| Spherical Harmonics | 0.309 | 0.518 | 0.528 | 0.626 |
| LE-SINR | 0.320 | 0.525 | 0.533 | 0.636 |
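
For readers unfamiliar with the “wrapped” encoding mentioned in the response above, the following is a minimal sketch of one common sinusoidal form. The exact normalization is not stated in this thread, so the scaling to [-1, 1] and the function name `wrapped_encoding` are assumptions.

```python
import math


def wrapped_encoding(lon_deg: float, lat_deg: float):
    """Sinusoidal encoding of a (longitude, latitude) pair.

    Assumed form: coordinates are scaled to [-1, 1] and passed through
    sin(pi * x) and cos(pi * x), so longitudes -180 and +180 coincide.
    """
    lon = lon_deg / 180.0
    lat = lat_deg / 90.0
    return [
        math.sin(math.pi * lon),
        math.cos(math.pi * lon),
        math.sin(math.pi * lat),
        math.cos(math.pi * lat),
    ]


# The antimeridian is a single place, so both encodings should agree.
east = wrapped_encoding(180.0, 10.0)
west = wrapped_encoding(-180.0, 10.0)
```

The design point is continuity: raw longitude jumps from +180 to -180 at the antimeridian, whereas the sine and cosine features do not.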
[CYPb-5] Discussion of what the text encoder has learned.
This is a good suggestion, and we will include additional discussion regarding what our language encoder has learned. Regarding hierarchy, our comparisons to LD-SDM (see response to 9uUS) indicate that simply encoding taxonomic hierarchy results in sub-optimal performance. In response to q7Tg’s question we also plan to illustrate which parts of the text are important to the model.
[CYPb-6] Use of the words “location” and “position”.
Thanks for flagging this, we will make our usage of the terminology more consistent.
[CYPb-7] Species branch (L139-L143).
The species branch refers to the two mechanisms we have for encoding species information: our text-based encoder and learned species tokens. We will refine this text and Fig. 2 to make this clearer.
[CYPb-8] Clarification of loss function computation (L148-L152).
We will add the suggested text to improve the description in this section.
[CYPb-9] Description of training loss in Section 3.4.
We will update the text in this section to make it easier to read and add an equation such that the comparison to SINR is clearer. Thanks for the suggestion!
I thank the authors for addressing the concerns raised by all the reviewers. I appreciate the authors reporting additional results that help strengthen the paper. Although I am still not entirely convinced about providing the model with range/habitat text as input, especially when non-experts are using the system, I have updated my score after carefully considering the responses to other reviewers.
This paper presents LE-SINR, which maps species observations and textual descriptions into the same space and enables zero-shot range mapping for unseen species. The textual descriptions of species are encoded with an LLM and used as species embeddings that are jointly trained with location embeddings for species range mapping.
Strengths
- Using free-form textual descriptions of species plus an LLM to generate species embeddings, trained together with location embeddings, sounds like a logical next step based on the existing work. The zero-shot ability is very attractive.
- The experimental setup looks very sound.
- The geospatial visualization also hints at the meaning of the learned location embeddings.
Weaknesses
- In Section 2, the authors mention that the work most relevant to LE-SINR is LD-SDM. However, I do not see LD-SDM as one of the baselines in the zero-shot and few-shot experiments.
- More ablation studies on the loss function on multimodal data are needed. Contrastive losses such as those of SatCLIP [1] and CSP [2] are candidate loss functions to compare with the one used in the paper.
- Equations 1 and 2 look weird. The first term of both loss functions should only have one "-", right?
Minor issues:
- Line 118: "species x_i" -> "species y_i"
[1] Klemmer, Konstantin, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. "Satclip: Global, general-purpose location embeddings with satellite imagery." arXiv preprint arXiv:2311.17179 (2023).
[2] Mai, Gengchen, Ni Lao, Yutong He, Jiaming Song, and Stefano Ermon. "Csp: Self-supervised contrastive spatial pre-training for geospatial-visual representations." In International Conference on Machine Learning, pp. 23498-23515. PMLR, 2023.
Questions
- In few-shot evaluation, when you do the logistic regression, which part of the model architecture is frozen?
- Figure 3: Is the left figure about zero-shot evaluation? If so, what does the x-axis of observations per species mean?
Limitations
The authors discuss limitations and the potential negative societal impact of their work in Section 4.4.
[9uUS-1] Comparison to LD-SDM.
While related, LD-SDM (Sastry et al. arXiv 2023) uses text data in a fundamentally different way to us. Instead of using unstructured text as input, their model generates an input string for each species that encodes its full taxonomic hierarchy (e.g. class -> order -> family -> genus -> species). Then for their zero-shot evaluation, they simply remove the final species name from the input text and evaluate on previously unseen species. Their core assumption is that they have seen species from that genus at training time. Our approach does not make such a strong assumption.
As requested, we performed an additional comparison to LD-SDM. We use the same architecture as our LE-SINR method with environmental feature inputs and simply change the training text. As our text encoder does not use causal attention, during training we randomly blank out different tokens to simulate the full taxonomic hierarchy not being present in the text string at evaluation time. In the table below, where the evaluation species are part of the training set, we crop the input text to different taxonomic levels during evaluation to simulate having coarser information (e.g. the “genus” row, corresponds to only including the full taxonomic name up until genus at test time).
| LD-SDM | IUCN | SNT |
|---|---|---|
| species | 0.214 | 0.334 |
| genus | 0.192 | 0.298 |
| family | 0.127 | 0.246 |
| order | 0.067 | 0.207 |
| class | 0.047 | 0.19 |
In comparison, our approach performs significantly better.
| LE-SINR | IUCN | SNT |
|---|---|---|
| habitat | 0.3198 | 0.5246 |
| range | 0.5332 | 0.6363 |
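
The taxonomic cropping used in the LD-SDM comparison above can be illustrated with a short sketch. The exact string format fed to the model is not given in this thread, so the " -> " delimiter (borrowed from the response's own example) and the helper name `crop_taxonomy` are assumptions.

```python
# Taxonomic levels from coarsest to finest, as listed in the response.
LEVELS = ["class", "order", "family", "genus", "species"]


def crop_taxonomy(taxonomy: dict, level: str) -> str:
    """Build the taxonomy string truncated at `level` (inclusive)."""
    keep = LEVELS[: LEVELS.index(level) + 1]
    return " -> ".join(taxonomy[name] for name in keep)


# Gray Kingbird, the species used as an example elsewhere in this thread.
kingbird = {
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Tyrannidae",
    "genus": "Tyrannus",
    "species": "Tyrannus dominicensis",
}

genus_text = crop_taxonomy(kingbird, "genus")
class_text = crop_taxonomy(kingbird, "class")
```

Each row of the LD-SDM table corresponds to evaluating with a string cropped at that row's level, so the "class" row sees only the coarsest label.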
[9uUS-2] Comparisons to contrastive losses.
As requested, we perform additional comparisons using a contrastive training loss, specifically the one from SatCLIP (Klemmer et al. arXiv 2023). During training we compute a contrastive loss between species and locations within a batch. We also extend this method (SatCLIP + negs.) by adding 50% uniform random locations as additional negatives. However, in both cases we see that it underperforms compared to our method. Additionally, our LE-SINR approach is compatible with different input encodings. See response to CYPb for further comparisons.
| Model | IUCN - habitat | SNT - habitat | IUCN - range | SNT - range |
|---|---|---|---|---|
| SatCLIP | 0.255 | 0.521 | 0.463 | 0.630 |
| SatCLIP + negs. | 0.306 | 0.503 | 0.523 | 0.617 |
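
As a rough illustration of an in-batch contrastive loss between species and locations, the sketch below uses a CLIP-style symmetric cross-entropy. The thread does not say which exact variant, normalization, or temperature was used in the rebuttal experiments, so the symmetric form, the default `temperature=0.07`, and all function names are assumptions; the "+ negs." extension is not shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    # Numerically stable log-softmax of one list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def contrastive_loss(loc_embs, species_embs, temperature=0.07):
    """Symmetric in-batch contrastive loss.

    Pair i (location i, species i) is the positive; every other pairing
    inside the batch acts as a negative.
    """
    n = len(loc_embs)
    logits = [[dot(l, s) / temperature for s in species_embs] for l in loc_embs]
    # Location -> species direction: softmax over each row.
    loc_to_sp = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Species -> location direction: softmax over each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    sp_to_loc = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loc_to_sp + sp_to_loc)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(aligned, aligned)
loss_swapped = contrastive_loss(aligned, swapped)
```

One caveat with this kind of objective is that two observations of the same species in one batch are treated as negatives for each other.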
[9uUS-3] Equations 1 and 2.
Yes there is a minor typo here, thanks for flagging this. We will remove the extra minus at the start.
[9uUS-4] Which parts of the model are frozen for few-shot evaluation.
During few-shot evaluation, the entire model is frozen (i.e. the location encoder and the text-based species encoder). We only optimize the final classification weight vector for the species of interest. We will update Section 3.6 to clarify this.
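A minimal sketch of this few-shot setup, assuming plain logistic regression fitted by gradient descent on features that come from the frozen location encoder. The toy feature vectors, learning rate, and step count are illustrative assumptions, and how the text-derived embedding enters the fit is not specified in this thread, so it is left out.

```python
import math


def sigmoid(z: float) -> float:
    # Stable for large negative or positive z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, feat):
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feat)))


def fit_species_weights(pos_feats, neg_feats, lr=0.5, steps=200):
    """Fit one species' weight vector; the features themselves never change."""
    dim = len(pos_feats[0])
    w = [0.0] * dim
    data = [(f, 1.0) for f in pos_feats] + [(f, 0.0) for f in neg_feats]
    for _ in range(steps):
        grad = [0.0] * dim
        for feat, label in data:
            err = predict(w, feat) - label
            for k in range(dim):
                grad[k] += err * feat[k] / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Toy frozen features: a few observations of the target species (positives)
# and features of background locations (negatives), e.g. sampled uniformly.
pos = [[1.0, 0.2], [0.9, 0.1]]
neg = [[-1.0, 0.3], [-0.8, -0.2], [-0.9, 0.0]]
weights = fit_species_weights(pos, neg)
```

Only `weights` is learned here; the encoder that produced `pos` and `neg` is untouched, which mirrors the frozen-model description above.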
[9uUS-5] X-axis in Fig. 3 (left).
As indicated in the caption, this figure displays both zero-shot (i.e. “Observations per Species” = 0) and few-shot (i.e. “Observations per Species” > 0) performance. Only our language enhanced models are capable of producing zero-shot predictions. We will update the caption to make this clearer.
I am satisfied with the authors' response and am willing to raise my score to 7.
The paper considers the problem of species range mapping, where the aim is to estimate, at any given location on the earth, if a particular species is present or not. The work builds on another very recent paper which developed "Spatial Implicit Neural Representations", where the aim is to estimate presence/absence of different species by encoding geographical information through neural network representation. This work extends SINR by utilizing textual information about the target species, which may be present in Wikipedia pages. For this purpose, they consider embedding of textual features using an LLM. The textual and geographical embeddings are fused to predict the presence/absence of species at different places, and this approach allows them to extend this to even those species for which there are very few or even zero observations.
Strengths
The paper discusses species range mapping - a very important problem in conservation biology, ecology and other related disciplines. This is one of the early applications of ML in this domain, potentially opening up an entirely new area.
- The paper introduces LE-SINR, where textual data (Wikipedia entries) are combined with observations to identify the presence/absence of species at different locations of the world
- The main technical contribution lies in the use of LLMs to extract geospatial information from text, and to encode such information into spatial locations on the map
- The work shows extensive experiments to validate their claims and develop useful maps. Apart from the main aim of species range mapping, these maps are also used to demonstrate the effectiveness of geospatial information extraction by LLMs and its embedding.
Weaknesses
No weaknesses as such since this is a very new domain of application, but the paper throws up some unanswered questions, mentioned below:
Questions
- We would like to understand how exactly the geospatial encoding works in response to text prompts. In Figs 4 and 5 we see the spatial outputs in response to textual inputs - can we have some understanding of which parts of the text were most informative for the LLM in identifying the spatial locations?
- In most of the examples, we see that the locations highlighted on the map are very specific - typically one or two clusters in one continent. But how does the model work for mapping of species (or non-species concepts) which are common to many places across the world?
- In the zero-shot case, are the range maps limited to those locations that are explicitly mentioned in the text prompt, or can the LLM make additional inferences (since it has been pre-trained on a much wider corpus of text)?
- Can we use the concepts of ecology (related to predator-prey or symbiotic relations) as side information, to use the known information about presence/absence of some species at a location to infer the presence/absence of other species there?
- Similarly, if a species is known to be present in some types of habitats (based on climate zones, biomes etc), can we infer about their possible presence in similar habitats?
- Can we move from presence/absence detection to estimating population sizes of different species in different locations?
Limitations
Since this work is related to conservation sciences, accuracy (especially precision/recall) and robustness are important considerations. It is not very clear to me if this work can be used to ascertain the conservation statuses of different species, but if it can be, the work may require a more sensitive treatment. Also, we need to understand if this work can be used by malicious players (e.g. illegal poachers, smugglers) to perform operations that may endanger the biodiversity of the earth.
[q7Tg-1] Which parts of the text were most informative for the LLM?
This is a great suggestion. Given the space limitations we cannot provide a visualization of this in the rebuttal, but we will include this in the final revised paper. At a high level, text which more directly encodes information about the range is more informative than text that just describes the habitat (see results in Fig. 3).
[q7Tg-2] Does the model work for species that are common to more than one location?
Just like SINR, there is nothing conceptually stopping the model from predicting that a species occurs in more than one location on earth. In Fig. B3 in the rebuttal PDF we illustrate two examples of text that describe different climate zones that overlap with multiple different countries. We will include some additional species’ examples in the final revised text.
[q7Tg-3] For zero-shot evaluation, can the model generalize beyond locations explicitly mentioned in the input text?
Yes. The example in Fig. B3 in the rebuttal PDF shows examples where specific locations are not mentioned, yet the model produces sensible predictions. Some of the examples in Fig. 5 in the main paper also demonstrate that the model is able to spatially encode non-species concepts from a small number of non-geographic words (e.g. “Hello Kitty” or “Babe Ruth”). We believe that exploring how LLMs spatially represent concepts is an exciting future research direction.
[q7Tg-4] Can we use additional ecological concepts as side-information?
This is a really interesting suggestion! It is beyond the scope of this work, but it could be possible to use text related to species interactions (or lack thereof) to further improve the spatial representations we learn. Such knowledge is not likely explicitly codified at a large scale in a structured way online, but this makes natural text an ideal candidate to encode it.
[q7Tg-5] Can the model infer species’ presence in similar habitats?
Yes. We believe that the model is already doing reasoning of this form. However, there are some possible limitations of non-spatially explicit encodings (i.e. only using environmental features and not spatial coordinates as input). When only using information derived from environmental habitats, it could be possible for a model to incorrectly predict that a species is present in that habitat across different continents (see “desert” example in bottom row of Fig. 5).
[q7Tg-6] Can we estimate population size instead of just presence and absence?
This is a longstanding and open question in statistical ecology, and out of scope for this initial work. This question has been studied extensively in the literature (e.g. Pearce and Boyce, Journal of Applied Ecology 2005), but without information about true absences or individuals it remains underconstrained. However, proxies such as relative abundance can sometimes be estimated.
[q7Tg-7] Estimating the conservation statuses of different species?
As noted in Section 4.4, great caution should be exercised when using the outputs of our work for any downstream conservation assessments. While species range is an important factor in determining the threatened status of a species for repositories such as the IUCN Red List, other information is also needed (e.g. the list of threats they are susceptible to).
I thank the authors for their responses, and I really like the work.
The authors extend the SINR model for species distribution modeling (SDM) by aligning the learned, spatial, latent space with the representation of the species habitat/range provided by an LLM. This addition allows the authors to evaluate their approach in a zero-shot setting, something that SINR or other SDM models are unable to do. Their results also show the advantage of this method in a few-shot setting.
Strengths
The approach is very simple and intuitive, and the zero- and few-shot results show that this has a good potential for usability with species that have few observations.
Weaknesses
I would certainly like to see this work published, as I find their main contribution (evaluating the usefulness of species habitat/range text for SDM) interesting for researchers working on this topic. However, I do wonder whether this is the type of contribution the NeurIPS readership is looking for.
Questions
- I’m not sure why the loss that seems to be the one used for training, Eq. (2), appears in the section on few-shot evaluation, rather than in the loss functions section. I suggest that the authors revise the order in which they present their method for understandability.
- I think page 9 could be better used than it is now. Although the mapped concepts are fun to see, I think they belong in the appendix. I would suggest instead doing a more relevant evaluation, for instance by using Köppen climate zone descriptions.
Limitations
Limitations of the approach are addressed.
[A4uw-1] Within scope for NeurIPS?
We believe that NeurIPS is an appropriate venue for this work given similar work published at top ranked machine learning venues in the past. For example, NeurIPS (“Active Learning-Based Species Range Estimation” by Lange et al. and “SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data” by Teng et al.), ICML (“A Maximum Entropy Approach to Species Distribution Modeling” by Phillips et al. and “Spatial Implicit Neural Representations for Global-Scale Species Mapping” by Cole et al.), ICLR (“Geographic location encoding with spherical harmonics and sinusoidal representation networks” by Rußwurm et al.), and AAAI (“Bias reduction via end-to-end shift learning: Application to citizen science” by Chen and Gomes). We believe that developing new methods for estimating the spatial range of species is a question of societal importance and requires input from machine learning researchers. Additionally, insights into how spatial information is encoded into LLMs are highly relevant to the machine learning audience.
[A4uw-2] Moving equation 2 for enhanced readability.
Thank you for the suggestion! We will make this fix.
[A4uw-3] Köppen climate zone descriptions.
Evaluating more species related text descriptions such as Köppen climate zones is an interesting suggestion. As requested, we have included two zones in Fig. B3 in the rebuttal PDF. We observe LE-SINR is able to give plausible estimates of these zones, particularly where iNaturalist has reasonable training data coverage. However we observe false negative predictions in areas of low data coverage (e.g. central Africa in the figure on the right).
I thank the authors for the additional results. The climate zone descriptions do seem to provide a more interesting insight into the limitations of the model.
After seeing the concerns of the other reviewers and the authors' responses, I lean towards acceptance.
This paper considers the problem of species range map (SRM) estimation and proposes to combine textual descriptions of the habitat or range of species from Wikipedia with geolocated citizen science species observations, building on the SINR model. The method is evaluated in the context of zero-shot and few-shot estimation of SRMs for eBird S&T and IUCN species. This is an interesting idea to tackle the problem of SRM estimation for species with few observations. Some results seem promising, but I have some concerns and questions about the experiments and choice of the evaluation tasks. I am open to revising my score if the authors clarify them.
Strengths
- The paper is well-written.
- The problem considered is important, and the angle of looking at few-shot and zero-shot learning for SRMs has so far been overlooked in the literature.
- The paper proposes a novel multi-modal approach to species range mapping.
- The related works section is comprehensive.
- The figures are well presented and illustrate the main points of the paper well.
Weaknesses
Motivation and use case:
I am unsure of the motivation for estimating SRMs based on “textual descriptions that might be known to an ecologist” of range observations. The map seems to be already contained in the range description, which appears quite detailed: an end user of this model could probably draw a range map from the textual description alone, or would gain little additional information from a map produced by LE-SINR beyond the textual description, which is already reliable. I am happy to hear more about the use cases the authors had in mind when developing this method.
On the other hand, using habitat information is an interesting proposition but habitat suitability maps are not range maps, and there are inherent limitations to using habitat descriptions to estimate range, and can lead to some problematic predictions if species have the same habitat (but not the same range) and those are usually treated in ecology as different problems. That being said, I have noted that the examples presented in Figure 4 with habitat text description show that the model is able to “restrict” geographically the range of species successfully to the relevant continents, without any postprocessing I suppose (?).
Baselines:
The baselines seem a bit simplistic. I would have expected another baseline to be a model trained on environmental variables from WorldClim, for example, in order to have some comparison to a model that is not just the mean species distribution map for all species or a constant prediction. Especially given that all models compared to LE-SINR have the eval species in the training set, it seems it would be possible to make more realistic baselines. But maybe I missed a point, and therefore: Could the authors describe how the mean species distribution map is obtained? Could the authors describe what the baseline model "Model Mean +Env +Eval Sp." consists of?
Results:
It seems that the method is most advantageous in the zero-shot setting (whose motivation I raised some concerns about in the “motivation and use case” point), and performance in the few-shot setting is not very convincing. In 4.2, it is highlighted that one of the reasons might be that “the logistic regression models are trained independently for each target species using uniform negative samples, the original SINR model trains all the species together, benefiting from other species observations to capture the negative set.” It is appreciated that the authors highlight this difference, and it would make a stronger paper and comparison if the SINR design choices for capturing the negative set were kept for LE-SINR (and perhaps give more convincing results).
Minor comments:
No error bars are reported, but the authors provide a justification for that in the paper checklist.
The paper has some good ideas, and some examples shown seem to point to the potential of the proposed method, but the choice of the evaluation tasks is questionable given the information present in the paper. It seems that the main highlighted advantage of the method is the fact that it can be applied in a zero-shot setting, and the method does not seem particularly advantageous in comparison to existing methods in other settings. I am open to revising my score, but would need more details on the motivation for this choice of evaluation tasks.
Questions
In addition to concerns/questions in the Weaknesses section, I have the following questions:
- In section 3.2, it is mentioned that “not all species in our observation dataset have an associated text description”. Can you clarify how the species with no text descriptions are handled? I understood that only the position branch is used in that case, is that right?
- Have the authors done some analysis of whether species with certain habitats/geographical ranges are better predicted?
Limitations
The authors clearly acknowledge the limitations of their work.
[2RN9-1] Motivation and use case.
The primary motivation of our work is to leverage an additional data modality, text, to improve both zero-shot and few-shot species range mapping. We observed that text data, as formatted on Wikipedia, often includes descriptions, habitat information, and range information for species. We acknowledge that the presence of detailed range descriptions can simplify, and in some cases trivialize, the process of producing a range map for a species. This is precisely why we also conducted experiments using only habitat information. While performance decreases compared to using range descriptions, we demonstrate that incorporating habitat text still offers improvements over previous methods (see Fig. 3).
As a motivating use case, we envisioned a scenario where a scientist, recently returned from the field, describes the habitat in which they observed a possibly new species and uses our method to visualize a plausible range map. In practical applications, the maps generated by our method are intended to serve as a starting point for further refinement. This initial starting point can be valuable for generating detailed maps and guiding more in-depth studies. A large percentage of species on platforms such as iNaturalist only have a small number of observations, but do have Wikipedia pages, and thus would benefit from better few-shot range estimation. Furthermore, some species are presumed extinct, have no modern observations, but still have text descriptions on Wikipedia (e.g. “New Caledonian owlet-nightjar”).
We understand that the use of habitat text could conceivably create output maps that show the “fundamental ecological niche” of the species rather than the range or the similar “realized ecological niche” of the species. In practice, as noted in the review, our model does manage to “restrict” geographically. This is not due to post-processing. Instead, our model is able to infer the range from the habitat text due to mentions of other species and of specific features of that location. For example, for the hyacinth macaw, the habitat text ends with: "these parrots are found... in dry thorn forests known as caatinga, and in palm stands or swamps, particularly the moriche palm (Mauritia flexuosa)." World knowledge encoded in the LLM about "caatinga" and "Mauritia flexuosa" allows the model to select from the many locations that fit the first line of the description, "semi-open, somewhat wooded habitats", and so correctly choose South America as the likely home of this species.
[2RN9-2] Baselines.
Model Mean is the Oracle SINR model (trained with or without environmental features) whose output for each input is the average of its outputs over all species (including or excluding the evaluation species). Model Mean +Env +Eval Sp. is therefore the Oracle SINR model that takes additional environmental features as input and is trained with observations from the evaluation species. We will clarify this in the text, sorry for the confusion!
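As a rough illustration of the averaging step only, the sketch below computes a species-agnostic prediction from per-species outputs. The function name `model_mean`, the species names, and the toy values are hypothetical. Dropping a species from the average here is a simplification, since in the baseline described above the evaluation species are included or excluded at training time.

```python
from typing import Dict, List, Optional, Set


def model_mean(per_species_preds: Dict[str, List[float]],
               exclude: Optional[Set[str]] = None) -> List[float]:
    """Average per-species predictions, location by location.

    per_species_preds maps a species name to its predicted presence
    scores at a shared list of locations (hypothetical layout).
    """
    exclude = exclude or set()
    kept = [p for name, p in per_species_preds.items() if name not in exclude]
    if not kept:
        raise ValueError("no species left to average")
    n_locations = len(kept[0])
    return [sum(p[i] for p in kept) / len(kept) for i in range(n_locations)]


# Toy predictions at three locations (values are made up for illustration).
preds = {
    "train_sp_a": [0.2, 0.8, 0.0],
    "train_sp_b": [0.4, 0.6, 0.0],
    "eval_sp": [0.9, 0.1, 0.0],
}
with_eval = model_mean(preds)                          # all species averaged
without_eval = model_mean(preds, exclude={"eval_sp"})  # eval species left out
```

The same averaged map is returned for every query species, which is what makes this a useful lower-bound baseline.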
In response to requests from other reviewers, we have included additional quantitative comparisons, including LD-SDM (Sastry et al. arXiv 2023) and the contrastive SatCLIP (Klemmer et al. arXiv 2023) in response to 9uUS, and GeoCLIP (Cepeda et al. NeurIPS 2023) and Spherical Harmonics (Rußwurm et al. ICLR 2024) in response to CYPb. See the responses to the other reviewers for results.
[2RN9-3] Results.
Our method is advantageous in both the zero-shot and the low-shot setting, where we are much better than the existing SINR. The results in the low-shot setting are particularly noteworthy, as we observe a consistent performance improvement all the way up to, and including, the 10 training observations per species setting. This is important as thousands of species on platforms such as iNaturalist have a limited number of observations.
We agree with your desire to control for the negative sampling process in our few-shot results. We provide these additional results in Fig. B1 in the rebuttal PDF. We observe the relative ordering of the different methods stays the same, and we still observe a large boost in performance from our method compared to SINR.
[2RN9-4] Error bars.
The results in Fig. 3 are averaged over hundreds of species, and as noted on L617 the standard deviation is very small. We can include them in the final revision if deemed important.
[2RN9-5] Species in evaluation set with no text descriptions.
For training species with no text descriptions, we only use the learned species tokens, i.e. the SINR approach. For evaluation species with no text description, we skip them and set their performance to 0. There are four evaluation species with no Wikipedia text.
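A minimal sketch of how this scoring convention affects the reported average, assuming a per-species score is averaged over all evaluation species. The function name and the toy scores are hypothetical.

```python
from typing import Dict, List


def mean_score_with_missing(scores: Dict[str, float],
                            all_species: List[str]) -> float:
    """Average per-species scores, counting species without a score as 0."""
    if not all_species:
        raise ValueError("all_species must not be empty")
    return sum(scores.get(sp, 0.0) for sp in all_species) / len(all_species)


# Two species have text (and hence a score); two have no text.
scores = {"sp_a": 0.6, "sp_b": 0.4}
species = ["sp_a", "sp_b", "sp_c", "sp_d"]
overall = mean_score_with_missing(scores, species)
```

Under this convention, species without text lower the average rather than being dropped from it, so the reported numbers are conservative.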
[2RN9-6] Analysis of whether certain regions are better predicted.
Interesting suggestion! Please see Fig. B2 in the rebuttal PDF for an investigation into how our performance is biased geographically. We observe that, perhaps unsurprisingly, our approach underperforms in regions with limited training data (e.g. central Africa), with a particularly high error for Lake Victoria. As there are very few training examples in this region our model has almost no understanding that this lake exists and gives similar predictions for the lake and surrounding land areas, despite the large difference in species present on land and in water.
I thank the authors for their responses to my concerns and for providing details of additional experiments and have updated my score accordingly.
Our work introduces LE-SINR, a new approach for geospatial grounding of free-form text. We apply LE-SINR to species range mapping, one of the most important problems in ecology and conservation policy. By integrating geospatial encoders with LLMs, LE-SINR achieves state-of-the-art performance on both zero-shot and few-shot species range mapping. Importantly, these findings demonstrate for the first time the potential of free-form, uncurated, text to improve species range mapping. Our models, code, and data will be released to support future work on this topic.
There were two common themes in the reviews, which we address here. Other questions are addressed in individual responses to reviewers.
Evaluation
First, reviewers requested additional comparisons and evaluation. We have included numerous new results and ablations in the table below, and have provided descriptions of each comparison in the responses to the individual reviewers. Our LE-SINR approach still obtains state-of-the-art performance in all cases.
| Method | IUCN (Habitat) | SNT (Habitat) | IUCN (Range) | SNT (Range) |
|---|---|---|---|---|
| **Ours** | | | | |
| LE-SINR pos | 0.285 | 0.510 | 0.469 | 0.607 |
| LE-SINR pos+env | 0.320 | 0.525 | 0.533 | 0.636 |
| **Ours - Oracle (Eval Data in Train Set)** | | | | |
| LE-SINR pos Oracle | 0.363 | 0.593 | 0.543 | 0.667 |
| LE-SINR pos+env Oracle | 0.385 | 0.610 | 0.598 | 0.685 |
| **Species Representations** | | | | |
| LE-SINR no species tokens E | 0.319 | 0.525 | 0.534 | 0.629 |
| **Different Backbones** | | | | |
| GeoCLIP | 0.229 | 0.489 | 0.4143 | 0.5785 |
| Spherical Harmonics | 0.309 | 0.518 | 0.528 | 0.626 |
| **Contrastive Loss** | | | | |
| SatCLIP | 0.255 | 0.521 | 0.463 | 0.630 |
| SatCLIP + random negatives | 0.306 | 0.503 | 0.523 | 0.617 |

| LD-SDM (taxonomic level) | IUCN | SNT |
|---|---|---|
| Species | 0.214 | 0.333 |
| Genus | 0.192 | 0.298 |
| Family | 0.127 | 0.246 |
| Order | 0.067 | 0.207 |
| Class | 0.047 | 0.191 |
We first consider including the evaluation species and associated observations in the training data (i.e. Oracle). As expected, this improves performance in all settings. We next train without the joint species representations. In this case, we only use the text-based encoder without the species tokens. Performance is roughly the same as with the joint representation. That said, the species tokens in the joint representation allow for direct evaluation of species in the training set.
Next, we compare against alternative backbones for the location encoder. For GeoCLIP, we initialize the location encoder with pretrained weights and finetune it. For spherical harmonics, we replace our standard “wrapped” encoding with the spherical harmonic encoding. Neither backbone leads to an improvement over our method. However, we note that our contributions are orthogonal to choices of encoder type and model architecture.
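For readers unfamiliar with the term, the sketch below shows one common form of a sinusoidal “wrapped” coordinate encoding. It assumes that longitude and latitude are first rescaled to [-1, 1] and then passed through sine and cosine. The exact feature order and scaling used in the paper are not stated in this thread, and the function name `wrap_encode` is hypothetical.

```python
import math
from typing import List


def wrap_encode(lon_deg: float, lat_deg: float) -> List[float]:
    """Encode a location so that longitude wraps around at +/-180 degrees."""
    lon = lon_deg / 180.0  # rescale longitude to [-1, 1]
    lat = lat_deg / 90.0   # rescale latitude to [-1, 1]
    return [
        math.sin(math.pi * lon),
        math.cos(math.pi * lon),
        math.sin(math.pi * lat),
        math.cos(math.pi * lat),
    ]


# The two sides of the antimeridian map to (numerically) the same features.
east = wrap_encode(180.0, 10.0)
west = wrap_encode(-180.0, 10.0)
```

The point of the wrapping is that a raw longitude input would place -180 and 180 at opposite ends of the input range even though they describe the same meridian.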
We also evaluated the SatCLIP contrastive loss instead of our standard loss. For SatCLIP, we contrast locations and species within a batch. Since there are many locations that have no observations in the dataset, we also try a modification where additional uniformly sampled locations are included as negatives. This mimics the negative sampling strategy of our loss. In both cases we still see reduced performance compared to our method.
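The sketch below illustrates only the negative-augmentation step, not the contrastive loss itself. It assumes that “uniformly sampled” means uniform by area on the sphere, which is one possible reading and is not confirmed in this thread. All function names and coordinates are hypothetical.

```python
import math
import random
from typing import List, Tuple

Location = Tuple[float, float]  # (longitude, latitude) in degrees


def sample_uniform_locations(n: int, rng: random.Random) -> List[Location]:
    """Draw n locations uniformly by area on the sphere (assumed scheme)."""
    points = []
    for _ in range(n):
        lon = rng.uniform(-180.0, 180.0)
        # asin of a uniform value in [-1, 1] gives area-uniform latitude.
        lat = math.degrees(math.asin(rng.uniform(-1.0, 1.0)))
        points.append((lon, lat))
    return points


def add_random_negatives(batch_locations: List[Location], n_extra: int,
                         rng: random.Random) -> List[Tuple[Location, bool]]:
    """Return (location, is_observed) pairs: the batch plus extra negatives."""
    labelled = [(loc, True) for loc in batch_locations]
    labelled += [(loc, False) for loc in sample_uniform_locations(n_extra, rng)]
    return labelled


rng = random.Random(0)
batch = [(-46.6, -23.5), (36.8, -1.3)]  # made-up observation coordinates
augmented = add_random_negatives(batch, n_extra=3, rng=rng)
```

The extra locations carry no observation, so in a contrastive setup they would only ever act as negatives for every species in the batch.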
Finally, we train using the taxonomic hierarchy strings of LD-SDM instead of Wikipedia text. We report zero-shot results by withholding eval species from the training data. We can see that taxonomic information alone is not enough to perform well in the zero-shot setting.
Motivation
Second, reviewers wanted additional context for how our work could be used. Species range mapping is a long-tailed problem, which means that most species have very few observations (i.e. <50). Therefore, strong few-shot algorithms are critical if we want to understand the ranges of rare or difficult to study species. The main outcome of our work is a new method that achieves state-of-the-art performance on zero-shot and few-shot species range mapping by enabling users to provide text as input to the models. This text is free form, and can therefore flexibly incorporate the user’s knowledge, whether it is attributes of the species, habitat preference, or general range descriptions. Our method can utilize exclusively this text to generate a predicted range map (zero-shot), and we can take advantage of actual observations to refine the maps (few-shot).
Reviewers asked whether the zero-shot setting is realistic. Consider presumed extinct species. We may well have range and habitat text from historic descriptions housed in museum collections, but we do not have access to the structured observation data used in standard range estimation approaches. Our method can utilize the text to generate a candidate range map. Occasionally scientists also rediscover species that were previously thought extinct. In these cases we may have text data on range and habitat preferences alongside a small amount of observation data from the newly rediscovered species. Our proposed method fills an important gap in this problem space and gives scientists a new tool in their range estimation toolbox. Another practical use of our model is that it allows users to experiment with arbitrary text input to generate plausible occurrence maps. This capability is not restricted to species range mapping but can also serve as a powerful tool for exploring where different habitats or environmental conditions might exist globally. Additionally, this fusion of language models and location encoders allows us to probe and ground the spatial representations learned by LLMs.
This paper considers the problem of predicting species ranges from observational data drawn from iNaturalist, with auxiliary information provided in textual descriptions from Wikipedia. The reviewers agree that this is an important problem and that the solution proposed is novel, interesting, and useful. Accordingly, I think the paper is a clear accept.
I will note that while the application considered is indeed important, the authors somewhat exaggerate how often the few-shot paradigm comes up. In a rebuttal the authors state that "A large percentage of species on platforms such as iNaturalist only have a small number of observations, but do have Wikipedia pages, and thus would benefit from better few-shot range estimation." This is arguably false, since most species on iNaturalist with a small number of observations are likely e.g. invertebrates with no Wikipedia pages or merely stubs with no appreciable information. There is definitely a slice of species which are known well enough to have useful textual information but not enough to have good range maps, but it shouldn't be implied that this is a majority of under-studied species and it would be worth mentioning this explicitly in the paper when discussing limitations.