PaperHub
8.3 / 10
Poster · 4 reviewers (lowest 4, highest 5, standard deviation 0.4)
Ratings: 4, 5, 4, 4
ICML 2025

Feedforward Few-shot Species Range Estimation

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

A new low-shot method for estimating the spatial range of species

Abstract

Keywords
species distribution modeling, SDM, species range modelling, spatial implicit neural representation, SINR, low-shot learning, few-shot learning

Reviews and Discussion

Review
Rating: 4

The paper introduces FS-SINR ("few-shot spatial implicit neural representations") for few-shot species range estimation, which is trained on citizen science location data. A key feature of FS-SINR is that once it has been trained, it can be used during inference to predict the range even of previously unseen species in a feed-forward way. The overall approach of FS-SINR is that during inference it takes as input a set of spatial locations (called "context locations") -- together with (optional) metadata, e.g. a textual or image description -- and then it feeds all info through a Transformer, which outputs a species encoding. To assess presence / absence, a query location x is embedded separately (using the same embedding function f as is used for the context locations), and then the species and query location embeddings are multiplied and fed to a sigmoid for binary decision making. The experimental results show that FS-SINR obtains state-of-the-art results on this task (in two different and relevant metrics), especially in the low-data regime.
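To make the feed-forward scoring concrete, here is a minimal sketch of the mechanism just described (context locations -> Transformer -> species encoding, dot product with the query-location embedding, sigmoid); the module sizes and names are placeholders of my own, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ToyFeedforwardRangeScorer(nn.Module):
    """Illustrative only: score presence at query locations for a previously unseen species."""

    def __init__(self, d_model=256):
        super().__init__()
        # f: shared location embedding used for both context and query locations
        self.loc_embed = nn.Sequential(
            nn.Linear(2, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # stand-in for the Transformer that aggregates context tokens into a species encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, context_locs, query_locs):
        # context_locs: (B, N, 2) lon/lat of sightings; query_locs: (B, M, 2)
        ctx = self.loc_embed(context_locs)                                   # (B, N, D)
        cls = self.cls_token.expand(ctx.shape[0], -1, -1)                    # (B, 1, D)
        species_emb = self.context_encoder(torch.cat([cls, ctx], 1))[:, 0]   # (B, D)
        query_emb = self.loc_embed(query_locs)                               # (B, M, D)
        logits = (query_emb * species_emb.unsqueeze(1)).sum(-1)              # dot product per query
        return torch.sigmoid(logits)                                         # presence probability

# usage: 5 context sightings and 3 query locations for a new species, no retraining
scorer = ToyFeedforwardRangeScorer()
print(scorer(torch.rand(1, 5, 2), torch.rand(1, 3, 2)).shape)  # torch.Size([1, 3])
```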

update after rebuttal

I thank the authors for their responses to mine and the other reviewers' comments. There is consensus on accepting the paper and I also keep my accept recommendation. Nice paper!

Questions for Authors

  • Do you have an idea as to why, based e.g. on Fig. 3, results are better using FS-SINR-text than FS-SINR-image+text? (Yes, I saw the discussion on text being more informative than images in the main paper, but it seems to me that the combination of text+image should be at least as good as text-only.)

  • Can you please provide some info on inference runtimes of FS-SINR vs other methods, to show that the claim on 'fraction of compute-time' holds?

Claims and Evidence

Yes, most claims are supported by clear and convincing evidence (such as the statement that FS-SINR obtains SOTA results on two benchmarks, which is clearly shown e.g. in Fig. 3; the results also appear especially reliable given that they hold across two relevant metrics, and given the strong emphasis on making the baselines as strong / comparable as possible, as detailed in Appendix A.2).

The one claim I did not find quite as fully convincing is (abstract): "..., in a fraction of the compute time, ...". I did not see much more discussion on compute time in the paper, except around L185 that reads: "Our approach is computationally efficient in that once the species embedding is generated it can then be efficiently multiplied by the embeddings for all locations of interest to generate a prediction for a species’ range." <-- That makes sense to me, but given the statement around 'fraction of compute time', I was hoping to see some actual runtime numbers, ideally in the paper, of the proposed method vs baselines.

Methods and Evaluation Criteria

Yes, the methods / evaluation criteria definitely make sense for the problem at hand.

Examples that support my assessment:

  • The evaluation is done on not only one, but two, of the common species range estimation (SRE) datasets, which makes claims e.g. about FS-SINR being a new SOTA more reliable.

  • Also, not only one but two evaluation criteria are used (mean average precision (MAP), but also distance-weighted MAP), both of which are standard for SRE (to the best of my knowledge), and in both of which FS-SINR is the best.

  • Good that many / most results (e.g. Fig 3) contain error bars over a few seeds, which further reinforces the findings.

  • Finally, several relevant baselines are used for the comparisons (and these are well-detailed in Appendix A2).

Theoretical Claims

No, theoretical claims were not made.

Experimental Design and Analyses

Yes, I did so for all experiments in the main paper. I found the experimental designs / analyses to be sound and valid. Many of the reasons for this assessment are already explained under "Methods And Evaluation Criteria", but here are some additions:

  • OBS: Before I carefully checked Figure A3 (appendix), I had written the following as an experimental design I was missing, but then it is actually covered in the appendix, so it instead furthers the soundness and validity of the experimental designs and analyses (thus disregard the stuff in quotation marks as something to address in the rebuttal -- it is something good!): "In Sec. 3.2.1, it is discussed how the flexible FS-SINR framework allows for leveraging e.g. image / text info, assuming such info is also available in training. That's great. But what I would have wanted to see somewhere (e.g. appendix) was how the amount of such metadata used during training affects inference. In other words, what happens if such metadata is used for 10% of the "training instances", 20%, ... 100% (and of course also with 0%). Right now it is as I understand it 50% of the time that text is used, and image is used 90% of the time (?). Similarly -- and also relating to the sentence "Note, we train FS-SINR such that it can use arbitrary subsets of these input tokens during inference" (right before "4. Experiments") -- it would have been nice to see how the 'extra-info-agnostic' approach currently used compares to a variant with 'always-extra-info'. One would perhaps expect that a model that always obtains the metadata would outperform one that is agnostic to the amount of metadata (?)."

  • I liked the use of qualitative analyses as well, as they furthered my intuitive understanding of FS-SINR, e.g.

    • Fig. 4 clearly shows the impact that a text input can have
    • Fig. 7 provides good intuitive insight to the effect of increasing the number of context locations in a few-shot setting.
      • A small negative, however, is that in my view it is not entirely easy to see that the caption statement "As we increase the number of context locations, the predictions become closer to the expert ranges" always holds. Perhaps it would be possible to compute some quantitative metric for each case, to see that this metric actually improves? (In particular, sometimes things seem to saturate between 5 and 10 locations.)

Supplementary Material

Yes, I checked some parts, with a particular focus on these parts (which were good in my view):

  • The results around Figure A3 (see the "OBS" comment in the previous box response).

  • Also other relevant figures in terms of quantitative results, e.g. Fig. A2. It looks good to me.

  • Also had a look at various additional qualitative results, which looked good in my view.

Relation to Broader Scientific Literature

I think the authors succeed in clearly placing this work relative to the broader scientific literature, and in particular how it improves upon / relates to the existing scientific literature on species range estimation (including that older non-DL-based approaches were included, not just modern DL-based variants).

The FS-SINR approach most clearly builds upon the SINR approach by Cole et al. (2023), and also the LE-SINR extension (Hamilton et al. 2024), but improves on LE-SINR in that FS-SINR can incorporate images (i.e. not just text, as in LE-SINR) as metadata, and also in that FS-SINR, unlike LE-SINR, does not require retraining a classifier for each new species observation added.

The authors by the way note that other approaches, here based on (Lange et al., 2023) and (Snell et al., 2017), also do not require inference time retraining -- so FS-SINR is not entirely novel in that regard -- but those methods on the other hand perform much worse, as shown in the results chapter.

Essential References Not Discussed

To the best of my limited literature knowledge, no essential references were missing.

Other Strengths and Weaknesses

STRENGTHS

  • Great that extensive Limitations and Impact Statement sections were included. These are important parts that should not be neglected, and I think they provide insightful comments.

  • Building on the above, the problem of species range estimation is clearly a highly important one. So the topic of the paper itself is very relevant given the climate and ecological crises we are in.

WEAKNESSES

  • It would have been good if the paper had looked into some form of uncertainty quantification. I would assume that is a very important thing to be aware of (i.e. a model's rough uncertainty of predictions). A simple starting point may be to train individual models and look at ensembles and individual-model deviations from ensemble predictions.

  • It was unclear to me what "FS-SINR" was an abbreviation of at first. I had to look up the SINR abbreviation by checking the Cole et al 2023 reference. Consider writing it out more clearly in the paper.

Other Comments or Suggestions

  • It may be good to state whether "species" refers to animal species or plant species (or both). I believe the latter is the focus when talking about species range estimation, but would still be good to know.

  • I think perhaps there is a typo in eq. (1), where the capital S should be a lower-case s, for consistency with the previous notation?

  • Double word "the the" at Line 349.

Author Response

We thank nqXQ for their careful reading of the paper and constructive suggestions.

[nqXQ-1] Quantification of computational efficiency.
Below we report inference timings for different models (with 1 location + text), measured as the time taken in seconds to generate all evaluation species weights for LE-SINR and FS-SINR on the CPU. We observe that FS-SINR can generate the species embedding vector in as little as 2% of the time taken by LE-SINR, which has to perform test-time optimization to train a linear classifier for each species. In addition to not requiring any training for held-out species, FS-SINR also has fewer overall parameters compared to the SINR baseline (8.2M vs. 11.9M, see L214 Col2). We will expand on these timing results in the final revised text.

Model   | Time (s)
LE-SINR | 631.3
FS-SINR | 14.3

[nqXQ-2] Results as more context locations are added to Fig 7.
The quantitative evaluation in Fig 3 shows that as more context locations are added we observe an increase in agreement with the expert-derived range maps. However, we agree with nqXQ that the results do start to plateau and additional context locations provide diminishing returns (at least when averaged across all species in the IUCN and S&T evaluation sets). As requested, we report the AP for each of the species from Fig 7 below. We observe an increase in performance when increasing the number of context locations for these species.

Context locs. | Common Kingfisher | European Robin | Black and White Warbler
1  | 0.59 | 0.70 | 0.49
2  | 0.63 | 0.72 | 0.52
5  | 0.79 | 0.74 | 0.59
10 | 0.82 | 0.78 | 0.68

[nqXQ-3] Uncertainty quantification.
This is a really interesting suggestion. Taking inspiration from Poggi et al. [a] we include a sparsification-based uncertainty evaluation where data is progressively removed based on the uncertainty estimates derived from an ensemble of three FS-SINR models (see Sec 4.1 in [a]). These results are on S&T using range text with different numbers of context locations. We report Sparsification Error AUC (SEAUC) and Area Under the Random Gain (AURG). AURG is positive and increases when more context locations are provided demonstrating that the ensemble is better at estimating its uncertainty than random chance and becomes more accurate as more context locations are provided.

Context locs. | MAP  | SEAUC | AURG
0  | 0.66 | 0.68 | 0.03
1  | 0.68 | 0.71 | 0.03
5  | 0.71 | 0.75 | 0.04
10 | 0.73 | 0.77 | 0.04
20 | 0.74 | 0.79 | 0.05
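For clarity, a simplified sketch of how such a sparsification curve and AURG can be computed from per-location errors and ensemble-derived uncertainties (this omits the oracle curve used for SEAUC and is illustrative rather than our exact evaluation code):

```python
import numpy as np

def sparsification_curve(errors, uncertainty, steps=20):
    """Mean remaining error as the most uncertain points are progressively removed."""
    order = np.argsort(-uncertainty)          # highest-uncertainty points removed first
    fracs = np.linspace(0.0, 0.95, steps)
    n = len(errors)
    curve = []
    for f in fracs:
        keep = n - int(f * n)
        curve.append(errors[order[-keep:]].mean())  # error over the points that remain
    return fracs, np.array(curve)

def aurg(errors, uncertainty):
    """Area Under the Random Gain: how much better uncertainty-guided removal is than random."""
    fracs, curve = sparsification_curve(errors, uncertainty)
    random_curve = np.full_like(curve, errors.mean())  # random removal keeps mean error flat in expectation
    gain = random_curve - curve
    dx = np.diff(fracs)
    return float(np.sum((gain[:-1] + gain[1:]) / 2 * dx))  # trapezoidal area under the gain curve

# toy check: when uncertainty correlates with error, AURG should be positive
rng = np.random.default_rng(0)
err = rng.random(2000)
unc = err + 0.1 * rng.standard_normal(2000)
print(aurg(err, unc) > 0)  # True for informative uncertainty estimates
```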

We also visualize the predicted mean and variance for the ensemble model using the species from Fig A16. We observe high variance in locations where the individual models in the ensemble differ (e.g. South America in 2nd row). The result can be found here: https://postimg.cc/VdNTgjD0

[a] Poggi, On the uncertainty of self-supervised monocular depth estimation, CVPR 2020

[nqXQ-4] Performance of text model -- with and without images.
As discussed on L294 Col1, text describing a range is inherently more informative than images. However, in Table 1 we observe that images are still a valuable supervision source when no other meta-data is available (i.e. row 10 vs 4). Adding images with text does not hurt performance on S&T (row 11 vs 9), but does result in a drop on the more challenging IUCN dataset. This same pattern is apparent in Fig 3. A potential explanation is that images provide a weaker signal and a greater opportunity to overfit to incorrect spurious features, thus negatively impacting performance. However, it is worth noting that even in the case of IUCN, FS-SINR with image and text still outperforms the recent state-of-the-art LE-SINR (see Fig 3).

[nqXQ-5] FS-SINR abbreviation.
Thanks for the suggestion. We will clarify this at the start of the paper.

[nqXQ-6] Plant or animal species?
We will clarify this. As noted on L241 Col1, we use the same training data as Cole et al. 2023 which contains observations for 47,375 species of plants and animals.

[nqXQ-7] Typos.
Thanks for flagging these two typos, we will fix them.

Review
Rating: 5

This paper outlined a new approach for few-shot species range estimation. The goal is to outline geospatial regions where an animal is likely to live based on previous observations of occurrence. The authors' approach builds upon Spatial Implicit Neural Representation (SINR) models designed to estimate species range based on location alone. Their new FS-SINR approach leverages a Transformer-based head and a novel set of 'context' locations, giving the model examples of where a new out-of-distribution organism might be found, in addition to the desired query location at inference. They benchmark their approach against the original SINR model, an active learning approach, and another SINR-based model that encodes free-text descriptions of species range sourced from the internet. They tested performance in few-shot and zero-shot situations, measuring performance using the IUCN and S&T baselines articulated in the original SINR paper. They registered marginally improved performance against the existing models. While the headline MAP numbers are comparable, their model does not require retraining for every new species under observation at test time---FS-SINR achieves that performance improvement without requiring expensive new training cycles.

Update after rebuttal

Thanks to the authors for their responses to all the reviewer comments. My assessment remains unchanged and seems in line with that of the other reviewers.

Questions for Authors

N/A

Claims and Evidence

The claims made by the authors seem sound. Their results and claims---namely that their model performs well in the few-shot case---are supported by their experiments and well-articulated in the paper.

Methods and Evaluation Criteria

  • The proposed methods seem appropriate, as does the chosen evaluation data. The authors compared against the three closest model types and a few generic baselines. They also experimented with different multimodal inputs to assess the added value of including images, free-text metadata, etc.

  • The authors should specify the spatial resolution of their model. Did they aggregate the presence data in tiles along the line of the original SINR paper? Or did they use a different strategy and/or spatial scale?

Theoretical Claims

N/A

Experimental Design and Analyses

  • At line ~260 they describe holding out species in the union of the IUCN and S&T baselines 'unless otherwise stated.' I haven't been able to spot any notes of exceptions. Are there particular instances where an animal from that union was included in training?

  • I am a little confused about the zero-shot experiments described in section 4.3 (starting on page 6). Was the experiment prompting the models with the name of an unseen species and some combination of additional metadata? Or something else? Some of that information shows up in the caption of Table 1, but it is not effectively laid out. A little clarification in the beginning of the section would be helpful.

  • In section 4.4 the authors reference the 'ecologically relevant breakdown' of their results in appendix D. Space permitting, it would really strengthen the paper to include some of that material in the main body. Part of the value of their model appears to be robustness to the domain biases and distribution shifts they articulate in that appendix. At least a paragraph summarizing the major findings from the appendix seems appropriate.

Supplementary Material

The authors included an extensive supplement with lots of example plots. I read the text, especially appendix D, but did not get to look at all the figures in detail.

Relation to Broader Scientific Literature

This vein of work is of keen interest to ecologists and conservation biologists. The results in Appendix D, as they pertain to underlying sample biases, are especially relevant given the increasing recognition that sample bias impacts our understanding of ecological patterns and processes (e.g. Hughes et al., 2021; https://doi.org/10.1111/ecog.05926).

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

On page 7, the authors should provide a direct reference to appendix D containing the 'ecologically relevant breakdown' of results.

Author Response

We thank HT4t for their helpful questions. By addressing these comments we believe that the description of the data processing and the evaluation protocol in the revised paper will be much clearer.

[HT4t-1] Spatial resolution of the model and data aggregation.
We use the same training and evaluation data as SINR (Cole et al. 2023), where the main difference is that by default the evaluation species are not included in our training data. As in SINR, FS-SINR uses continuous coordinates as input so there is no spatial scale explicitly defined for training, i.e. it is implicit. For the evaluation locations, we follow the same pre-processing steps as SINR, i.e. H3 cells at resolution five, which results in 2M total cells, each with an average area of ~250km^2. We will clarify this in the revised text.
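For illustration, a minimal sketch of this kind of binning (it assumes the h3-py v4 API, e.g. `latlng_to_cell`; v3 used `geo_to_h3` instead, and this is not the exact SINR pre-processing code):

```python
# Illustrative binning of point observations into H3 resolution-5 cells.
from collections import Counter

import h3  # pip install h3

observations = [          # hypothetical (lat, lon) sightings for one species
    (51.5074, -0.1278),   # London
    (51.5080, -0.1200),   # nearby sighting, likely the same cell
    (48.8566, 2.3522),    # Paris
]

cell_counts = Counter(h3.latlng_to_cell(lat, lon, 5) for lat, lon in observations)
print(cell_counts)  # sightings aggregated per resolution-5 cell (~250 km^2 each)
```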

[HT4t-2] Held out data comment on L260.
In Table 1 there are results for models that also train on the evaluation species (i.e. TST), but otherwise by default these evaluation species are held out of the training data.

[HT4t-3] Zero-shot experimental protocol.
We follow the same zero-shot evaluation protocol as LE-SINR (Hamilton et al. 2024). Specifically, for FS-SINR we either provide habitat (HT), range text (RT), or an image (I) for each evaluation species as input. Additionally, we also can evaluate our model when no meta-data is provided as we simply use the output of the class token as the species embedding vector (see Fig 2). In this case, the class token will be the same for all species. With the exception of the taxonomic rank text (TRT) variant which is inspired by the hierarchical species name encoding from Sastry et al., no specific species name text is provided to any of our models. We will update the relevant text to make this clearer.

[HT4t-4] Summarize ecological findings.
On L50 Col2 we note that a significant proportion of species only have a small number of location observations available. This indicates that, beyond common charismatic species, the task of species range estimation is a few-shot learning problem. We believe that the primary impact of our work relevant to ecology researchers will be better species range estimates in this few-shot regime compared to existing work (e.g. ~10% MAP compared to SINR in Fig 3 given only 10 observations). In Appendix D we also provide additional analysis that we believe will be of interest to ecologists, e.g. the spatial distribution of errors in Fig. A22, performance by continent in Fig. A23, performance versus range size in Fig. A25, and performance versus taxonomic group in Fig. A26. We will update the text to make these observations more clear and to point to relevant results in the Appendix. Thanks for the suggestion.

[HT4t-5] Hughes et al., 2021.
Thanks for flagging this paper. We will reference it in Sec 5 as it is an excellent reference related to the discussion of biases in natural world data.

Review
Rating: 4

The authors work within the problem setting of species-range estimation, where, given a latitude--longitude pair and a target species, the task is to determine the probability of being able to find that species at that location. Motivated by the large number of species for which only sparse sightings have been recorded, the authors propose a method for few-shot species-range estimation. Their formulation enables them to generalize to new species at inference time without retraining, and to condition on non-location data to improve coverage.

update after rebuttal

The authors' willingness to address outstanding issues is appreciated. Following their rebuttal commitment to add experiments, improve contextualization of results, and update the title, the recommendation is updated from Weak Accept (3) to Accept (4).

The marginal quantitative improvement over prior work and the unresolved issues with image-conditioning performance prevent a higher recommendation (Strong Accept), but the paper is solid, well written, and relevant to the ICML audience, and should be accepted.


The plot image provided in the Reply Rebuttal Comment is also interesting. The authors should consider examining instances where image conditioning is most quantitatively harmful. Are the images consistently 1) of non-animal "evidence of an organism" or do they 2) closely resemble (either in human or model perceptual space) species observed during training that have distinct ranges?

Questions for Authors

The authors are encouraged to in particular engage with Weakness #2.

Claims and Evidence

Yes, the claims are supported by the presented evidence.

Methods and Evaluation Criteria

Yes, the benchmark is appropriate for the chosen task.

Theoretical Claims

N/A. The authors do not make claims that might necessitate a formal proof.

Experimental Design and Analyses

Yes, the experimental design appears sound.

Supplementary Material

Yes, I reviewed the supplementary material in its entirety, though sections of the Appendix were skimmed.

Relation to Broader Scientific Literature

The work is not the first to evaluate few-shot generalization in species-range estimation, which makes the title somewhat confusing, but the authors generally contextualize their work fairly.

Essential References Not Discussed

No, the works referenced appear fairly comprehensive.

Other Strengths and Weaknesses

Strengths

  1. The paper is well-written and was easy to follow.
  2. The authors' extensive Appendix ablations are appreciated.
  3. The task is interesting and well-motivated by data scarcity.

Weaknesses

  1. Performance improvements over the LE-SINR (Hamilton et al., 2024) text-conditioned baseline in both the few-shot and zero-shot evaluations appear somewhat marginal. The proposed approach does outperform it, and the modeling decisions enable simpler generalization to new species, but this may limit the contribution. The characterization of the few-shot performance as "impressive" on L360 should either be better contextualized or removed.
  2. The addition of image conditioning appears on average quantitatively harmful, but the authors do not seem to acknowledge this. The result is unintuitive and is deserving of scrutiny. Questions include: 1) Does the issue persist with different visual encoders? 2) Are the same results observed when the model is trained without any text conditioning? 3) What happens if the model is instead trained on text embeddings of captions automatically extracted from the images (using another model)?

Other Comments or Suggestions

  1. The title does not seem to clearly reflect the contribution of the paper, as the authors do not claim to introduce the few-shot setting, which was also evaluated in Hamilton et al. (2024). The authors should consider updating it to highlight what makes the work unique.
  2. The $t$ in $\mathcal{C}^t$ doesn't appear to be defined until Appendix 2.4. Consider mentioning it after first usage.
  3. It appears an error to describe the model as "invariant to the number . . . of the context locations," which seems should necessarily affect the output.
  4. The authors should additionally report model performance in the non-few-shot setting.
  5. Can the model make use of multiple images simultaneously? The lack of numbering on $t_j$ and $a_j$ on L183 lends the impression no.
  6. The authors should use \citet for citations that are referred to as sentence objects.
  7. The "Prototype" baseline is referenced on L245, pages before it is defined.
  8. The authors should additionally define "MAP" on L237 as the figure caption is likely to be read before L265.
  9. The "We" referenced at the beginning of L436 is confusing. It should be clarified that it does not refer exclusively to the paper authors like it is used to a couple sentences later.
  10. The bibliography is very well formatted.
  11. The figures in the supplementary materials should be moved around so that they are located near where they are referenced. They're now often separated by multiple pages.

L002 (and throughout): Few-shot -> Few-Shot

L154: Give -> Given

L209: (e.g., images) -> (i.e., images)

L238: very low-setting setting -> ??

L258: currently best -> best currently

L267: Baslines -> Baselines

L340: (ours) -> (Ours)

L582: "heads". -> "heads."

Ethics Review Concerns

N/A

Author Response

We thank JzC8 for their helpful suggestions.

[JzC8-1] Performance compared to LE-SINR.
We outperform the recent LE-SINR in both the few-shot (Fig 3) and zero-shot (Table 1) settings even though we do not require any training on the evaluation species as in LE-SINR, thus making us much faster (see [nqXQ-1]). These differences can be as large as 4% MAP (e.g. row 8 vs 9 in Table 1). We will update the text on L360 Col2 to better characterize the performance improvement in a more measured way.

[JzC8-2] Value of adding images.
We discuss the image results on L294 Col1 where we say: “Perhaps unsurprisingly, in general we observe that image information is not as informative as text. This can be explained by the fact that a range text description provides much more context than an image of a previously unseen species”. Models trained on image data and no text (row 10) perform substantially better than models trained with no images or text (row 4) on both datasets (see Table 1). This shows that images are a valuable source of supervision when no other signal is available. In practice, detailed text describing the ranges of most species does not exist, but we may have images. However, even weaker text (e.g. the habitat description in row 6) is better than using images, and combining images and text at best results in the same performance (row 11 vs 9 for S&T) or slightly degrades it (row 11 vs 9 for IUCN). We will update this discussion in the revised text.

[JzC8-3] Using a different visual encoder.
As suggested, we conducted additional experiments using a different visual encoder -- a DINOv2 backbone. Perhaps unsurprisingly, this model performs worse than the features we obtain from the iNaturalist-trained EVA-02 ViT (L205 Col2) used in the paper. We will add these new results to the revised paper.

Context locs. | IUCN EVA | IUCN DINO | S&T EVA | S&T DINO
0  | 0.19 | 0.13 | 0.38 | 0.28
1  | 0.45 | 0.40 | 0.49 | 0.44
5  | 0.62 | 0.56 | 0.66 | 0.63
10 | 0.65 | 0.60 | 0.70 | 0.67
20 | 0.66 | 0.61 | 0.71 | 0.68

[JzC8-4] Results without using any text information.
Below we also report results for a model trained using only images and no text. During training the model sees between 1 to 5 images per species. Again this performs worse than when using text information, but providing images at inference helps. As noted above, text supervision is simply much more informative than images. However, using image data is still helpful when no other data is available (e.g. row 10 vs 4 in Table 1), and for most species we do not have text describing their ranges, but we can more easily obtain images.

Eval. images | IUCN | S&T
1 | 0.17 | 0.35
2 | 0.19 | 0.38
3 | 0.20 | 0.39
4 | 0.21 | 0.39
5 | 0.21 | 0.40

[JzC8-5] Results using automatically generated text captions.
This is an interesting suggestion. Current vision-language models with open-weights (e.g. BLIP2) are not yet capable of generating detailed descriptions for fine-grained species images, and instead provide relatively coarse captions, e.g. “a horse standing in a field”. We generated text captions for our images using BLIP2 and evaluated an already trained FS-SINR on these captions. Here we used both no prompt and two different prompts to produce captions. “No prompt” produced captions like "a small bird is perched on a branch of a tree with flowers on it", “What species is this?” produced captions like "rufous-bellied hummingbird" and “Where is this?” produced captions like "the savannas of south africa". Results on IUCN below show that in all cases captions are worse than using the original images they are generated from.

Context locs. | Image | No prompt | "What species is this?" | "Where is this?"
0  | 0.19 | 0.07 | 0.09 | 0.11
1  | 0.45 | 0.27 | 0.30 | 0.24
5  | 0.62 | 0.53 | 0.56 | 0.52
10 | 0.65 | 0.60 | 0.62 | 0.58
20 | 0.66 | 0.64 | 0.65 | 0.62
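For reference, a minimal sketch of this kind of prompted captioning using the Hugging Face BLIP-2 interface; the specific checkpoint and generation settings below are assumptions, as only “BLIP2” is specified above:

```python
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# assumed checkpoint; the rebuttal only specifies "BLIP2"
name = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(name)

image = Image.open("species_photo.jpg")  # hypothetical species image
prompts = [None, "Question: What species is this? Answer:", "Question: Where is this? Answer:"]
for prompt in prompts:
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    print(prompt, "->", processor.decode(out[0], skip_special_tokens=True).strip())
```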

[JzC8-6] Can the model make use of multiple images simultaneously.
Yes, this is possible. Conditioning on one image at inference time, as in the paper, results in an MAP of 0.19 on IUCN (row 10 Table 1). If we instead use four images at inference, we obtain an MAP of 0.21. Thanks for the interesting suggestion.

[JzC8-7] Results in the non few-shot setting.
As requested, we conducted additional experiments beyond the 50 context locations used in the paper. Interestingly, even though this FS-SINR model was never trained on more than 20 context locations, adding more locations at inference time does not degrade performance. In the table below, shown for IUCN, we observe that performance saturates at around 50 context locations and the evaluated models gain a very small boost going from 20 to 1000 locations.

Context locs.   | 20   | 50   | 500  | 1000
SINR (No Text)  | 0.61 | 0.64 | 0.65 | 0.65
LE-SINR (Text)  | 0.64 | 0.66 | 0.67 | 0.67
FS-SINR (Text)  | 0.67 | 0.68 | 0.68 | 0.68

[JzC8-8] Additional suggestions, minor comments, reordering images, and typos.
Thanks for flagging these. We will address them in the revised text.

Reviewer Comment

The authors' additional evaluation is appreciated, though the primary concern of image-conditioning performance remains unanswered.

[JzC8-1]

Yes, as stated in the original review, it is clear that the proposed method outperforms LE-SINR. That said, it does not seem appropriate to characterize differences "as large as 4% MAP" as "impressive."

These differences are much too marginal to name the work "Few-Shot Species Range Estimation." As the authors acknowledge, they do not introduce the setting (there is even a Related-Work section bearing the same name). As such, it seems like an implicit misrepresentation---a title is not exactly a claim, but should be representative of a work. If a paper were named "Object Detection," you'd expect it to either introduce the task or be the grand paper that solves the problem; this work is neither.

Instead, the real contribution appears to be in that the method is feedforward and that it can be conditioned on a set of context locations during inference. As such, the paper seems much more aptly named "Feedforward Few-Shot Species Range Estimation" or "Contextualized Few-Shot Species Range Estimation." The authors are strongly suggested to consider updating the title to make it representative of their contribution.

[JzC8-2]

It is a plausible conclusion that "image information is not as informative as text." What does not make sense, however, is that adding images is on average quantitatively harmful. It is unclear why the authors appear unwilling to address this. Copied below are two results tables with added columns to highlight the effect of adding images to both the text and no-text FS-SINR variants. Negative differences indicate that adding images harmed performance.

Table A3 (IUCN):

# Context | Image | NoText\Image | Image − NoText\Image | Text+Image | Text | Text+Image − Text
0  | 0.19 | 0.05 | +0.14 | 0.46 | 0.52 | -0.06
1  | 0.45 | 0.48 | -0.03 | 0.55 | 0.57 | -0.02
2  | 0.54 | 0.56 | -0.02 | 0.59 | 0.60 | -0.01
3  | 0.58 | 0.60 | -0.02 | 0.61 | 0.62 | -0.01
4  | 0.60 | 0.62 | -0.02 | 0.62 | 0.63 | -0.01
5  | 0.62 | 0.63 | -0.01 | 0.63 | 0.64 | -0.01
8  | 0.64 | 0.65 | -0.01 | 0.64 | 0.65 | -0.01
10 | 0.65 | 0.66 | -0.01 | 0.65 | 0.66 | -0.01
15 | 0.66 | 0.67 | -0.01 | 0.66 | 0.67 | -0.01
20 | 0.66 | 0.67 | -0.01 | 0.66 | 0.67 | -0.01
50 | 0.67 | 0.67 |  0.00 | 0.67 | 0.68 | -0.01

Table A4 (S&T):

# Context | Image | NoText\Image | Image − NoText\Image | Text+Image | Text | Text+Image − Text
0  | 0.38 | 0.18 | +0.20 | 0.64 | 0.64 |  0.00
1  | 0.49 | 0.50 | -0.01 | 0.66 | 0.66 |  0.00
2  | 0.57 | 0.58 | -0.01 | 0.67 | 0.67 |  0.00
3  | 0.61 | 0.61 |  0.00 | 0.68 | 0.68 |  0.00
4  | 0.64 | 0.64 |  0.00 | 0.69 | 0.69 |  0.00
5  | 0.66 | 0.65 | +0.01 | 0.70 | 0.70 |  0.00
8  | 0.69 | 0.68 | +0.01 | 0.71 | 0.71 |  0.00
10 | 0.70 | 0.69 | +0.01 | 0.71 | 0.72 | -0.01
15 | 0.71 | 0.70 | +0.01 | 0.72 | 0.72 |  0.00
20 | 0.71 | 0.71 |  0.00 | 0.72 | 0.72 |  0.00
50 | 0.72 | 0.71 | +0.01 | 0.73 | 0.73 |  0.00

Image conditioning improves zero-shot performance but not consistently in any few-shot scenario, even in the absence of text. Why is this? All image-conditioning evaluations should be moved to the related-work section if not core to the work.

[JzC8-3]

The visual-encoder evaluation is appreciated. From this (where performance becomes even worse relative to the NoText\Image model), it appears that the model is most likely being overfit to the image embeddings, harming generalization. The authors should consider evaluating after a single epoch (adjusting hyperparameters).

[JzC8-4]

The image-only evaluation is appreciated. The authors should not, however, make claims such as the following without making it explicit they are referring solely to the zero-shot setting:

However, using image data is still helpful when no other data is available (e.g. row 10 vs 4 in Table 1), and for most species we do not have text describing their ranges, but we can more easily obtain images.

Is this setting ecologically meaningful? In what practical setting would an ecologist have an image but 1) no idea where the image was taken and 2) no ability to describe the image in text? It is reasonable to expect that the zero-shot task may become valuable in the future, but it seems now too toy to be the sole justification for including image conditioning in the main paper, when it is otherwise harmful.

[JzC8-5]

The experiment is interesting. Human-written captions should be considered for a future evaluation.

[JzC8-6]

This seems a duplicate of [JzC8-4]; the evaluation is appreciated.

[JzC8-7]

This evaluation, and the observation that performance saturates at 50 samples, are interesting and the authors should consider including them in the paper.

Author Comment

We thank JzC8 for carefully reading our rebuttal and engaging in the discussion. We provide additional responses below.

[JzC8-1]
We are very happy to update the title of the paper to better reflect our main contribution, e.g. “Feedforward Few-Shot Species Range Estimation”. We agree that other works, discussed in our related work section, have already performed evaluation on the problem of few-shot species range estimation. However, the two most relevant papers have either explored active learning using more difficult to obtain “presence-absence” data (Lange et al. NeurIPS 2023) or focused on demonstrating the impact of text more generally (Hamilton et al. NeurIPS 2024). Our paper compares to these works, in addition to previously untested baselines, in a like-for-like way. However, we agree that it is very important that readers do not get the wrong impression from our title. As suggested, we will also update the text to characterize the performance improvement from our approach using more measured language. Thanks for these suggestions.

[JzC8-2]
We commented on the limitations, potential reasons for overfitting, and worse results of adding image supervision in response [nqXQ-4]. Apologies, as we should have linked to this in our initial response to this question. We acknowledge that images are not very helpful overall and will ensure that this point is clear in the text, i.e. we will expand the discussion on L294 Col1 and L373 Col1 where we currently discuss that images are not as informative as text.

To better understand the value of image information, we performed an additional analysis where we compared per-species performance for a model with text and either with or without images. Results can be found here: https://postimg.cc/NyqzzJ8m Values greater than zero indicate that the model with images performs better for a given species, and below zero indicates it performs worse. We observe that while both models have the same overall average, the individual per-species performance differs. This can be attributed to the fact that images help for some species, but actually hurt for others. As noted by JzC8, in general, images do not help when we evaluate different model variants and datasets.

We do not think the quantitative image results detract from our main contribution. There are some interesting results that we believe may be of interest to researchers in this space. For example, Fig 5 illustrates plausible guesses for some images, including non-species images. The “Blue Duck” in Fig A21, illustrates one of the issues with images. Our images are sourced from iNaturalist, the current best large-scale dataset for species classification, where the criteria for inclusion is that an image must contain “evidence of an organism”. The example in Fig A21 only contains footprints and a human hand. While our model manages to localise predictions to some coastal regions, this information may actually harm performance when more informative location data is also provided, as the model may struggle to ignore the image token entirely.

[JzC8-3]
We agree that overfitting is another possible cause of the slightly lower performance with images. We noted this in our original response [nqXQ-4] where we said: “The potential explanation here is that images provide sufficiently weaker signal, and greater opportunity to overfit to incorrect spurious features, thus negatively impacting performance“. Training the image encoder for a single epoch is an interesting suggestion, we will explore this further for the final version of the paper.

[JzC8-4]
As suggested, we will clarify that this statement only holds in the zero-shot setting. More generally, we will also update the text to put the image results into context. Regarding the real-world validity of the zero-shot setting, there may be rare cases where a specimen exists but no location information is available (e.g. old museum specimens). However, we agree that this would not be common. The main purpose of these results is to demonstrate that other forms of meta-data, beyond text, are applicable with our model. There is growing interest in learning joint embedding spaces for different modalities in ecological applications (e.g., Sastry et al. 2025), and future extensions of our work could make use of other forms of data which may be more ecologically relevant such as confirmed absences of species, environmental conditions, satellite imagery, or genetic information.

[JzC8-5]
We agree, these results are interesting. They point to some of the limitations of current captioning models.

[JzC8-7]
Thank you, we will add this to the paper.

Review
Rating: 4

This paper introduces FS-SINR, a novel Transformer-based approach for few-shot species range estimation that can predict ranges for previously unseen species without requiring retraining. The model architecture combines a location encoder for processing geographic coordinates, a frozen GritLM text encoder for species descriptions, a frozen EVA-02 ViT image encoder for species images, a Transformer encoder that processes combined input tokens, and a species decoder that generates final range predictions. The model uses learned token type embeddings to handle different input modalities and is trained using a modified version of the SINR loss function. On the IUCN and S&T benchmark datasets, FS-SINR achieves state-of-the-art performance, particularly in low-data scenarios (1-10 observations), and can make effective predictions even with a single context location. The model can also generate zero-shot predictions using only text or image inputs, with performance improving when multiple types of context information are combined. Unlike previous approaches, it requires no retraining for new species. The authors note several limitations, including that predictions are deterministic rather than probabilistic, performance depends on the quality and availability of text/image metadata, the approach is subject to biases in training data distribution, and currently only handles presence data rather than confirmed absences. The paper validates these claims through extensive experiments and ablation studies comparing different model components and training strategies.

update after rebuttal

I thank the authors for their responses. As there is consensus on accepting the paper, I will keep my accept recommendation. Good work!

Questions for Authors

Included my questions above; none are critical to the decision.

Claims and Evidence

Most claims are well supported. There are certain claims that lack substantial support. Examples:

  • "During training, we supply FS-SINR with 20 context locations per training example, though we find that the model performance is very robust to the number of context locations provided during training."
  • There is no quantification of the computational efficiency claimed throughout the paper
  • The paper does not explain how the insights from few-shot species prediction will have downstream impact on biodiversity analysis
  • The initial part of the paper makes claims about support for images, but the experiments reveal limited success. It is unclear if images can be used in practical settings.
  • It is unclear how bias in the training data from North America and Europe affects predictions globally.

Methods and Evaluation Criteria

Overall, method and evaluation make sense.

I did not understand why the second term in the loss function is needed. It is possible that some species co-exist with each other. Wouldn't that loss term discourage learning that behavior?

It is unclear how the model adapts to new species which are out of distribution to the training data.

Theoretical Claims

No theoretical claims

Experimental Design and Analyses

Overall experiment design is sound. I found the zero-shot results difficult to follow, can be presented better. The figures also are unreadable in black and white, can be improved.

It is unclear why SINR outperforms the proposed method when the species data is in the training set. Why doesn't the proposed method performance scale with the number of observations?

Given the loss function used, I would like to see how well the model captures species co-occurrence.

It is unclear why only precision is used as a metric, and recall is ignored.

Supplementary Material

Did not review

Relation to Broader Scientific Literature

The paper's contributions build upon and advance several lines of prior work in species range estimation and machine learning. In terms of few-shot learning for range estimation, it improves upon traditional methods like SINR (Cole et al., 2023) which require model retraining for new species, and LE-SINR (Hamilton et al., 2024) which introduced text-based range estimation but still needs retraining. It's the first method to enable feed-forward range prediction for new species without retraining, showing better performance in low-data scenarios (1-10 samples) than Active SINR (Lange et al., 2023).

While previous works have explored using different data types separately - SINR with only locations, LE-SINR with text, Dollinger et al. (2024) with satellite imagery, and Teng et al. (2023) with species images - this paper provides a unified framework that can flexibly combine all these modalities. In terms of architectural innovation, it introduces transformers to species range estimation, building on recent work using attention mechanisms for geographic tasks (Russwurm et al., 2024), whereas most previous methods relied on MLPs or CNNs.

Essential References Not Discussed

It is unclear how the proposed method compares against traditional Bayesian approaches, such as Golding et al. (2016)

[1] Golding, N. and Purse, B.V., 2016. Fast and flexible Bayesian species distribution modelling using Gaussian processes. Methods in Ecology and Evolution, 7(5), pp.598-608.

Other Strengths and Weaknesses

Covered everything above

Other Comments or Suggestions

The symbols used in Section 3.1 can be simplified

Author Response

We thank QZrb for their constructive comments.

[QZrb-1] Robustness to number of training locations.
An ablation of the number of training context locations is provided in Fig. A2. We will update L250 Col1 to more clearly point to this result. As can be seen, fewer context locations (i.e. 5) performs worse, but the difference between 20 and 50 is small.

[QZrb-2] Quantifying efficiency.
Please see response [nqXQ-1].

[QZrb-3] Downstream biodiversity analysis.
Please see response [HT4t-4].

[QZrb-4] Value of images.
Please see response [JzC8-2].

[QZrb-5] Data bias.
As noted on L422 Col1, there are spatial biases in the training data we use, which has more data from North America (NA) and Europe (EU) (Fig. A24). The IUCN evaluation data is not as biased, and from the results in Fig. A23 we can see that we obtain better performance for species in NA and EU, but performance in the less sampled South America is still strong. Addressing training biases is an important question which we leave for future work.

[QZrb-6] Co-occurrence and second term in eqn 2.
This loss is borrowed from prior work (Cole et al. 2023). Without the second term in eqn. 2, the loss is trivially optimized by predicting that every species is present in every location (see Sec 3.2 in Cole et al. 2023). It is true that the second term penalizes the model for predicting that multiple species are present at the same location. However, the large value we use for $\lambda$ (see L596) means that this penalty is much weaker than the one for failing to predict that an observed species is present. Thus, if two different species occur near the same location, the model is encouraged over different batches to predict that they are both present.
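To make the role of $\lambda$ concrete, here is a schematic version of this style of presence-only loss (a simplification in the spirit of eqn. 2 / Cole et al. 2023, with assumed tensor shapes, not our exact training code):

```python
import torch

def assume_negative_loss(preds, preds_rand, species_idx, lam=2048.0):
    """Schematic presence-only loss with pseudo-negatives.

    preds:       (B, S) predicted presence probabilities at the observed locations
    preds_rand:  (B, S) predicted probabilities at randomly sampled locations
    species_idx: (B,)   index of the species actually observed at each location
    lam:         weight on the positive term; a large value makes the "observed
                 species must be present" term dominate the pseudo-negative penalty
    """
    B, S = preds.shape
    eps = 1e-5
    target = torch.zeros_like(preds)
    target[torch.arange(B), species_idx] = 1.0

    pos = lam * target * torch.log(preds + eps)        # observed species should be predicted present
    neg = (1 - target) * torch.log(1 - preds + eps)    # other species assumed absent at this location
    neg_rand = torch.log(1 - preds_rand + eps)         # all species assumed absent at random locations
    return -(pos + neg + neg_rand).mean()

# toy usage with 4 observations and 10 species
loss = assume_negative_loss(torch.rand(4, 10), torch.rand(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```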

[QZrb-7] Performance on out of distribution species.
To clarify, by default no data for any of the species in the evaluation datasets is observed during training (L261 Col1). By training on tens of thousands of species, FS-SINR can generalize to previously unseen species at test time. The results in Fig. A23 provide an indication of performance on less common regions. Evaluating truly out of distribution species, i.e. ones that bear no relationship to the training data, is an interesting question but would likely require new evaluation datasets. We opted to use the standard IUCN/S&T datasets so that we could compare fairly to existing work.

[QZrb-8] Performance when training on evaluation species.
In Table 1 we compare to SINR when data from the evaluation species are observed during training (i.e. TST in row 1 vs 3). It is not surprising that SINR performs better as it learns a unique embedding vector for each species (even evaluation ones), whereas FS-SINR must learn the mapping from a small number of locations to a species’ range. These results could be considered as positive as they demonstrate that FS-SINR is not overfitting by simply memorizing the text for each species. Text supervision is very sparse compared to the informative location observations used by SINR to learn its per-species encoding.

[QZrb-9] Performance wrt number of observations.
In Fig 3 we observe that for nearly all methods tested, performance improves as more observations are provided. The largest improvements are observed when going from few (e.g. 10) to many observations but begins to plateau as the number approaches 50. This is consistent with results from existing work where we see more data provides diminishing returns (e.g. Cole et al.).

[QZrb-10] Evaluation metric - recall?
We report performance using both mean average precision (MAP) (Fig 3) and a distance weighted variant of it (Fig A27). MAP is the standard metric from existing work (e.g. Cole et al. and Hamilton et al.). As a reminder, average precision is the area under the precision-recall curve, which is based on precision and recall across a range of thresholds.
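As a small illustration of the metric (with purely synthetic labels and scores), per-species AP is the area under that species' precision-recall curve, and MAP is its mean over species:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
# per-species binary presence labels over evaluation cells, plus predicted scores
y_true = [rng.integers(0, 2, 500) for _ in range(3)]
y_score = [t * 0.6 + 0.4 * rng.random(500) for t in y_true]  # scores loosely correlated with labels

ap_per_species = [average_precision_score(t, s) for t, s in zip(y_true, y_score)]
print("MAP:", float(np.mean(ap_per_species)))  # mean over species of the area under each PR curve
```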

[QZrb-11] Comparison to other approaches.
The Gaussian Process (GP) approach in Golding et al. is designed for presence-absence data, but we can adapt it to our presence-only setting using pseudo-negatives. We train a GP classifier using an RBF kernel and a logit link function, as well as a Random Forest (RF) classifier. Both are implemented in sklearn using the raw coordinates as input, with a separate classifier trained per species. We find that they do not perform as well as FS-SINR, which is trained jointly on multiple species. This is especially noticeable on the more challenging IUCN dataset. The results can be found here: https://postimg.cc/0zxWzDFB
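To make the adapted baseline concrete, a minimal per-species sketch along these lines (the uniform pseudo-negative sampling, synthetic data, and hyperparameters here are assumptions for illustration, not the exact configuration used):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# presence-only data for one species: observed (lon, lat) coordinates (synthetic)
presence = rng.normal(loc=[10.0, 50.0], scale=2.0, size=(30, 2))
# pseudo-negatives: background locations sampled uniformly (assumed sampling scheme)
pseudo_neg = np.column_stack([rng.uniform(-180, 180, 300), rng.uniform(-90, 90, 300)])

X = np.vstack([presence, pseudo_neg])
y = np.concatenate([np.ones(len(presence)), np.zeros(len(pseudo_neg))])

gp = GaussianProcessClassifier(kernel=RBF(length_scale=5.0)).fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

query = np.array([[10.0, 50.0], [-70.0, -10.0]])  # inside the presence cluster vs. far away
print("GP presence prob:", gp.predict_proba(query)[:, 1])
print("RF presence prob:", rf.predict_proba(query)[:, 1])
```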

[QZrb-12] Misc.
We will improve the readability of the figures and simplify the notation.

Final Decision

All reviewers agree to accept this paper. Please include all the necessary changes suggested by the reviewers, e.g., add experiments, improve contextualization of results, and update the title, in the final version.