Active Measurement: Efficient Estimation at Scale
We present active measurement, a human-in-the-loop AI framework that combines AI predictions with importance sampling, model adaptation, and human labeling to make accurate scientific measurements.
Abstract
Reviews and Discussion
This paper introduces an "active measurement" strategy, which aims to provide an iterative framework for improving a measurement task by refining the sampling strategy and efficiently involving humans in the loop.
Strengths and Weaknesses
- This paper proposes an interesting scheme referred to as "active measurement", building on and borrowing inspiration from areas such as active testing, adaptive importance sampling (AIS), and prediction-powered inference. The discussion the authors provide in Sec. 3 "Related Work" is interesting but too brief to give readers adequate and useful context for the proposed active measurement scheme. A more comprehensive and in-depth discussion is needed to motivate and position the current study more clearly.
- A major concern regarding the current study is that the evaluations are very limited. Currently, the authors consider only two counting tasks: (i) counting birds in two high-resolution images; and (ii) counting roosting birds in weather radar images. There are many scientific datasets that the authors could utilize to strengthen their performance assessment. For example, Cryo-EM particle picking and object counting from satellite images would provide good test cases.
- The study would benefit from a comparison with active-learning-based strategies, especially uncertainty-based ones. Conceptually, there are many similarities, and it would be beneficial to discuss the similarities and differences, and what specific advantages the active measurement strategy might offer. Furthermore, a performance comparison with widely used active learning schemes in Figures 3-6 would be helpful.
- In the evaluation results (e.g., Fig. 3), it would be helpful to show the performance in terms of the # of labeled units, and not only the fraction of labeled units.
- DISCount is an interesting and recent baseline, but to show the general applicability of the proposed active measurement scheme, the authors are strongly encouraged to consider additional baselines. It would be important to see whether one can gain similar consistent improvements across different schemes.
Questions
Please see "Strengths and Weaknesses."
Limitations
Yes, the authors briefly discuss the limitations of the current work in the Conclusion. However, the paper does not explicitly discuss potential negative societal impacts in the same section, even though the authors address them in the submission questionnaire.
Final Justification
Please note that the overall rating has been increased after reviewing the authors' responses to the questions and suggestions in the original review, as well as the additional performance evaluation results provided in the rebuttal.
Formatting Issues
N/A
Related Work Depth
“Related Work discussion is too brief to motivate/position the study clearly.”
As another reviewer also suggests, we will expand our related work section to include more discussion of counting and active learning in the literature.
Comparison with Active Learning Strategies
“Comparison with uncertainty-based active learning would be beneficial.”
We include a related comparison in Figure 6 to active testing, where the sampling strategy is proportional to the loss—a commonly used active learning heuristic. In this case, we do not observe much improvement in fine-tuning performance, and the mismatch with the estimation objective results in a noticeable drop in overall performance. We will also incorporate sampling based on other active learning criteria, such as uncertainty and expected model change, in the revision. Thank you for the suggestion.
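As a rough illustration of the difference between these selection criteria, here is a minimal Python sketch (not our actual implementation; the function names and example values are placeholders) contrasting a count-proportional proposal, as used in detector-based importance sampling, with a loss-proportional proposal of the kind used in active testing:

```python
import numpy as np

def count_proportional_proposal(pred_counts, eps=1e-6):
    # Proposal q_i proportional to the detector's predicted count for unit i
    w = np.asarray(pred_counts, dtype=float) + eps
    return w / w.sum()

def loss_proportional_proposal(surrogate_losses, eps=1e-6):
    # Active-testing-style proposal q_i proportional to an estimated per-unit loss
    w = np.asarray(surrogate_losses, dtype=float) + eps
    return w / w.sum()

# The two heuristics can prioritize very different units
print(count_proportional_proposal([12.0, 0.5, 3.2, 7.9]))
print(loss_proportional_proposal([0.1, 2.3, 0.4, 0.8]))
```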
Evaluation Metrics
“In Fig. 3, show # of labeled units, not only fraction.”
Thanks for this feedback; we will update the plot to make the # of labeled units more obvious in the final version.
Additional Baselines
“Encouraged to consider additional baselines beyond DISCount.”
For the counting problem, we tested two additional image-counting baselines: one motivated by prediction-powered inference (PPI) and the other by active testing. Please see Figure 6 in the main text and Figure A8 in the supplemental material. As mentioned above, we plan to include comparisons with different active learning heuristics for selection in the revision.
Societal Impacts
“The paper does not explicitly discuss potential negative societal impacts”
Thank you for this observation. We briefly touch on this in Section 8, noting that our method is problem-agnostic and aims to improve the accuracy of existing measurement workflows. That said, we agree it would be helpful to more clearly address potential societal impacts, and we will ensure this is clear in the next version.
Limited Evaluation Scope
“Only two counting tasks; consider using Cryo-EM, satellite images.”
We thank the reviewers for their suggestions of additional domains in which to apply our work. In response, we performed experiments on two additional datasets: malaria-infected cell counting and damaged building counting from satellite images. Specifically, we used image set BBBC041v1, available from the Broad Bioimage Benchmark Collection [1], and satellite images of the Palu Tsunami from the xBD dataset [2].
Fractional Error Rate On Malaria Cell Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.1946 | 0.1878 | 0.1932 | 0.1804 |
| 3 | 0.1254 | 0.1105 | 0.1153 | 0.0996 |
| 5 | 0.0955 | 0.0838 | 0.0850 | 0.0800 |
| 10 | 0.0683 | 0.0612 | 0.0594 | 0.0558 |
| 30 | 0.0407 | 0.0324 | 0.0344 | 0.0277 |
| 50 | 0.0318 | 0.0239 | 0.0265 | 0.0173 |
The Malaria Cell dataset comprises 1,364 images (~80,000 cells). For this evaluation, we focus on counting the number of infected cells, which make up around 5% of all cells. Due to time constraints, we do not perform any hyperparameter tuning and use the same settings as in our Sky and Reeds image experiments. Results are averaged over 500 trials. For our initial model, we fine-tune the default Faster R-CNN network on three randomly selected cell images.
Looking at the first table, we see trends similar to the image and radar experiments. AIS reduces error fairly early on as the model improves. WOR also yields a consistent improvement, with a larger impact after around 10% of the data is labeled.
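For concreteness, here is a minimal sketch of how a fractional error rate of this form can be computed; the |estimate − true| / true definition and all numbers below are illustrative assumptions, not figures from our experiments:

```python
import numpy as np

def fractional_error(trial_estimates, true_total):
    """Mean of |estimate - true| / true over independent trials."""
    est = np.asarray(trial_estimates, dtype=float)
    return float(np.mean(np.abs(est - true_total) / true_total))

# Illustrative only: 500 noisy trial estimates of a hypothetical true count of 4,000
rng = np.random.default_rng(0)
print(fractional_error(4000 + rng.normal(0, 300, size=500), 4000))
```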
Fractional Error Rate On Building Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.9708 | 0.9708 | 0.9863 | 0.9233 |
| 3 | 0.8707 | 0.8757 | 0.7920 | 0.8413 |
| 5 | 0.7706 | 0.7804 | 0.6720 | 0.7185 |
| 10 | 0.5653 | 0.5807 | 0.5517 | 0.5742 |
| 30 | 0.3961 | 0.3530 | 0.3586 | 0.2931 |
| 50 | 0.3025 | 0.1910 | 0.2707 | 0.1595 |
For damaged building detection, we focus on the Palu Tsunami subset of xBD, which contains 113 satellite images of the shoreline affected by the tsunami. We count the number of damaged buildings, i.e., those labeled “major-damage” or “destroyed”. Only the post-disaster images are given to the model. We use the same hyperparameters as before, and results are averaged over 1,000 trials. The initial model is trained on five randomly selected images.
The results on this dataset follow similar trends: WOR provides the most benefit when a larger fraction is labeled, while AIS is more helpful early on. There is a bit more noise with this dataset, so we plan to average over more trials in the next version of the paper. We also note that our performance is lower than the numbers reported in DISCount [3], as they use a detector specifically trained for damaged building detection, whereas we use a simple ImageNet-pretrained backbone fine-tuned on just 5 satellite images.
[1] Ljosa, V., Sokolnicki, K. L., & Carpenter, A. E. (2012). Annotated high-throughput microscopy image sets for validation. Nature Methods, 9(7), 637.
[2] Gupta, R., et al, "Creating xBD: A Dataset for Assessing Building Damage from Satellite Imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[3] Perez, G., Maji, S., & Sheldon, D. (2024). DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling. Proceedings of the AAAI Conference on Artificial Intelligence, 38(20), 22294-22302.
I would like to thank the authors for their detailed response. I believe the proposed revision plan will certainly improve the clarity of the proposed scheme's novelty, significance, and technical details (especially against other relevant methods and baselines).
I especially appreciate the additional evaluation results based on the "malaria-infected cell counting" and the "damaged building counting" tasks. It's good to see that the proposed method shows similar trends in these two additional tasks.
Based on the additional results and explanations provided by the authors, I'm raising my original evaluation score. Thank you.
This paper addresses the problem of how to rapidly train a model with human input in the loop. The paper focuses on training a detector for the purpose of measuring a natural phenomenon, although it seems reasonable that the same technique could be used for other purposes. The approach builds on prior art in which data are subsampled for labelling, but extends it by focusing on sampling without replacement, comparing different weighting methods in the context of sampling without replacement, and deriving a new estimate of the variance. The performance is evaluated on two types of bird detectors.
Strengths and Weaknesses
Overall I quite liked this paper. The problem is clear, the writing is clear and the contribution is clear.
I have a few minor technical concerns, but I do not view these as serious problems. Firstly, the acquisition function assumes the estimates across the data set are independent. I note that the authors pay quite a lot of attention to the loss of independence between timesteps, but there is also correlation within the sample set, e.g., correlation between different tiles in the image. In the active sensing domain (which is different to this domain) it is common practice to learn a Gaussian process that describes how sample sets within a time step might be correlated.
Secondly, the variance estimates and the acquisition function as written do not directly incorporate a variance estimate from the AI model. The model provides only the estimated number of measurements, but it is relatively straightforward to modify it to return a variance as well, especially since line 8 of Algorithm 1 requires retraining the AI model. Incorporating the variance estimate from the AI model providing f(s_t) would allow, for instance, the model to avoid asking for estimates (line 7 of Algorithm 2) in regions where the signal is fundamentally ambiguous (e.g., a high-noise part of the image).
Finally, providing a variance estimate from the AI model and a GP model of correlations across the domain would allow the overall model to predict where measurements are most likely to improve. Asking for labels in a region that is high variance and strongly correlated with a labelled region that is also high variance would most likely be unhelpful. In the active learning setting (which is also related but different to this domain), it is most helpful to ask for labels where the learner performance is predicted to improve.
The experimental results are good, but fairly specific. It would have been helpful to perform a similar analysis on a different type of data. I appreciated the error bars on the right-hand panel of Figure 1, but it was not clear whether these were ground-truth error bars averaged over trials or the predicted variances. I would also have liked to see these more carefully analysed in the experimental results section.
The limitations section was appropriate.
I liked this paper and would argue for acceptance.
Questions
- Are the error bars on the right-hand panel of Figure 1 ground-truth error bars or predicted variances? How do they compare? Can you provide them for both domains?
- Is there a reason not to incorporate a variance estimate from the AI model itself?
Limitations
Yes
Final Justification
I read the authors' response, and I am still generally supportive of this paper.
Formatting Issues
N/A
Clarification on Error Bars (Fig. 1)
“Are the error bars ground truth or predicted variances? Can you provide for both domains?”
The error bars are 95% confidence intervals based on predicted variances. Since this figure is from a single trial (unlike the averaged results in the results section), ground-truth variances are not available. Thank you for pointing this out; we will clarify it in the next version. More detailed results on predicted variances are included in Figure A7 of the supplemental material.
Model Variance Incorporation
“Is there a reason not to incorporate a variance estimate from the AI model itself?”
We want to clarify that we focus on the variance of the estimate, which is different from the variance of the AI model. There are two challenges to using the uncertainty of the AI model. First, estimating the uncertainty of an AI model sometimes requires more advanced methods such as ensembling or Bayesian neural networks. Our framework is generic, and such specific improvements can be introduced as extensions of this work. Second, the link between the model's variance and the variance of the estimate is indirect: the predicted measurements are normalized and appear in the denominators of the estimator, which by itself introduces interesting but non-trivial statistical problems.
Sample Correlation Concerns
“There is correlation within the sample set, e.g., between tiles in an image; could consider a GP model.”
These comments are very astute and touch on directions we are actively considering. A concrete example is the “reeds” image (Figure 1), where birds follow a clear spatial distribution.
First, we would like to point out that this correlation does not invalidate any of our analysis. This is because we work within a finite-sample framework, where all images and ground-truth counts are fixed and non-random. This differs from a typical ML setup, which assumes data are drawn from an underlying distribution. Philosophically, this is because we are not interested in generalizing to future images from the same distribution; we only care about performance on the large but finite set of images. As a consequence, the only randomness in our estimator is the sequence of sampled units and any randomness in the training of the AI model at each step. Our bias and variance analyses correctly handle these sources of randomness by considering the bias and variance at each step due to sampling, conditional on all prior choices.
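To make the finite-sample view concrete, here is a minimal sketch (not our implementation; all names and numbers are illustrative) of an importance-sampling count estimator over a fixed set of tiles, where the only randomness is which indices are sampled:

```python
import numpy as np

def is_count_estimate(true_counts, pred_counts, n_samples, rng):
    """Unbiased estimate of sum(true_counts), sampling tiles with prob. ∝ pred_counts."""
    y = np.asarray(true_counts, dtype=float)       # fixed, non-random ground truth
    q = np.asarray(pred_counts, dtype=float) + 1e-6
    q /= q.sum()                                   # proposal over the finite pool
    idx = rng.choice(len(q), size=n_samples, p=q)  # the only source of randomness
    return float(np.mean(y[idx] / q[idx]))         # E[y_i / q_i] = sum(y) under i ~ q

rng = np.random.default_rng(0)
print(is_count_estimate([3, 0, 7, 1, 12, 5], [2.5, 0.3, 6.0, 1.5, 10.0, 4.0],
                        n_samples=4, rng=rng))
```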
That said, considering correlation among images could be very useful. One promising opportunity is to better predict counts and thus improve proposal distributions to reduce the variance of active measurement. As an example, in the reeds image we have observed spatial auto-correlation in the residuals of the AI model. Fitting a GP can exploit this spatial structure to improve predictions for unlabeled tiles.
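A minimal sketch of this idea (using scikit-learn for the GP; the tile coordinates, kernel choice, and all names are assumptions for illustration, not part of the paper):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gp_corrected_counts(xy_labeled, residuals, xy_all, pred_counts):
    """Fit a GP to spatial residuals (true - predicted) on labeled tiles and
    use it to correct the detector's counts on all tiles."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=50.0) + WhiteKernel(),
                                  normalize_y=True)
    gp.fit(np.asarray(xy_labeled), np.asarray(residuals))
    corrected = np.asarray(pred_counts, dtype=float) + gp.predict(np.asarray(xy_all))
    return np.clip(corrected, 0.0, None)  # counts cannot be negative

# The corrected counts could then define the next proposal, q_i ∝ corrected_i.
```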
As you point out, a GP can also be used to model uncertainty in an active learning framework to help select samples that will improve the AI model as much as possible. This is an interesting direction for future work (we mention it very briefly in Lines 359–362).
Domain Generality
“Experimental results are good but fairly specific; would have been helpful to test on different data types.”
We thank the reviewers for their suggestions of additional domains in which to apply our work. In response, we performed experiments on two additional datasets: malaria-infected cell counting and damaged building counting from satellite images. Specifically, we used image set BBBC041v1, available from the Broad Bioimage Benchmark Collection [1], and satellite images of the Palu Tsunami from the xBD dataset [2].
Fractional Error Rate On Malaria Cell Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.1946 | 0.1878 | 0.1932 | 0.1804 |
| 3 | 0.1254 | 0.1105 | 0.1153 | 0.0996 |
| 5 | 0.0955 | 0.0838 | 0.0850 | 0.0800 |
| 10 | 0.0683 | 0.0612 | 0.0594 | 0.0558 |
| 30 | 0.0407 | 0.0324 | 0.0344 | 0.0277 |
| 50 | 0.0318 | 0.0239 | 0.0265 | 0.0173 |
The Malaria Cell dataset comprises 1,364 images (~80,000 cells). For this evaluation, we focus on counting the number of infected cells, which make up around 5% of all cells. Due to time constraints, we do not perform any hyperparameter tuning and use the same settings as in our Sky and Reeds image experiments. Results are averaged over 500 trials. For our initial model, we fine-tune the default Faster R-CNN network on three randomly selected cell images.
Looking at the first table, we see trends similar to the image and radar experiments. AIS reduces error fairly early on as the model improves. WOR also yields a consistent improvement, with a larger impact after around 10% of the data is labeled.
Fractional Error Rate On Building Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.9708 | 0.9708 | 0.9863 | 0.9233 |
| 3 | 0.8707 | 0.8757 | 0.7920 | 0.8413 |
| 5 | 0.7706 | 0.7804 | 0.6720 | 0.7185 |
| 10 | 0.5653 | 0.5807 | 0.5517 | 0.5742 |
| 30 | 0.3961 | 0.3530 | 0.3586 | 0.2931 |
| 50 | 0.3025 | 0.1910 | 0.2707 | 0.1595 |
For damaged building detection, we focus on the Palu Tsunami subset of xBD, which contains 113 satellite images of the shoreline affected by the tsunami. We count the number of damaged buildings, i.e., those labeled “major-damage” or “destroyed”. Only the post-disaster images are given to the model. We use the same hyperparameters as before, and results are averaged over 1,000 trials. The initial model is trained on five randomly selected images.
The results on this dataset follow similar trends: WOR provides the most benefit when a larger fraction is labeled, while AIS is more helpful early on. There is a bit more noise with this dataset, so we plan to average over more trials in the next version of the paper. We also note that our performance is lower than the numbers reported in DISCount [3], as they use a detector specifically trained for damaged building detection, whereas we use a simple ImageNet-pretrained backbone fine-tuned on just 5 satellite images.
[1] Ljosa, V., Sokolnicki, K. L., & Carpenter, A. E. (2012). Annotated high-throughput microscopy image sets for validation. Nature Methods, 9(7), 637.
[2] Gupta, R., et al, "Creating xBD: A Dataset for Assessing Building Damage from Satellite Imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[3] Perez, G., Maji, S., & Sheldon, D. (2024). DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling. Proceedings of the AAAI Conference on Artificial Intelligence, 38(20), 22294-22302.
Thank you for the detailed response. I am still generally supportive of this paper.
The paper focuses on efficiently measuring objects at scale. This problem is especially relevant for analyzing datasets in scientific domains. The paper introduces active measurement, a framework that uses a human-in-the-loop pipeline and machine learning models to assist with measurements. Collecting labels for a large number of instances is costly and time-consuming, making it difficult to obtain accurate measurements of items. Inspired by active testing and active learning, this framework iteratively gathers labels from humans and refines the estimates of the items. The work proposes new unbiased estimators, weighting schemes, and confidence intervals for improved estimates. The experiments show that they significantly reduce the estimation errors compared to the baselines.
Strengths and Weaknesses
Strengths
- The paper is well written. The paper does a great job of motivating the problem of estimating counts at scale.
- The paper effectively addresses large-scale counting issues by developing new estimators, weighting methods, and confidence intervals for human-in-the-loop scenarios.
- The paper makes strong theoretical contributions and provides comprehensive proofs for several propositions in the paper.
- The experiment setup is also reasonable and fair. The experiments show that the proposed method consistently shows lower fractional errors compared to the baselines.
Weaknesses
Experiments: The paper primarily focuses on scientific datasets related to bird counting. While the results on these datasets are strong, the paper would be strengthened by including results on popular crowd counting datasets [a, b].
LURE: The paper improves over LURE weights for a pool-based setting with sampling without replacement. I would like to see whether the improvements from Proposition 3 benefit active learning frameworks as well.
Nit (Lines 23–34): The authors describe the measurement problem they focus on in detail. Could they cite a reference to prior work that identifies the same problem?
References
[a] Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. ECCV 2018.
[b] Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images. CVPR 2013.
Questions
Please see weaknesses.
Limitations
yes
Final Justification
I have been in support of this paper. The authors have agreed to add additional references and experiments on additional datasets in the paper.
Formatting Issues
N/A
Application to Active Learning
“Would like to see if Proposition 3 improvements help active learning frameworks.”
In the context of active learning, our theoretical analyses mirror two types of bias: (1) bias due to sampling without replacement and (2) bias due to a varying acquisition function. Because of these biases, samples are not acquired uniformly and must be reweighted. Results similar to ours could be derived by formalizing the variance with respect to the acquisition functions over the sample space. We believe this may lead to novel weighting strategies that benefit active learning frameworks.
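For reference, here is a minimal sketch of the LURE-style weights we improve over, following the standard formulation from the active testing and active learning literature; the code is illustrative only and assumes the labeling budget M is strictly smaller than the pool size N:

```python
import numpy as np

def lure_weights(q_acquired, pool_size):
    """LURE weights for M units drawn sequentially without replacement from a
    pool of N units; q_acquired[m-1] is the proposal probability, among the
    units still unlabeled at step m, of the unit acquired at that step.
    Assumes M < N."""
    M, N = len(q_acquired), pool_size
    w = np.empty(M)
    for m, q in enumerate(q_acquired, start=1):
        w[m - 1] = 1.0 + (N - M) / (N - m) * (1.0 / ((N - m + 1) * q) - 1.0)
    return w

# Weighted estimate of the pool-average loss:
#   np.mean(lure_weights(q_acquired, N) * observed_losses)
```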
Missing Reference
“Could they cite a reference to prior work that identifies the same measurement problem?”
Thanks for pointing out the additional references on counting problems above. We will discuss them in our next version. For this work, we mainly focus on counting problems with biological and ecological implications. For example, there have been similar problems of counting penguins [1], bats [2], and other birds [3] in those domains.
Additional Datasets
“The paper would be strengthened if it included results on popular crowd counting datasets [ECCV 2018, CVPR 2013].”
We thank the reviewers for their suggestions of additional domains in which to apply our work. In response, we performed experiments on two additional datasets: malaria-infected cell counting and damaged building counting from satellite images. Specifically, we used image set BBBC041v1, available from the Broad Bioimage Benchmark Collection [4], and satellite images of the Palu Tsunami from the xBD dataset [5].
Fractional Error Rate On Malaria Cell Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.1946 | 0.1878 | 0.1932 | 0.1804 |
| 3 | 0.1254 | 0.1105 | 0.1153 | 0.0996 |
| 5 | 0.0955 | 0.0838 | 0.0850 | 0.0800 |
| 10 | 0.0683 | 0.0612 | 0.0594 | 0.0558 |
| 30 | 0.0407 | 0.0324 | 0.0344 | 0.0277 |
| 50 | 0.0318 | 0.0239 | 0.0265 | 0.0173 |
The Malaria Cell dataset comprises 1,364 images (~80,000 cells). For this evaluation, we focus on counting the number of infected cells, which make up around 5% of all cells. Due to time constraints, we do not perform any hyperparameter tuning and use the same settings as in our Sky and Reeds image experiments. Results are averaged over 500 trials. For our initial model, we fine-tune the default Faster R-CNN network on three randomly selected cell images.
Looking at the first table, we see trends similar to the image and radar experiments. AIS reduces error fairly early on as the model improves. WOR also yields a consistent improvement, with a larger impact after around 10% of the data is labeled.
Fractional Error Rate On Building Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.9708 | 0.9708 | 0.9863 | 0.9233 |
| 3 | 0.8707 | 0.8757 | 0.7920 | 0.8413 |
| 5 | 0.7706 | 0.7804 | 0.6720 | 0.7185 |
| 10 | 0.5653 | 0.5807 | 0.5517 | 0.5742 |
| 30 | 0.3961 | 0.3530 | 0.3586 | 0.2931 |
| 50 | 0.3025 | 0.1910 | 0.2707 | 0.1595 |
For damaged building detection, we focus on the Palu Tsunami subset of xBD, which contains 113 satellite images of the shoreline affected by the tsunami. We count the number of damaged buildings, i.e., those labeled “major-damage” or “destroyed”. Only the post-disaster images are given to the model. We use the same hyperparameters as before, and results are averaged over 1,000 trials. The initial model is trained on five randomly selected images.
The results on this dataset follow similar trends: WOR provides the most benefit when a larger fraction is labeled, while AIS is more helpful early on. There is a bit more noise with this dataset, so we plan to average over more trials in the next version of the paper. We also note that our performance is lower than the numbers reported in DISCount [6], as they use a detector specifically trained for damaged building detection, whereas we use a simple ImageNet-pretrained backbone fine-tuned on just 5 satellite images.
[1] Arteta, C., Lempitsky, V., & Zisserman, A. (2016, September). Counting in the wild. In European conference on computer vision (pp. 483-498). Cham: Springer International Publishing.
[2] Horn, J. W., & Kunz, T. H. (2008). Analyzing NEXRAD doppler radar images to assess nightly dispersal patterns and population trends in Brazilian free-tailed bats (Tadarida brasiliensis). Integrative and Comparative Biology, 48(1), 24-39.
[3] Buler, J. J., Randall, L. A., Fleskes, J. P., Barrow Jr, W. C., Bogart, T., & Kluver, D. (2012). Mapping wintering waterfowl distributions using weather surveillance radar. PloS one, 7(7), e41571.
[4] Ljosa, V., Sokolnicki, K. L., & Carpenter, A. E. (2012). Annotated high-throughput microscopy image sets for validation. Nature Methods, 9(7), 637.
[5] Gupta, R., et al, "Creating xBD: A Dataset for Assessing Building Damage from Satellite Imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[6] Perez, G., Maji, S., & Sheldon, D. (2024). DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling. Proceedings of the AAAI Conference on Artificial Intelligence, 38(20), 22294-22302.
This paper explores methods for active measurement (human in the loop) to improve the precision of measurements and provide unbiased Monte Carlo error estimates under an imperfect AI model. The approach is tested on counting birds in two different datasets, and improved performance is demonstrated.
Strengths and Weaknesses
The paper is clear and presents the approach for the biological domain. It describes novel contributions beyond the state of the art (DISCount). Active measurement is an under-explored area in the sciences, and this study systematically examines augmentations to prior work.
It would be nice to understand when and why WOR vs. AIS gives larger or smaller improvements, as the relative contributions vary depending on the dataset. It would also be nice to demonstrate this on another domain (e.g., cell counting) instead of two similar domain applications.
Questions
It would be nice to understand when and why WOR vs. AIS gives larger or smaller improvements, as the relative contributions vary depending on the dataset. It would also be nice to demonstrate this on another domain (e.g., cell counting) instead of two similar domain applications.
Limitations
Demonstrations on other domains to understand how the approach may vary would make the result more significant and applicable.
Formatting Issues
none
WOR vs AIS Clarification
“It would be nice to understand when and why WOR vs AIS gives larger and smaller improvements in the dataset.”
In theory, WOR consistently reduces estimation error regardless of the detector, with improvements becoming more pronounced as more units are labeled. In contrast, AIS yields the greatest benefit when fine-tuning leads to substantial improvements in model predictions. Additional gains arise when our weighting strategy aligns with the (unknown) error rates. That said, our weighting scheme includes guarantees that bound the worst-case error, ensuring robustness even when this alignment is imperfect.
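As a toy illustration of the first point (purely illustrative Python using plain uniform sampling of a fixed population, not our weighted estimators), sampling without replacement tracks the population mean increasingly better than sampling with replacement as the labeled fraction grows:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.lognormal(mean=1.0, sigma=1.0, size=200)  # fixed "true" unit counts
true_mean = population.mean()

for frac in (0.1, 0.3, 0.5):
    n = int(frac * len(population))
    err_wr, err_wor = [], []
    for _ in range(2000):
        wr = rng.choice(population, size=n, replace=True)    # with replacement
        wor = rng.choice(population, size=n, replace=False)  # without replacement
        err_wr.append(abs(wr.mean() - true_mean) / true_mean)
        err_wor.append(abs(wor.mean() - true_mean) / true_mean)
    print(f"{int(frac * 100):2d}% labeled: WR {np.mean(err_wr):.3f}  WOR {np.mean(err_wor):.3f}")
```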
Domain Generality
“It would also be nice to demonstrate this on another domain (e.g., cell counting) instead of two similar domain applications.”
We thank the reviewers for their suggestions of additional domains in which to apply our work. In response, we performed experiments on two additional datasets: malaria-infected cell counting and damaged building counting from satellite images. Specifically, we used image set BBBC041v1, available from the Broad Bioimage Benchmark Collection [1], and satellite images of the Palu Tsunami from the xBD dataset [2].
Fractional Error Rate On Malaria Cell Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.1946 | 0.1878 | 0.1932 | 0.1804 |
| 3 | 0.1254 | 0.1105 | 0.1153 | 0.0996 |
| 5 | 0.0955 | 0.0838 | 0.0850 | 0.0800 |
| 10 | 0.0683 | 0.0612 | 0.0594 | 0.0558 |
| 30 | 0.0407 | 0.0324 | 0.0344 | 0.0277 |
| 50 | 0.0318 | 0.0239 | 0.0265 | 0.0173 |
The Malaria Cell dataset comprises 1,364 images (~80,000 cells). For this evaluation, we focus on counting the number of infected cells, which make up around 5% of all cells. Due to time constraints, we do not perform any hyperparameter tuning and use the same settings as in our Sky and Reeds image experiments. Results are averaged over 500 trials. For our initial model, we fine-tune the default Faster R-CNN network on three randomly selected cell images.
Looking at the first table, we see trends similar to the image and radar experiments. AIS reduces error fairly early on as the model improves. WOR also yields a consistent improvement, with a larger impact after around 10% of the data is labeled.
Fractional Error Rate On Building Dataset
| % Labeled | DIS | DIS+WOR | DIS+AIS | DIS+AIS+WOR |
|---|---|---|---|---|
| 1 | 0.9708 | 0.9708 | 0.9863 | 0.9233 |
| 3 | 0.8707 | 0.8757 | 0.7920 | 0.8413 |
| 5 | 0.7706 | 0.7804 | 0.6720 | 0.7185 |
| 10 | 0.5653 | 0.5807 | 0.5517 | 0.5742 |
| 30 | 0.3961 | 0.3530 | 0.3586 | 0.2931 |
| 50 | 0.3025 | 0.1910 | 0.2707 | 0.1595 |
For damaged building detection, we focus on the Palu Tsunami subset of xBD, which contains 113 satellite images of the shoreline affected by the tsunami. We count the number of damaged buildings, i.e., those labeled “major-damage” or “destroyed”. Only the post-disaster images are given to the model. We use the same hyperparameters as before, and results are averaged over 1,000 trials. The initial model is trained on five randomly selected images.
The results on this dataset follow similar trends: WOR provides the most benefit when a larger fraction is labeled, while AIS is more helpful early on. There is a bit more noise with this dataset, so we plan to average over more trials in the next version of the paper. We also note that our performance is lower than the numbers reported in DISCount [3], as they use a detector specifically trained for damaged building detection, whereas we use a simple ImageNet-pretrained backbone fine-tuned on just 5 satellite images.
[1] Ljosa, V., Sokolnicki, K. L., & Carpenter, A. E. (2012). Annotated high-throughput microscopy image sets for validation. Nature Methods, 9(7), 637.
[2] Gupta, R., et al, "Creating xBD: A Dataset for Assessing Building Damage from Satellite Imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[3] Perez, G., Maji, S., & Sheldon, D. (2024). DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling. Proceedings of the AAAI Conference on Artificial Intelligence, 38(20), 22294-22302.
This paper introduces a human-in-the-loop framework that combines predictions with importance sampling and iterative human labeling. To obtain unbiased estimates for scientific measurements, the authors propose estimators, weighting schemes, and confidence intervals for sampling without replacement. While the reviewers appreciate the theoretical contributions and potential impact, I encourage the authors to include a more candid discussion of the method's practical limitations. For example, this could cover (i) the computational overhead of iterative model fine-tuning, (ii) cases where precision may still be insufficient for high-stakes applications, (iii) methodological compromises in evaluation and algorithm design, e.g., the fixed-checkpoint approximation, and (iv) the scope of this work.