PaperHub
NeurIPS 2024 · Poster
Overall rating: 4.5 / 10 (4 reviewers; min 3, max 6, std 1.1)
Individual ratings: 4, 6, 5, 3
Confidence: 3.0
Correctness: 3.0
Contribution: 2.8
Presentation: 2.8

Segment Any Change

OpenReview · PDF
Submitted: 2024-05-02 · Updated: 2024-11-06
TL;DR

Our training-free adaptation enables the Segment Anything Model (SAM) to perform zero-shot change detection by exploiting SAM's latent space

Abstract

Visual foundation models have achieved remarkable results in zero-shot image classification and segmentation, but zero-shot change detection remains an open problem. In this paper, we propose the segment any change models (AnyChange), a new type of change detection model that supports zero-shot prediction and generalization on unseen change types and data distributions. AnyChange is built on the segment anything model (SAM) via our training-free adaptation method, bitemporal latent matching. By revealing and exploiting intra-image and inter-image semantic similarities in SAM's latent space, bitemporal latent matching endows SAM with zero-shot change detection capabilities in a training-free way. We also propose a point query mechanism to enable AnyChange's zero-shot object-centric change detection capability. We perform extensive experiments to confirm the effectiveness of AnyChange for zero-shot change detection. AnyChange sets a new record on the SECOND benchmark for unsupervised change detection, exceeding the previous SOTA by up to 4.4% F$_1$ score, and achieving comparable accuracy with negligible manual annotations (1 pixel per image) for supervised change detection. Code is available at https://github.com/Z-Zheng/pytorch-change-models.
Keywords
change detection, remote sensing, zero-shot adaptation, visual foundation models, satellite imagery

Reviews and Discussion

Review (Rating: 4)

This paper proposes a new type of change detection model that supports zero-shot prediction and generalization to unseen change types and data distributions. The proposed method, called AnyChange, is built on the Segment Anything Model (SAM) via a training-free adaptation method. By revealing and exploiting intra-image and inter-image semantic similarities in SAM's latent space, AnyChange can perform change detection. The authors also design a point query mechanism for AnyChange, leveraging SAM's point prompt mechanism and the bitemporal latent matching to filter for the desired object changes.

Strengths

  1. Building a foundation model for object change detection is novel and interesting.
  2. The authors provide comprehensive experimental results on four datasets. The designed baselines, such as the SAM variants (SAM+Mask Match and SAM+CVA Match) and DINOv2+CVA, are reasonable.
  3. The authors consider various settings to demonstrate the ability of the proposed method and better reveal its strengths.

Weaknesses

  1. I have one main concern about the capacity of SAM. Since SAM is optimized with dense masks carrying high-level implicit semantics, I doubt whether SAM can detect very minor changes when the authors perform bitemporal latent matching in SAM's feature space. The authors should provide more evidence on this point.

  2. Have the authors evaluated the robustness of the proposed method to illumination/color changes and viewpoint changes?

  3. As reported in Table 3, why did the authors not report experimental results using 10% GT and 100% GT? From the current results, the proposed method does not demonstrate an advantage over existing algorithms, even though I understand it uses fewer annotations. The accuracy is far below existing results, which is not acceptable. The current results only demonstrate that the proposed method has stronger zero-shot ability, but I am curious about its upper bound under the same experimental setting. Furthermore, I do not really understand the meaning of "This confirms the potential of AnyChange as a change data engine for supervised object change detection." in Lines 308-309.

  4. The authors should create a new table combining results (on the S2Looking dataset) from Table 1 and Table 3 to provide a better comparison. Table 1 demonstrates the zero-shot ability of the designed baselines, while Table 3 reports results under the supervised setting. However, I also noticed that AnyChange (Oracle) only achieves 62.2, 57.6, and 67.6 for F1, Prec., and Rec., respectively, on the S2Looking dataset, whereas the best results on S2Looking in Table 3 are 67.9, 70.3, and 65.7. I am very doubtful about the ability of the proposed method, given the statement "Oracles obtained via supervised learning have superior precision" (Lines 248-249). This observation also aligns with my doubt about whether it is suitable to perform change detection built on SAM.

If the authors could address such main concerns, I am willing to raise my score after the rebuttal.

Some minor issues:

  1. The generalization ability of the proposed method to unseen images should be evaluated, as the authors claim the proposed model is a foundation model.

  2. The figures should be reorganized to better illustrate the differences between the pre-event image and the post-event image.

  3. The best results in Table 3 should be bold.

Questions

Please refer to items 1 and 4 in the weaknesses part. I am very skeptical about SAM's ability to detect object changes. From the existing experimental results, the proposed AnyChange does not show any superiority over existing algorithms under the supervised setting.

Limitations

The results of the current version are not convincing enough to demonstrate that the proposed AnyChange is better than existing algorithms under the supervised setting. I am also skeptical about SAM's ability to detect minor changes, since SAM is mainly optimized with dense masks carrying implicit semantics. From my point of view, given its training nature, SAM should not have a strong ability to identify small changes.

Author Response

To Reviewer n4V1

W1: I doubt whether SAM has the ability to detect some very minor changes.

As you suggested, we demonstrate the case of tiny/minor changes, e.g., small vehicle changes; please check Figure 4 in the rebuttal PDF. The main observation is that directly applying AnyChange to the original image overlooks these subtle changes (see the first row of Figure 4). After we bilinearly upsample the red-box region by 2x and then apply AnyChange to it, some tiny/minor changes can be detected (see the second row of Figure 4). This observation shows that our method has some ability to detect tiny object changes, although it is not perfect. Future works can take our method as a strong baseline and improve this point further.


W2: the robustness to the illumination/color changes and viewpoint changes

  • Illumination/color changes were simulated by randomly applying color jitter to the pre- and post-event images independently (a minimal sketch of this simulation appears after this list). We used ViT-B as the backbone for fast experiments. The results are presented below. The performance jitter of mask AR is less than 2% (-1.9%, +0.1%, -1.9%, -1.4%). We think this sensitivity to color variance is acceptable.
| Condition | LEVIR-CD (F1 / Prec / Rec / mask AR) | S2Looking (F1 / Prec / Rec / mask AR) | xView2 (F1 / Prec / Rec / mask AR) | SECOND (F1 / Prec / Rec / mask AR) |
| --- | --- | --- | --- | --- |
| baseline | 23.4 / 13.7 / 83.0 / 32.6 | 7.4 / 3.9 / 94.0 / 48.3 | 13.4 / 7.6 / 59.3 / 27.8 | 44.6 / 30.5 / 83.2 / 27.0 |
| w/ color jitter | 22.6 / 13.1 / 84.2 / 30.7 | 7.4 / 3.8 / 94.1 / 48.4 | 13.5 / 7.7 / 54.7 / 25.9 | 42.4 / 28.7 / 81.7 / 26.4 |
  • Viewpoint change cannot be easily simulated in image space, so the above experiment could not be repeated for it; sorry for this. However, we did consider evaluating the viewpoint robustness of our method, since S2Looking is exactly an off-nadir/side-looking (i.e., different-viewpoint) building change detection dataset. Compared with our baselines, AnyChange performs better under viewpoint changes.
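For reference, here is a minimal sketch of how such independent color jitter can be applied with torchvision; the variable names and jitter strengths are illustrative assumptions, not the exact values used in our experiments.

```python
import torch
from torchvision import transforms as T

# Dummy pre-/post-event image tensors (3 x H x W, values in [0, 1]) standing in
# for a real bitemporal pair; the jitter strengths below are placeholders.
img_t1 = torch.rand(3, 256, 256)
img_t2 = torch.rand(3, 256, 256)

jitter = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)

# The jitter parameters are re-sampled on every call, so the pre- and
# post-event images receive different photometric perturbations.
img_t1_jittered = jitter(img_t1)
img_t2_jittered = jitter(img_t2)
```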

W3: Concerns about Table 3.

We believe there is a misunderstanding about Table 3; sorry for the confusion. Table 3 presents a data-centric experiment that explores the possibility of using AnyChange as a change data engine. The network architecture is fixed to ChangeStar (1x96). The only difference in the comparison is the training labels used by the network. Our model (last row) is the network trained on the predictions produced by AnyChange. The compared models (5th, 9th, and 10th rows) are the same network trained with 100%, 1%, and 0.1% of the ground truth labels (GT).

W3.1: In Table 3, why did the authors not report results using 10% GT and 100% GT?

In our original submission, we reported the performance of the model trained on 100% GT (see the 5th row of Table 3), so the quantitative gap between the labels generated by AnyChange and 100% GT has already been revealed. When reducing the amount of GT to 1% and 0.1%, our result is better than both, even though 0.1% (4.7x10^6 pixels) still greatly exceeds the number of labels we used for prompting (3.5x10^3 pixels). Therefore, we decided that testing the 10% GT case was unnecessary.

W3.2/Q1: the proposed method did not demonstrate the advantage over existing algorithms even though I know they used fewer annotations. The accuracy is far below the existing results, which is not acceptable.

  • 100% GT comes from manual labeling, while our pseudo-labels are generated by AnyChange with point prompts in a zero-shot way. Therefore, it is reasonable and acceptable that the model trained on 100% GT has higher accuracy than ours: pure manual labeling is the upper bound of our AnyChange-based data engine.

  • The advantages of our method are: 1) it is training-free; 2) it enables human-in-the-loop labeling; 3) it has better zero-shot performance; 4) it achieves SOTA performance under the unsupervised setting; 5) it achieves better performance under the supervised setting with fewer annotations. None of these are achieved by existing algorithms.

W3.3: upper bound of the proposed method when conducted in the same experimental setting.

We conducted this experiment by training a pure pixel-based ViT-B-based AnyChange model in a fully supervised manner, exactly as done with the other methods in Table 3. The F1, precision, and recall are 68.2, 71.8, and 64.9, respectively. This is the upper bound of AnyChange as a change detection network architecture.

W3.4: the meaning of "This confirms the potential of AnyChange as a change data engine for supervised object change detection."

Here we claim that AnyChange can be utilized as a more efficient and interactive labeling tool, functioning as a change data engine capable of providing superior change pseudo-labels with fewer manual labels as prompts. This claim is supported by Table 3 (9th-11th rows).


W4.1: The gap between AnyChange (Oracle) [F1: 62.2] and the previous best result [F1: 67.9].

  • AnyChange (Oracle) is an instance-level, promptable change detector, not specialized in pixel-level change detection. In Table 1, we need to evaluate it at both pixel and instance levels. Therefore, its trainable modules consist only of LoRA layers and a change score network. This setup allows us to determine the upper bound of embedding dissimilarity while preserving its promptability and zero-shot instance prediction capabilities. If we drop this setup and train a pixel-level specialist, it achieves a better F1 of 68.2%.

  • The previous best result is a pixel-level change detection specialist with extra synthetic data pre-training and 100% GT fine-tuning.

W4.2: Doubt about the ability of the proposed method, given the statement "Oracles obtained via supervised learning have superior precision".

We only claimed this point for Table 1: AnyChange (Oracle) with supervised learning has higher precision (as a metric) than its zero-shot counterparts.

Comment

I appreciate the authors' effort to address my questions. However, after reading the rebuttal, I still have concerns about the motivation for utilizing the feature space of SAM to detect minor changes between images. SAM groups similar elements together; due to its training nature, its feature space is not ideal for detecting minor changes.

I also read the other reviews, especially the comments from reviewer MDhn, and I agree with his/her concern. Designing the bitemporal latent matching in the feature space of SAM is not that significant and lacks an intuitive explanation.

Thus I keep my rating unchanged.

Comment

Thank you for your helpful feedback. Can we assume that your other concerns have been addressed, except the one about minor changes? If not, please feel free to share them here; we are happy to provide detailed illustrations.


  • Our work is not a change detection method tailored to minor changes, and we have not claimed any superior change detection performance on minor changes. We have shown that our method can detect some small changes to some extent; however, this is beyond the scope of our paper. What we claim is still centered on zero-shot capability. We understand your point that tiny changes are hard for SAM to capture; however, please also understand that this point is orthogonal to our contributions.

  • Bitemporal latent matching enables SAM to obtain zero-shot change detection capability, which addresses one of the most important problems in change detection. Our results show that bitemporal latent matching is better than the other compared matching strategies. We respect your subjective opinion that our design is not significant; if there is a design in the community you consider more significant, we are happy to compare against it as soon as possible.


Regarding the new concern about the "lack of an intuitive explanation":

Section 3.2 provides comprehensive motivation and evidence illustrating the core of bitemporal latent matching, which is to exploit intra-image and inter-image semantic similarities in SAM's latent space. We believe those two figures are very intuitive. We are sorry if you still feel they are not; we believe this is an easy-to-address problem with careful revision.


Thank you once again for your comments and feedback. We respect your opinion.

Comment

I really appreciate the authors' very prompt responses. To better support my rating and have more thoughtful discussions, I explain my reasons in detail as follows:

The authors have clearly addressed my concerns except for the generalization ability to unseen images. However, I still have some concerns about the choice of performing latent matching in the feature space of SAM. SAM was optimized with segmentation supervision, grouping similar elements together. The latent feature space is obtained after downsampling or pooling, which may weaken the ability to detect minor changes in that space. Even though the authors provide some empirical results about this, the paper still lacks a grounded analysis.

I acknowledge that the bitemporal latent matching proposed in this work is better than the other compared matching strategies. My main concern is whether it is reasonable to perform the latent matching in SAM's feature space, as raised in my initial review. In the rebuttal, the authors also said, "This observation shows that our method has some ability to detect tiny object changes, although it is not perfect. Future works can take our method as a strong baseline and improve this point further." I feel this paper still lacks an explanation of why it chose SAM's feature space for latent matching. I was not convinced by the examples provided in the rebuttal file, since the work currently lacks theoretical or intuitive explanations that would help readers better understand the motivation.

Comment

Thank you for your detailed explanations. We are happy to see most of your concerns have been addressed.


Generalization ability to unseen images (as we understand it, this is a new concern in this round)

This generalization ability has been evaluated in our unsupervised change detection experiment (Table 4). The previous SOTA, I3PE (Chen et al., 2023), trained a model on the SECOND dataset (meaning the model has seen these images) and achieved a 43.8% F1 score, while our AnyChange (ViT-B), without any training (meaning our model has not seen these images), achieved a 44.6% F1 score. This suggests that our model's generalization ability on unseen images is empirically superior. After training on these images, our method achieves a 48.2% F1 score, a large improvement over the previous SOTA's 43.8%.


Lack of grounded analysis about tiny/minor changes

  • [No claim here] As we previously stated, we hope you understand that our goal is not a change detection model tailored to tiny/minor changes; therefore, our experimental design naturally does not include this point. We have not claimed it among our contributions, and it is beyond the scope of this paper.
  • [We have qualitative results] Nevertheless, we considered adding this analysis in the rebuttal, but we found no publicly available tiny-change detection dataset that could be used during the rebuttal period, so we could only demonstrate a qualitative result. At least it shows that our model can detect tiny/minor changes to some extent, rather than fundamentally lacking this ability.

Why we chose the feature space of SAM to perform latent matching

  • [Most promising base for an open problem] In line 37, we explicitly stated our motivation for building our model upon SAM: SAM is the first image segmentation model with unprecedented zero-shot capability. It is the most promising and natural choice of a strong visual foundation model as a base for tackling a so-far unexplored open problem, since the community has had no established approach to zero-shot change detection.

  • [Preliminary probing of the latent spaces of many visual foundation models] We had been seeking a solution for zero-shot change detection for a long time. Through extensive experiments probing the latent spaces of visual foundation models, we found SAM's latent space to be the most promising. Meanwhile, SAM is promptable, unlike other foundation models such as DINOv2; this is also why we include a strong baseline based on DINOv2. All of this is based on our preliminary exploration, and Section 3.2 shows the part of it that is sufficient to support this paper.


Lack of theoretical or intuitive explanations for readers to better understand the motivation

  • In this round, the reviewer raised a new concern about theoretical explanations of SAM's latent space. However, this problem exceeds the scope of our paper; a theoretical explanation of SAM itself also remains an open problem.

  • A potential intuitive explanation is that an object change can be modeled as a semantic dissimilarity; e.g., a building and bare land have a large semantic dissimilarity, so the transition from building to bare land constitutes a change (see the expression below). This is why we propose to exploit intra-image and inter-image semantic similarities in SAM's latent space to achieve zero-shot change detection.
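In symbols (an illustrative formulation consistent with the negative-cosine-similarity confidence described in the reviews' summary of the matching rule; the notation is ours, not the paper's): for mask embeddings $\mathbf{e}_t$ and $\mathbf{e}_{t+1}$ of the same region at the two timestamps, the change confidence can be written as $c = -\cos(\mathbf{e}_t, \mathbf{e}_{t+1}) = -\frac{\mathbf{e}_t \cdot \mathbf{e}_{t+1}}{\lVert\mathbf{e}_t\rVert\,\lVert\mathbf{e}_{t+1}\rVert}$, so a large semantic dissimilarity (low cosine similarity) yields a high change confidence.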

We sincerely thank you for your willingness to engage in discussion and for your acknowledgment of bitemporal latent matching; we have benefited greatly from this insightful discussion. We will enhance the motivation part based on the intuitive explanation above, as you suggested. We think this is an easy-to-address issue via revision.

We all hope for a theoretical grounding of zero-shot change detection. However, it is an unprecedented challenge that involves multiple open problems, and we need to resolve it step by step.

Comment

Thanks for the authors' very prompt response.

Generalization ability to unseen images (as we understand it, this is a new concern in this round)

I did have this concern in my initial review and here I quote:

Some minor issues: 1) The generalization ability of the proposed method to unseen images should be evaluated, as the authors claim the proposed model is a foundation model.

Please check the weakness part of my initial review again.

Thanks for the authors providing more explanation about their motivation. Including such discussions will help promote this paper.

I was not fully convinced by the explanation provided by the authors. I feel the authors should take on some core challenges of the problem addressed in this work when customizing SAM for change detection, rather than only considering that "SAM is the first image segmentation model with unprecedented zero-shot capability."

I cannot immediately understand your motivation and design when reading this work. There is no feeling of "Wow, this is right and would definitely work" while reading; I still need to think a lot and wonder why it would work.

I hope my comments will help the authors improve this work. I by no means intend to be offensive, and I hope the authors can understand my rating.

Comment

We apologize for our hasty response. Anyway, we hope that the response can address the concern about generalization ability.

We suddenly realized that your concern about motivation has multiple levels. Thank you for your patient explanation.

  • The first level is why we chose SAM as a base model. Previously, we indeed only answered at this level; we hope it has now been addressed.
  • [The extreme data collection cost motivates our training-free design] The second level is why we designed a training-free adaptation method on SAM for zero-shot change detection. This is because it is non-trivial to adapt SAM to change detection while maintaining its zero-shot generalization and promptability, due to the extreme data collection cost of large-scale change detection labels (as lines 38-40 state). Why might a training-free approach work? Because our preliminary latent-probing experiments suggested it would.
  • [Learning generalized change representations needs numerous labeled data; the relation between semantics and change motivates matching in latent space] The third level is why we perform matching in SAM's latent space. The motivation is that an object change is grounded in its semantics. Meanwhile, a good visual representation is usually well semantically grouped in its feature space, as observed in DINOv2's feature visualizations (as line 138 states). If we want to detect semantic change without numerous labeled data, we should seek these semantic differences in the feature/latent space rather than in pixel space (which would require learning a visual encoder). Therefore, we design the matching in the feature/latent space.

We hope this response helps you and future readers better understand the motivation of our work. Thank you for your patient help. We will carefully incorporate this discussion into our revised paper. Best wishes to you.

Review (Rating: 6)

The authors propose AnyChange, a novel framework for zero-shot change detection in remote sensing imagery. This framework leverages the Segment Anything Model (SAM) and introduces a "bitemporal latent matching" method to identify changes between images taken at different times. AnyChange identifies changes by comparing the semantic similarities of image regions in SAM's latent space, eliminating the need for explicit training on change detection tasks. Furthermore, the model incorporates a point query mechanism that allows for interactive, object-centric change detection through user clicks. Experimental results demonstrate AnyChange's effectiveness in various change detection scenarios, highlighting its potential as a valuable tool for researchers and practitioners alike.

Strengths

  • One of the first works to propose zero-shot change detection of remote sensing imagery. The authors repurpose SAM for comparing satellite imagery captured at two different timesteps.
  • The authors propose the bitemporal latent matching technique, which compares latent embeddings of SAM mask proposals from the bitemporal satellite imagery. The authors empirically show a high correlation between intra-image and inter-image latent embeddings, which ultimately enables detecting changes.
  • Optionally, the framework supports a human-in-the-loop point query mechanism to refine mask proposals from SAM and potentially reduce false positives.
  • Custom baselines are constructed from scratch and experiments on three change detection benchmark datasets show superior performance of AnyChange over the baselines.

Weaknesses

  • Several components of the paper are poorly explained. SAM uses the MAE's image encoder based on ViT. The image features extracted from such a model are downsampled due to the patching effect. How is the framework able to compute pixel level features for the mask proposals? Is there an interpolation step?
  • Although the idea is interesting and novel, it seems the model can easily be fooled by small radiometric changes between the timesteps or presence of other conditions such as clouds. SAM, which is extensively trained on consumer photographs, may easily confuse seasonal changes, which may not be relevant for a task. The experiments presented in the paper are on benchmark datasets and may not reflect practical applicability of the proposed framework.
  • How does the matching algorithm handle overlapping objects within an image, such as a tree canopy covering part of a road? Averaging image embeddings in such cases might lead to erroneous results.

Questions

  • Is the framework generalizable? Is SAM able to detect non-building related changes? All the experiments on the datasets shown are related to building change detection.

Limitations

Limitations are discussed in the appendix section.

Author Response

To Reviewer eUYB

W1: How is the framework able to compute pixel level features for the mask proposals? Is there an interpolation step?

There is a bilinear interpolation step that upsamples the feature map back to the original image size. We then compute the mask embedding by averaging the per-position embeddings within the mask proposal. This avoids the quantization errors that would be caused by resizing the mask geometry to fit the coarse feature map.
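A minimal PyTorch sketch of this interpolate-then-average step (the function and variable names are illustrative assumptions, not the released API):

```python
import torch
import torch.nn.functional as F

def mask_embeddings(feat_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Average per-pixel image-encoder features inside each mask proposal.

    feat_map: (C, h, w) downsampled feature map from the image encoder.
    masks:    (N, H, W) boolean mask proposals at the original image size.
    Returns:  (N, C) one embedding per mask proposal.
    """
    C = feat_map.shape[0]
    N, H, W = masks.shape
    # Bilinearly upsample the feature map back to the image resolution so the
    # masks can be applied without rasterizing them onto the coarse grid.
    feat_up = F.interpolate(feat_map[None], size=(H, W), mode="bilinear",
                            align_corners=False)[0]           # (C, H, W)
    feat_flat = feat_up.reshape(C, -1)                         # (C, H*W)
    masks_flat = masks.reshape(N, -1).float()                  # (N, H*W)
    sums = masks_flat @ feat_flat.t()                          # (N, C) summed features
    areas = masks_flat.sum(dim=1, keepdim=True).clamp_min(1)   # (N, 1) mask areas
    return sums / areas
```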


W2.1: The model may be easily impacted by radiometric changes.

We simulate the radiometric changes by randomly applying color jitter to the pre- and post-event images independently. The results are listed below:

| Condition | LEVIR-CD (F1 / Prec / Rec / mask AR) | S2Looking (F1 / Prec / Rec / mask AR) | xView2 (F1 / Prec / Rec / mask AR) | SECOND (F1 / Prec / Rec / mask AR) |
| --- | --- | --- | --- | --- |
| baseline | 23.4 / 13.7 / 83.0 / 32.6 | 7.4 / 3.9 / 94.0 / 48.3 | 13.4 / 7.6 / 59.3 / 27.8 | 44.6 / 30.5 / 83.2 / 27.0 |
| w/ color jitter | 22.6 / 13.1 / 84.2 / 30.7 | 7.4 / 3.8 / 94.1 / 48.4 | 13.5 / 7.7 / 54.7 / 25.9 | 42.4 / 28.7 / 81.7 / 26.4 |

W2.2: The experiments presented in the paper are on benchmark datasets and may not reflect practical applicability of the proposed framework.

The data in these four benchmark datasets were largely collected for real-world applications, e.g., urbanization monitoring, disaster damage assessment, and natural resource monitoring.

  • LEVIR-CD is designed for ordinary building change.
  • S2Looking is designed for building change under side-looking / off-nadir / different viewpoint observation conditions. The data were collected from GaoFen (GF), SuperView (SV), and BeiJing-2 (BJ-2) satellites.
  • xView2 aims to assess building damage changes, which includes 19 real disaster events with six disaster types (wildfire, earthquake, tsunami, hurricane, volcano, flooding). The data were collected from WorldView-2, WorldView-3, GeoEye-1.
  • SECOND is designed for land-use/land-cover changes, including 30 change types. The object changes involve the categories including non-vegetated ground surface, tree, low vegetation, water, buildings and playgrounds.

We have also demonstrated our method on practical applications in Figure 1. The disaster damage assessment cases in Figure 1 include the 2023 Kalehe DRC flooding (first row) and the 2023 Turkey-Syria earthquake (second row), events that are not included in any public benchmark dataset we used.


W3: How does the matching algorithm handle overlapping objects with an image? Such as a tree canopy covering a part of the road. Averaging image embeddings in such cases might lead to erroneous results.

We have experimented with the mentioned case. Thanks to SAM's strong generalization on object segmentation, the tree canopy and the road segments are segmented as three independent parts; please check Figure 2 in the rebuttal PDF we newly uploaded. The tree canopy's or the road's embedding only encodes information belonging to itself, so at least in this case there is no erroneous result. This implies that our matching algorithm does not need to handle overlapping objects explicitly once the visual foundation model in our framework is strong enough.


Q1: Is the framework generalizable? Is SAM able to detect non-building related changes? All the experiments on the datasets shown are related to building change detection.

Yes, our design is generic across change types. The SECOND dataset used in our experiments is a land-use/land-cover change dataset, including 6 land-cover classes (non-vegetated ground surface, tree, low vegetation, water, buildings, and playgrounds) and up to 30 change types, extending beyond building-centric changes.

Comment

I thank the authors for providing additional details and experimental results that help strengthen the paper. After carefully considering all the discussions, I have updated my score.

Comment

We are happy to see that our responses addressed your concerns and glad that you increased your score. We sincerely appreciate it, and best wishes to you.

Review (Rating: 5)

The authors address the problem of zero-shot change detection. While some models focus on zero-shot semantic segmentation, there has not been much work on zero-shot change detection. The lack of large change detection datasets makes it non-trivial to train such models from scratch with existing methods. To circumvent this, the authors propose a training-free method to adapt SAM for change detection. They utilize the semantic space of SAM to find regions of change. More specifically, they propose "Bitemporal Latent Matching": for a given image pair, they extract the mask embeddings for each object proposal and use the negative cosine similarity between two mask embeddings (at the same location at different times) as a measure of change confidence. The region proposals are then sorted by their confidence scores and selected via thresholding. Their results show that they outperform other naive baselines for zero-shot change detection.
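For concreteness, a minimal sketch of the matching rule as summarized above (variable names and the threshold are illustrative; the released implementation may differ):

```python
import torch
import torch.nn.functional as F

def change_confidence(emb_t1: torch.Tensor, emb_t2: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between embeddings of the same region at the
    two timestamps; higher values indicate a more likely change.

    emb_t1, emb_t2: (N, C) mask embeddings of N region proposals.
    """
    return -F.cosine_similarity(emb_t1, emb_t2, dim=-1)   # (N,)

# Toy usage: rank proposals by change confidence and keep those above a threshold.
emb_t1, emb_t2 = torch.randn(8, 256), torch.randn(8, 256)
conf = change_confidence(emb_t1, emb_t2)
order = torch.argsort(conf, descending=True)       # proposals sorted by confidence
changed = order[conf[order] > 0.0]                 # illustrative threshold of 0.0
```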

Strengths

  • The paper is well-written with only a few grammatical errors
  • The proposed method neatly avoids training, making the approach quite resource-efficient.
  • While the method is focused on Change Detection in Satellite Imagery, the proposed method could have more extended implications. This paper explores the idea of utilizing intra-image and inter-image similarities in SAM's embedding space to solve Change Detection in bi-temporal satellite images. This strategy can potentially aid in solving any vision tasks that require multi-temporal inputs.

Weaknesses

  • The paper lacks important ablation studies. For example: there is no experiment that shows the value of computing the change confidence scores bidirectionally as opposed to in a single direction
  • I think the example used in Figure 4 is not the best to demonstrate the efficacy of the model. It is hard to know if the model is segmenting all buildings or just the buildings that have changed (since all buildings are changed between the two images). It might have been better to show examples where the changes are more localized; for example: only a few buildings are missing in image 2 and the rest of the area is unchanged.

Minor Grammatical Errors

  • Ln 132-133. "we known"
  • Ln 195 "t denote"

Questions

Please address the concerns that are listed as weaknesses. A suggestion from my side would be to add some more qualitative results to the paper.

Limitations

The authors adequately discuss the limitations and societal impact of their work.

Author Response

To Reviewer SmiM

W1: ablation study for matching direction.

We have added this ablation study. The results are as follows:

| Direction | LEVIR-CD (F1 / Prec / Rec / mask AR) | S2Looking (F1 / Prec / Rec / mask AR) | xView2 (F1 / Prec / Rec / mask AR) | SECOND (F1 / Prec / Rec / mask AR) |
| --- | --- | --- | --- | --- |
| bidirectional | 23.4 / 13.7 / 83.0 / 32.6 | 7.4 / 3.9 / 94.0 / 48.3 | 13.4 / 7.6 / 59.3 / 27.8 | 44.6 / 30.5 / 83.2 / 27.0 |
| only from t to t+1 | 17.7 / 10.1 / 72.8 / 1.3 | 9.0 / 4.8 / 85.6 / 32.1 | 15.3 / 9.0 / 49.8 / 27.3 | 41.2 / 31.2 / 60.6 / 14.8 |
| only from t+1 to t | 23.6 / 13.6 / 88.7 / 35.9 | 8.1 / 4.3 / 79.5 / 19.4 | 12.3 / 7.6 / 32.7 / 6.7 | 46.1 / 34.1 / 71.3 / 14.9 |

We can see that the performance of single-directional matching is sensitive to temporal order; e.g., the mask AR of the two single-directional variants on LEVIR-CD is 1.3% and 35.9%, respectively. This is because class-agnostic change is naturally temporally symmetric, which is exactly the motivation for our bidirectional design. The result also confirms that bidirectional matching yields generally higher and more robust zero-shot change proposal capability.
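For concreteness, one plausible reading of the bidirectional design is sketched below: latent matching is run in both temporal directions and the surviving proposals are merged. This is only a sketch under that assumption (the exact combination rule in the paper and code may differ), and it reuses the mask_embeddings helper sketched in the response to Reviewer eUYB.

```python
import torch
import torch.nn.functional as F

def match_one_direction(props, feat_src, feat_dst, threshold):
    """Keep proposals (from the source image) whose embedding disagrees with
    the embedding of the same footprint in the other image."""
    emb_src = mask_embeddings(feat_src, props)   # helper sketched earlier (assumed)
    emb_dst = mask_embeddings(feat_dst, props)
    conf = -F.cosine_similarity(emb_src, emb_dst, dim=-1)
    return props[conf > threshold]

def bidirectional_match(props_t1, props_t2, feat_t1, feat_t2, threshold):
    # Union of proposals surviving matching in both temporal directions, so the
    # output does not depend on which image is treated as the pre-event image.
    fwd = match_one_direction(props_t1, feat_t1, feat_t2, threshold)
    bwd = match_one_direction(props_t2, feat_t2, feat_t1, threshold)
    return torch.cat([fwd, bwd], dim=0)
```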


W2: Figure 4 does not effectively demonstrate the model's efficacy because it is unclear if the model is segmenting all buildings or just the changed buildings.

Thanks for this good suggestion. We have updated Figure 4 as you suggested so that it includes unchanged and changed buildings simultaneously; please check Figure 3 in the rebuttal PDF we newly uploaded. It shows more clearly that AnyChange can detect building changes more accurately with the help of the point query.


minor grammatical issues: we will proofread thoroughly to fix them.

Comment

I thank the authors for addressing my concerns.

  • The ablation results suggest that the choice of direction can strongly impact performance (depending on the dataset) when using uni-directional matching. Hence, it helps motivate bidirectional matching.
  • The updated figure is more informative, as we can see that some of the buildings that exist at both t and t+1 are not detected as change. However, it appears this is not the case when moving from 1 to 3 point queries, as more false positives appear with 3 point queries.

After carefully reading the points raised by the other reviewers and the respective rebuttals, I agree that the paper still has some concerns, especially regarding the "degree of recognized significance" as pointed out by the reviewer MDhn. Hence, I will keep my original score of Borderline Accept for now.

Comment

We are happy to see that our responses addressed your previous W1 and resolved W2 to some extent. Thank you for your willingness to give feedback on our updated figure.


Regarding the observation that false positives increase when moving from 1 to 3 points:

[Potential reason] This reflects the trade-off between precision and recall. When we move from 1 to 3 points, more building changes are recalled, along with other changes, e.g., vegetation to bare land near buildings (right side of the figure).

[Quantitative results: overall performance improves] While the additional false positives lower precision, recall gains more. Moving from 1 to 3 point queries improves the overall performance measured by the F1 score, observed on three datasets (supported by Table 2: +3.6% on LEVIR-CD, +2.2% on S2Looking, and +3.0% on SECOND). Therefore, we think this trade-off is acceptable.


Regarding the subjective concern about the "degree of recognized significance" raised by Reviewer MDhn

In any case, we sincerely thank you for keeping your positive rating, even though you were influenced by Reviewer MDhn's subjective opinion about the "degree of recognized significance".

[Objective facts: our model has a novel zero-shot change detection capability and superior performance on conventional benchmarks] Although we demonstrated our model's novel capability of zero-shot change detection, evaluated our zero-shot model under two conventional settings (unsupervised and supervised), and achieved new SOTA results, these objective facts seem not to have persuaded the reviewers that our work achieves significant improvements in terms of new capability and new performance.

Objectively, as authors, we cannot and should not respond to this subjective opinion. We fully understand the difficulty of judging a work without any reference point. This is why Reviewer MDhn held reservations even while acknowledging the value of our work and confirming that his/her concerns had been addressed.

If a subjective opinion can outweigh objective results and facts, we believe this is frustrating for everyone; each of us is both a reviewer and an author.


Sincerely thank you again for your helpful comments, feedback, and warm support. Best wishes to you.

Comment

Again, thank you for your prompt responses. I would like to quickly clarify my stance. The authors have addressed my original comments to a large degree, and a borderline acceptance is my fair assessment of the paper at this point. There were concerns regarding novelty and significance expressed by multiple reviewers which still seem to be outstanding. I am certain there will be a fair discussion on this topic in the reviewer-AC phase. My previous comments were meant to suggest that I will wait for the further discussion phase before making any adjustments to my current score based on these concerns (whether positive, negative, or neutral).

The authors have done a great job defending their stance in the rebuttal phase and I am certain that will have a positive impact in the upcoming discussions. Best wishes.

Review (Rating: 3)

This paper introduces the AnyChange model, aimed at enabling zero-shot change detection in remote sensing imagery. The model builds upon SAM, utilizing a training-free adaptation method called bitemporal latent matching. This method leverages semantic similarities within and between images captured at different times to enable change detection without additional training. The paper demonstrates AnyChange's performance through extensive experiments, highlighting its effectiveness in various remote sensing change detection tasks.

Strengths

  • The problem and approach proposed in this paper are highly relevant to practitioners in the remote sensing field. Zero-shot change detection could significantly impact the field by enabling more flexible and scalable monitoring of environmental and infrastructural changes.
  • The experiments presented in Table 1 are well-designed, with reasonable baselines. The methodology appears robust, and the results in Table 4, showcasing the performance of AnyChange as a change detection engine, are particularly promising for practitioners.

Weaknesses

  • The technical contribution of bitemporal latent matching does not appear to be very high. Defining feature differences based on cosine distance between latent representations in SAM's hypersphere domain seems insufficient for a significant contribution. The novelty and uniqueness of this approach compared to existing cosine similarity-based methods, such as those in Růžička et al. (2022), are questionable. The primary difference appears to be the use of SAM’s latent representations, which may not be enough to claim substantial innovation.
    • Růžička, Vít, et al. "RaVÆn: unsupervised change detection of extreme events using ML on-board satellites." Scientific reports 12.1 (2022): 16939.
  • The paper's technical contributions seem too narrow for NeurIPS. The foundational observations Q1 and Q2 (lines 136 and 148) are only demonstrated with electro-optical images in the satellite domain, limiting the broader applicability of the proposed method to other domains or modalities.

Questions

  • How did you set up the dataset and model to prove Q2? Could you provide more details on the empirical setup and the specific configurations used to validate the semantic similarities between satellite images of the same location at different times?

Limitations

The authors point out important limitations, such as the vague definition of change and the lack of a concrete benchmark dataset.

Author Response

To Reviewer MDhn

W1.1: The novelty and uniqueness of this approach compared to existing cosine similarity-based methods (e.g., RaVAEn) are questionable.

Our AnyChange is fundamentally different from existing methods.

  • Zero-shot vs. unsupervised. Existing unsupervised methods need to train a model to obtain a visual representation; e.g., RaVAEn needs to train a VAE on a specific data distribution to extract features, while ours is a zero-shot method without any training. Thus, our method is more resource-efficient, which Reviewer SmiM also appreciated.

  • Interactive. Existing methods (including RaVAEn) do not support an interactive mode, while our method can be interactive and used as an efficient labeling tool (supported by Table 3), thanks to our point query mechanism, which builds on our bitemporal latent matching and SAM's point prompt mechanism. This means our method supports human-in-the-loop use, a point also appreciated by Reviewer eUYB.

  • Instance-level vs. pixel-level. Existing methods (including RaVAEn) output pixelwise change maps (raster), while ours produces instance change masks (polygons).

Our bitemporal latent matching differs from existing cosine similarity-based methods in three major points:

  • Computational unit (instance vs. pixel). Existing methods, e.g., RaVAEn, adopt pixel embedding, while we use instance mask embedding, leveraging SAM's advantage. Our ablation study (Table 1, SAM+CVA match vs. AnyChange) indicates that this factor is key to the performance difference.
  • Matching direction (bidirectional vs. none). Existing methods are pixel-based and do not consider matching direction. Ours is instance-based, for which the matching direction is a key design choice. The ablation study below (see also our response to W1 of Reviewer SmiM) shows that bidirectional matching performs better and is more robust.
  • Integration with SAM's promptability. Such integration has not been explored, or even made possible, by existing approaches, while we demonstrate a valuable integration with point prompts via our point query mechanism.
| Direction | LEVIR-CD (F1 / Prec / Rec / mask AR) | S2Looking (F1 / Prec / Rec / mask AR) | xView2 (F1 / Prec / Rec / mask AR) | SECOND (F1 / Prec / Rec / mask AR) |
| --- | --- | --- | --- | --- |
| bidirectional | 23.4 / 13.7 / 83.0 / 32.6 | 7.4 / 3.9 / 94.0 / 48.3 | 13.4 / 7.6 / 59.3 / 27.8 | 44.6 / 30.5 / 83.2 / 27.0 |
| only from t to t+1 | 17.7 / 10.1 / 72.8 / 1.3 | 9.0 / 4.8 / 85.6 / 32.1 | 15.3 / 9.0 / 49.8 / 27.3 | 41.2 / 31.2 / 60.6 / 14.8 |
| only from t+1 to t | 23.6 / 13.6 / 88.7 / 35.9 | 8.1 / 4.3 / 79.5 / 19.4 | 12.3 / 7.6 / 32.7 / 6.7 | 46.1 / 34.1 / 71.3 / 14.9 |

To be clear, we did not claim the use of cosine similarity as our contribution. On the contrary, we explicitly stated that cosine similarity is a suitable choice for measuring similarity (Lines 174-175).


W1.2: The primary difference appears to be the use of SAM’s latent representations.

We anticipated that reviewers might be concerned about this point. Therefore, we included two baselines, SAM + CVA Match and SAM + Mask Match, to ablate the impact of SAM's latent space. Our AnyChange and these two baselines use the same SAM latent representations; however, due to the different matching strategies, AnyChange is superior (see Table 1 and Ablation: Matching Strategy, Lines 251-261).


W2: The foundational observations Q1 and Q2 are only demonstrated with electro-optical images in the satellite domain, limiting the broader applicability of the proposed method to other domains or modalities.

  • Our method can also be used in the natural image domain; please check Figure 1 in the rebuttal PDF.

  • The idea of utilizing intra-image and inter-image similarities in SAM's latent space can potentially aid in solving any vision tasks that require multi-temporal inputs, which is appreciated by Reviewer SmiM.

  • The modality limitation comes from SAM itself rather than from our bitemporal latent matching or point query mechanism; our design itself is modality-agnostic.

  • Although our method is evaluated solely on optical satellite images, it still has a broad range of applications. The following five NeurIPS main conference papers all only focus on electro-optical images in the satellite domain.

    • [NeurIPS 2023] Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
    • [NeurIPS 2023] SAMRS: Scaling-up remote sensing segmentation dataset with segment anything model
    • [NeurIPS 2022] SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
    • [NeurIPS 2021] Spatial-Temporal Super-Resolution of Satellite Imagery via Conditional Pixel Synthesis
    • [NeurIPS 2021] LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation

Q1: How did you set up the dataset and model to prove Q2?

The images were chosen from the LEVIR-CD dataset. We manually label all buildings in the t2 images as ground truth. The model is SAM (ViT-H) with default configurations. For the t1 image, we randomly select one building point as a point prompt and run SAM inference to obtain the building mask. For the t2 image, we adopt grid points (segment-anything mode) to run SAM inference and obtain mask proposals. We compute all mask embeddings and then match all of t2's mask embeddings against t1's building mask embedding, thereby obtaining t2's building masks. Similarity is measured by cosine similarity, and thresholding uses the OTSU algorithm. The F1 score and recall are reported as metrics to confirm Q2.
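A minimal sketch of this matching-and-thresholding step (assuming scikit-image for the OTSU threshold; names and shapes are illustrative, not the released code):

```python
import numpy as np
from skimage.filters import threshold_otsu

def retrieve_buildings(building_emb_t1: np.ndarray,
                       proposal_embs_t2: np.ndarray) -> np.ndarray:
    """Match one t1 building embedding against all t2 mask-proposal embeddings
    and keep proposals above an OTSU-selected cosine-similarity threshold.

    building_emb_t1:  (C,) embedding of the prompted building mask in t1.
    proposal_embs_t2: (N, C) embeddings of SAM's grid-prompted proposals in t2.
    Returns a boolean array marking proposals retrieved as buildings.
    """
    a = building_emb_t1 / np.linalg.norm(building_emb_t1)
    b = proposal_embs_t2 / np.linalg.norm(proposal_embs_t2, axis=1, keepdims=True)
    sims = b @ a                   # cosine similarities, shape (N,)
    thr = threshold_otsu(sims)     # data-driven threshold, as described above
    return sims >= thr
```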

Comment
  1. I acknowledge the points raised in your rebuttal regarding W1.1 and appreciate the clarification on the differences between AnyChange and existing methods. I agree that the aspects of "zero-shot" and "interactive" change detection, along with the shift to instance-level analysis enabled by SAM, are valuable contributions to the field. However, I still hold reservations about whether the novelty associated with utilizing SAM as a feature extractor, essentially inheriting these advantages, is sufficient to meet the bar for a NeurIPS publication. While the enhancements stemming from bitemporal latent matching and its integration with SAM's promptability are valuable, the core methodological contribution seems to rely heavily on the inherent capabilities of a pre-trained SAM model, a significant portion of which appears to have been explored in existing works like [Chen et al., 2024] and [Oh et al., 2023].
  • [Chen, et al., 2024] Chen, Hongruixuan, Jian Song, and Naoto Yokoya. "Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM)." In IGARSS. 2024.
  • [Oh, et al., 2023] Oh, Youngtack, et al. "Prototype-oriented Unsupervised Change Detection for Disaster Management." In NeurIPS Workshop. 2023.
  2. Regarding W2, I agree that focusing solely on optical satellite images does not inherently diminish a paper's contribution. As you pointed out, the papers you listed demonstrate substantial novelty by leveraging specific characteristics of satellite imagery to overcome limitations in existing research; for instance, the adaptations made for self-supervised learning and the development of a novel super-resolution model directly address challenges and exploit opportunities presented by satellite data. My concern is that the current manuscript does not present a similar level of contribution in advancing the state of the art. While bitemporal latent matching offers a novel post-processing approach for change detection using pre-trained SAM outputs, its novelty and impact, when compared to other NeurIPS publications, seem insufficient to warrant acceptance. The manuscript's reliance on pre-trained SAM, without significantly addressing its limitations or exploring new frontiers within the context of change detection, weakens its position as a strong NeurIPS publication.
Comment

We sincerely thank you for your helpful feedback and for acknowledging our work's value. We have provided further clarification of your remaining concerns below; we hope it helps.

Point 1: The core methodological contribution seems to heavily rely on the inherent capabilities of a pre-trained SAM model, a significant portion of which appears to be explored in existing works like [Chen et al., 2024] and [Oh et al., 2023].

  • For a training-free method, relying on a foundation model is completely acceptable and widely accepted in our community; e.g., there are many training-free image editing methods based on Stable Diffusion. Our method depends on SAM; however, our contributions are orthogonal to the visual foundation model itself. It is precisely through the emergence of SAM that we can formally explore zero-shot change detection, which focuses on generalization to unseen change types and data distributions and is one of the most important problems in change detection.

  • The two mentioned papers indeed use SAM for change detection. However, they are completely different from our method in task, motivation, core methodology, experimental design, and conclusions.

    • [Chen et al., 2024] used OSM maps to prompt SAM for image-map change detection.
    • [Oh et al., 2023] used SAM to extract object masks on a single pre-event image as the basic unit of a voting post-processing step (see Sec 2.2 of their paper, "Refinement with SAM Model"). Applying this operation to their DINOv2-based change map is similar to what our baseline SAM + CVA Match does. We will cite their paper for credit.
    • Major difference 1: neither of them exploits intra-image and inter-image semantic similarities in SAM's latent space, which is one of our main claimed contributions.
    • Major difference 2: neither of them achieves zero-shot instance change detection or benchmarks both pixel- and instance-level performance.
    • Major difference 3: neither of them is interactive.

For Point 2, our motivation was simply to clarify that optical images have broad applications, citing five NeurIPS papers as evidence. Thank you for letting us know that your previous concern W2 about "limiting the broader applicability" has been addressed.


Regarding your further concern (this round) that the "current manuscript doesn't present a similar level of contribution in advancing the state-of-the-art":

  • [SOTA results] Our zero-shot performance (Table 1), unsupervised performance (Table 4), and supervised performance (see W3.3 of Reviewer n4V1) all achieved state-of-the-art.

  • To judge the contribution of research work, there is a fundamental difference between a so-far unexplored task (zero-shot change detection) and those well-established tasks (MAE-like self-supervised learning, and super-resolution).

  • The mentioned self-supervised learning and super-resolution works in the satellite domain all have well-established baselines and benchmarks; e.g., SatMAE improves over MAE (the baseline) with its design, and MAE already provides a successful roadmap, i.e., masked image modeling and benchmark protocols (linear probing, full fine-tuning).

  • [new frontier] Zero-shot change detection remains an unexplored yet very important problem in the satellite domain without any well-established baselines and benchmarks. Our work provides the first zero-shot change detection roadmap, including problem formulation, baselines, state-of-the-art models (supported by Tables 1 and 4), and benchmarks.

  • Last but not least, also as a NeurIPS reviewer, I subjectively think it is impossible to judge the contributions between publications on different topics. In the topic of change detection, we can confidently claim our work is groundbreaking.


Thank you once again for your time and effort. If you have any concerns, please feel free to share them here.

Comment

Thank you for your detailed and thoughtful responses to my concerns. I appreciate the further clarification provided, particularly regarding the distinction between your work and the cited papers, as well as the emphasis on the unexplored nature of zero-shot change detection in the satellite domain.

While I still have some reservations regarding the overall significance of the contribution, as I pointed out, I recognize the value of exploring and establishing a baseline for zero-shot change detection in the specific context of remote sensing. Please understand that this is not a dichotomous issue of whether there is a unique contribution or not, but rather a question of the degree of recognized significance. Therefore, I will not raise further objections and am willing to conclude this discussion. I believe the points raised in both the review and the rebuttal will contribute to a balanced and informed discussion during the reviewer-AC discussion phase.

Thank you again for your efforts in addressing my concerns.

Comment

Good to know our responses addressed your concerns. We also fully understand your stance and opinion. Overall, this has been a very beneficial discussion for us. We appreciate and salute every responsible reviewer.

Comment

We, too, thought our discussion had concluded on a good note. We hope you understand that we have to respond to other reviewers when they have remaining concerns.

We have no intention of debating what the general standard of NeurIPS is; it is always a relative concept.

[Relative contributions] We anticipated that, without a reference/baseline for zero-shot change detection, it would be difficult for reviewers to make judgments. Therefore, apart from our custom zero-shot baselines, we also placed our zero-shot model in a common context (the unsupervised and supervised settings) to evaluate its effectiveness; Tables 3 and 4 support this point.

[Community contribution] Our model will serve as an anchor point for zero-shot change detection, which can help avoid such difficult review processes in the future. This will benefit the change detection community.

Our focus remains on opening the door to zero-shot change detection, one of the most important problems in the multi-temporal remote sensing community. We identified this problem and resolved it via what we consider an extremely clean method (SAM with our training-free adaptation). This is an important baseline for zero-shot change detection and should not be ignored.


Are there good examples of studies whose contribution is recognized even though their model performance stems from a foundation model?

Exactly, countless outstanding works follow this paradigm in the generative AI community. For example,

  • ControlNet (ICCV 2023 best paper) is a training-based adapter (essentially zero-convolution layers) for Stable Diffusion. Its performance stems from the decision to employ Stable Diffusion for conditional image generation, yet ControlNet exhibited outstanding and unprecedented control ability in its community. Please note that adding control, in itself, is also not a unique technical proposal in the image generation community.

Likewise,

  • Our bitemporal latent matching and point query mechanism constitute a training-free adapter for SAM. Our performance stems from the decision to employ SAM for change detection, yet our work exhibits superior and unprecedented zero-shot change detection and interactive capability in our community. Our unique technical insight is to induce a novel change detection capability from SAM, which is not common sense in the remote sensing community. Without our method, there is no bridge between change detection (a multi-temporal task) and SAM's interactive, zero-shot, instance-level capability (demonstrated on single-image tasks).

We fully respect the reviewer's opinion and maintain our own, supported by our evidence.

Finally, good luck to everyone.

Comment

While I intended to conclude my discussion on this paper in my previous comment, I have realized, after observing other reviewers' discussions, that my review and subsequent discussions with the authors have been frequently referenced. Therefore, I believe it is necessary to clarify my perspective further. I also feel compelled to add this comment for the sake of transparency and clarity, as I will maintain the following stance at the beginning of the reviewer-AC discussion phase.

Firstly, I acknowledge the authors' formulation of zero-shot change detection, the benchmarks and robustness analysis, and the technical soundness of the proposed "Bitemporal Latent Matching". Moreover, after reading the discussions with the other reviewers, I have solidified my opinion that the authors' contribution on this point is clear.

Secondly, despite this, and despite the discussions I have had with the authors, I still believe that the majority of the advantages of the 'interactive', 'automatic', 'zero-shot', and 'instance-level' change detection approach, which constitutes a significant portion of this paper's strengths, stem from the decision to employ SAM for the change detection task, rather than being unique technical proposals of this research. This decision itself is unprecedented (though related works exist) and holds novelty not found in existing literature. However, the authors and I have a fundamental disagreement on whether this aspect can be considered a sufficient technical contribution to the field of change detection that meets the general standards of NeurIPS.

Author Response

We are sincerely grateful for the reviewers' efforts and their constructive feedback. We appreciate the reviewers' acknowledgment that:

  • [pioneer] Our work is one of the first works to propose zero-shot change detection of remote sensing imagery (eUYB).
  • [novel] Our work is novel and interesting (eUYB, n4V1).
  • [broad impact] Our idea can potentially aid in solving any vision tasks that require multi-temporal inputs (SmiM).
  • [solid] comprehensive experiments (n4V1, MDhn)
  • [practical] robust methodology and promising results for AnyChange as a change data engine (MDhn, eUYB)

Our work aims to resolve an open problem, zero-shot change detection, which has not been explored so far in the literature. Our contributions include:

  • [new task] Our work is the first to provide a problem formulation, benchmarks, and models for zero-shot change detection.
  • [new roadmap] We propose a training-free roadmap built upon SAM to achieve zero-shot change detection and point out its key designs.
  • [new tool] Apart from superior zero-shot performance, our AnyChange model is promptable/interactive/human-in-the-loop and thus can be used as a more efficient change label tool.
  • [new results] We demonstrate our AnyChange with better zero-shot performance, SOTA unsupervised change detection performance, and comparable supervised change detection performance.

We have provided detailed point-by-point responses to each reviewer; please check our rebuttal in your Official Review section. We also provide a PDF file containing four figures to better address your concerns; please remember to download it.

Final Decision

The authors introduce the AnyChange model, which enables zero-shot change detection in remote sensing imagery. This model adapts the Segment Anything Model (SAM) using a training-free method called bitemporal latent matching, leveraging semantic similarities within and between images to identify changes without additional training. The authors demonstrate the effectiveness of AnyChange through extensive experiments, highlighting its potential as a valuable tool for remote sensing change detection tasks.

While the idea is intriguing, several reviewers raised concerns about the paper's novelty, methodology, and experimentation. Specifically, they questioned the applicability of the proposed method to small-size change detection, among others, citing a strong basis for their criticism. The authors addressed these comments during the rebuttal phase, but this did not result in a significant improvement in the overall rating from the reviewers.

Although I agree with the authors' claim that zero-shot change detection is introduced for the first time in their work, the overall novelty is questionable, since the method proposed in this paper is a specific case of change vector analysis (or clustering) with vectors extracted from SAM embeddings and region proposals.