Towards flexible perception with visual memory
We build a simple visual memory for classification that scales to the billion-scale regime, enabling a number of capabilities like unlearning and attributing model decisions to datapoints.
Abstract
Reviews and Discussion
This paper introduces a retrieval-based visual memory framework, challenging the traditional paradigm of deep learning models storing knowledge in static ("stone") weights. Instead, it separates representation (via pre-trained embeddings) from memory (through fast nearest-neighbor search), enabling a dynamically editable model. The proposed method offers a scalable and flexible solution to lifelong learning, unlearning, and dataset pruning. By treating classification as a retrieval problem, the authors demonstrate near state-of-the-art performance on large-scale datasets (ImageNet, JFT) with benefits such as control over decision-making. This approach is presented as a significant step toward rethinking how knowledge should be stored in deep learning, moving away from the limitations of static weight-based models.
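For concreteness, the core mechanism as I understand it boils down to something like the sketch below; the plain cosine-similarity kNN vote and all names here are my own illustration, not the authors' exact implementation.

```python
# Illustrative sketch only (not the authors' code): classification as retrieval
# over a visual memory of pre-computed embeddings.
import numpy as np

def build_memory(embeddings, labels):
    """Memory = a matrix of L2-normalized feature vectors plus their integer labels."""
    feats = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return feats, np.asarray(labels)

def classify(query_emb, memory_feats, memory_labels, k=10):
    """Retrieve the k most similar memory entries and take a majority vote."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = memory_feats @ q                      # cosine similarity to every memory entry
    nn_idx = np.argsort(-sims)[:k]               # indices of the k nearest neighbors
    votes = np.bincount(memory_labels[nn_idx])   # plain majority vote over integer labels
    return int(votes.argmax()), nn_idx           # prediction plus neighbors for attribution
```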
Questions for Authors
Regarding the last point (m.2), one question concerns whether this method could be used to remove biases by targeting high-level concepts rather than merely individual data points.
Another question: can we understand a data point's influence as the number of times it is used in / acts on a decision?
Related to this, could this score be used to select for better quality/diversity in the dataset?
Claims and Evidence
The retrieval-based approach enables truly flexible model updates, allowing knowledge to be added or removed without retraining. The authors demonstrate that new classes can be integrated, and unwanted data points can be unlearned by removing them from memory. The method also scales effectively, achieving strong ImageNet top-1 accuracy, with Gemini re-ranking further enhancing performance.
However, I remain unconvinced by the claim that this approach inherently leads to interpretable decision-making. While retrieving nearest neighbors provides insight into what influenced a prediction, interpretability should go beyond visualization. I would like to see a falsifiable human experiment where a human can predict how a modification to the memory will affect the model’s behavior and verify whether the system responds as expected. This kind of simulatability (as discussed in works by [1,2,3]) would provide stronger evidence that the model is genuinely interpretable in a way that users can act upon.
That said, aside from this concern, I found that the empirical results convincingly support the paper’s core claims.
[1] Finale Doshi-Velez and Been Kim. “Towards a rigorous science of interpretable machine learning”
[2] Julien Colin, Thomas Fel, Rémi Cadène, and Thomas Serre. What i cannot predict, i do not understand: A human-centered evaluation framework for explainability methods.
[3] Peter Hase and Mohit Bansal. “Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?”
Methods and Evaluation Criteria
Yes, except for the interpretability claim, the experiment section is extensive and extremely well done.
Theoretical Claims
Yes, no issue.
Experimental Design and Analysis
The experiments cover large-scale datasets (JFT, IN), analyses of memory size versus accuracy, ablations on retrieval and demonstrations of unlearning and dataset pruning.
Supplementary Material
Yes, all.
Relation to Broader Literature
Excellent; I wish more papers had such an excellent related work section.
Essential References Not Discussed
Nothing in my opinion.
Other Strengths and Weaknesses
I should say that this is one of the most exciting papers I’ve seen this year in this space. It opens up an entirely new direction that allows flexible, "interpretable?", and, more importantly, controllable vision decisions. The work is well-executed, highly relevant, and has enormous potential for future extensions (e.g., concept-based unlearning, fairness improvements, multimodal retrieval; see my questions).
With minor improvements, this could become a foundational paper in AI research.
To summarize the strengths, I found that:
- the proposed method scales efficiently to billion-scale datasets while maintaining strong performance, making it highly practical for real-world deployment.
- controllable: decisions are based on explicit memory retrieval, allowing users to inspect and potentially modify model behavior, though further validation of interpretability is needed.
- unlearning and dataset pruning made simple: removing knowledge is as simple as deleting data from memory, attacking a key challenge in unlearning and data curation.
- strong generalization across domains: the model demonstrates impressive out-of-distribution robustness, suggesting retrieval-based architectures can better adapt to changing data distributions.
However, even though I really like this work, I think that it could be improved. Here are, in my opinion, the weak points of the paper, which I will group into major problems (M) and minor problems (m). Major Concerns (M):
- M.1. Claim on Interpretability: see my previous point, but I think the lack of a formal user study is a major issue if you claim interpretability. The model's explainability seems intuitive, but without user studies, it is uncertain whether humans can predict or act on model behavior effectively.
- M.2 No discussion of shortcut learning and bias removal: the first extension or application I would think of would be removing shortcuts. The paper does not address whether retrieval-based learning can still exploit dataset shortcuts (I would tend to think yes), nor whether it can actively remove biases.
- M.3 Dependence on DinoV2: DinoV2 is an exceptionally strong vision model, particularly effective for retrieval tasks due in part to the Koleo loss, which was specifically chosen by the authors of DinoV2 for this purpose (see: https://github.com/facebookresearch/dinov2/blob/main/dinov2/loss/koleo_loss.py). This raises the question of how much the results depend on the model's embeddings. Could alternative losses, or even different model architectures, be designed to perform even better?
Now for the minor points (m):
- m.1 Robustness to adversarial perturbations: do you think adversarial perturbations still affect retrieval? I would tend to say yes, but I am curious.
- m.2 A final remark, more of a suggestion for discussion: should removal or unlearning occur at the individual data point level or at the concept level? In other words, when seeking interpretable decisions, should interpretability be considered at the data point level or at the conceptual level? Could we instead compute distances in an overcomplete concept space [1] and design a more fine-grained decision-making process within this space? This approach would mean modifying only part of a point’s embedding rather than entirely removing a data point. Additionally, storing a sparse embedding of a point could be significantly more efficient and effective: for example, TopK sparse autoencoders (SAEs) have demonstrated the ability to reconstruct DINO representations with nearly 80% R^2 using as few as 10 concepts [2].
[1] Towards Automatic Concept-based Explanations
[2] Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Other Comments or Suggestions
See my previous points.
Dear Reviewer xHyy,
Thanks for your review, we’re humbled to hear you appreciated the “extremely well done experiments” and found the work “highly relevant with enormous potential for future extensions”, & “one of the most exciting papers I’ve seen this year in this space” that “could become a foundational paper”.
Falsifiable experiment on interpretability:
We fully agree that interpretability should go beyond visualization. First of all, we don’t mean to imply our system is “fully interpretable”, rather that it is “more interpretable” compared to a standard black-box model (and we will revise the writing to make this distinction clear). Secondly, we appreciate your point that interpretability-related claims and statements are best made through a falsifiable human experiment. We outlined a possible experiment at https://ibb.co/k2jnGH85; it’s aimed at quantifying how much, if at all, a memory-based system improves interpretability as operationalized by helping humans predict model behavior. Before running the experiment, we’re keen to hear your thoughts!
Shortcut learning / bias removal:
We’re happy to add a discussion on the connection to shortcut learning and bias removal. If the image encoder exploited a shortcut during training, this will influence image similarity and thus nearest neighbor selection. There are cases where bias removal is possible, and cases where it is impossible:
- [removal impossible] If the encoder is biased towards textures, a test image of “cat shape + elephant texture” [cf. Geirhos et al. 2019, Figure 1] would pull up elephant nearest neighbors, and removing all elephants from memory would come at an unreasonably high cost (not being able to identify elephants anymore). Here, encoder-level debiasing is necessary.
- [removal possible] If only a part of the memory is biased, memory-level debiasing is feasible. If “fingers” are shortcut predictors for “fish” (due to a dataset bias from proud fishermen holding their catch into the camera, cf. Brendel et al. 2019, Figure 3), then this bias could indeed be rectified by removing the biased “fish+finger” subset from memory. Afterwards, images with “fingers” would no longer lead to “fish” nearest neighbors, demonstrating successful bias removal.
DinoV2 / Could alternative losses, or even different model architectures, be designed to perform even better?
We agree that DinoV2 is a strong vision model, facilitating retrieval. Since our approach is modular, any encoder can be used in a plug-and-play fashion within our framework. Progress on visual representation learning (architectures, losses, etc.) can generally be expected to improve retrieval. For example, text-to-image generative models have proven to learn useful representations for open-world, open-vocabulary tasks and could become suitable image encoders for retrieval-based tasks as well. Better losses and architectures developed by the field will directly translate into better visual memory systems.
Adversarial perturbations:
We share your intuition. As long as adversarial attacks can fool featurizers, neighbor selection can be attacked through perturbations as well (e.g., making the model think a dog image is an airplane, and thus retrieve airplane neighbors, leading to misclassification). Neighbor selection is non-differentiable, but black-box or surrogate-model-based attacks (e.g., replacing neighbor selection with a differentiable Gumbel-Softmax) would likely work.
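To make the surrogate idea concrete, here is a hypothetical numpy sketch of a differentiable relaxation of neighbor selection; the function names and the specific Gumbel-Softmax relaxation are illustrative assumptions on our part, not an attack we implemented, and in practice one would port this to an autodiff framework to obtain gradients.

```python
# Hypothetical sketch: relax hard top-k retrieval into a Gumbel-Softmax over
# similarities, giving class scores that a gradient-based attack could target.
import numpy as np

def soft_class_scores(query_emb, memory_feats, memory_onehot, tau=0.1, rng=None):
    """memory_onehot: (N, C) one-hot labels of the memory entries."""
    rng = rng or np.random.default_rng()
    q = query_emb / np.linalg.norm(query_emb)
    sims = memory_feats @ q                              # (N,) similarity to each memory entry
    gumbel = -np.log(-np.log(rng.uniform(size=sims.shape)))
    logits = (sims + gumbel) / tau
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                             # soft "neighbor selection"
    return weights @ memory_onehot                       # (C,) differentiable class scores
```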
Suggestion for discussion: should removal or unlearning occur at the individual data point level or at the concept level?
Concept-level manipulation (adding/unlearning entire concepts) is a really exciting idea that we hadn’t considered so far! Since any similarity space can be used for neighbor selection, including ACE or SAE-based spaces, this is indeed possible. Depending on the use case, both data-based and concept-based manipulation can be desirable: if a single image is corrupted / has licensing issues then unlearning the individual image is the right choice; if a concept is biased then concept-level changes seem preferable. We’ll be happy to add a brief discussion on this exciting possibility.
Data point influence = number of times it's used in the decision?
Yes. In the extreme case, if a sample never shows up as a neighbor, its influence would be zero. Of course, influence isn’t always a good thing - it really depends on whether the sample contributes to (= influences) a correct or a wrong decision. Fortunately, in contrast to a traditional model, that’s very easy to keep track of in a memory model, and can thus be exploited for reliability-based weighting as in Table 3.
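As an illustration of what this bookkeeping could look like (the names and the simple error-rate weighting below are a simplified stand-in, not our exact reliability formula):

```python
# Hypothetical bookkeeping sketch: count how often each memory entry is retrieved
# as a neighbor, and how often it backed a wrong prediction, then derive a
# reliability weight from those counts.
from collections import defaultdict

class InfluenceTracker:
    def __init__(self):
        self.used = defaultdict(int)      # times an entry was retrieved as a neighbor
        self.wrong = defaultdict(int)     # times it was retrieved while the prediction was wrong

    def record(self, neighbor_ids, prediction, ground_truth):
        for idx in neighbor_ids:
            self.used[idx] += 1
            if prediction != ground_truth:
                self.wrong[idx] += 1

    def reliability(self, idx):
        """1.0 for never-wrong entries, decaying toward 0 as the entry backs more errors."""
        if self.used[idx] == 0:
            return 1.0                    # unseen entries keep a default weight
        return 1.0 - self.wrong[idx] / self.used[idx]
```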
Could this score be used to select better quality/diversity on the dataset?
Likely. For ImageNet-A, in ~40% of cases the DinoV2 label was assessed as being better/more suitable than the original dataset label, thus this could be used to identify label issues and thereby improve dataset quality.
Super clear and thoughtful. Quick reactions to each point:
- interpretability: your experiment idea looks great, and I appreciate the more careful wording.
- shortcut/bias: the examples are nice and helpful, and the encoder vs. memory-level distinction makes sense.
- dinov2: I agree about the modularity; still, I would have loved to see an ablation trying to explain why some models/losses lead to better results in this configuration.
- adversarial: agreed, makes sense, and glad you addressed it realistically.
- concept vs point unlearning: really cool idea, glad you're open to it; it would be nice to mention it.
- influence/data quality: useful point, and I would love to see follow-up on this.
Overall, I still think it's a great paper and wish the authors best of luck with the acceptance!
Dear Reviewer xHyy,
Thanks for your response and your input!
Regarding the human experiment testing whether humans can better predict model behavior with a memory system, as opposed to a standard black-box model: We now completed the experiment exactly as described in https://ibb.co/k2jnGH85.
Given 4 label choices (guessing accuracy 25%), human accuracy is 56% in the case of black-box predictions (no neighbor information). With access to four nearest neighbor images from our memory-based system (just the neighbor images but not their labels), human accuracy is at 83%. This represents an absolute improvement of +27% and a relative improvement of +67% in human prediction accuracy, providing strong, falsifiable evidence in favor of the statement that a memory-based model is more interpretable.
Thanks again for suggesting this excellent experiment, which we will incorporate into the camera ready version.
Experimental details:
- accuracy difference is statistically significant (p < 0.001)
- featurizer = DinoV2 ViT-L14 (i.e. the best performing model)
- dataset: randomly selected ImageNet-A test images
- nearest neighbors for condition B: from ImageNet-train
- 4 label choices per trial including ground truth label, model-predicted label (if different), and the remaining 2-3 labels were plausible alternatives based on top CLIP predictions for the test image. Label order randomized.
The authors tackle the question whether splitting the ‘knowledge’ a neural model has to acquire into 1) a (small) set of learnt parameters (encoder) combined with 2) a non-parametric memory can prove superior to the classic “only-learnt” parameter approach – somewhat akin to what has been successfully pursued in the language domain. They demonstrate that approaching the image classification task via a similarity search over a large datastore can indeed yield a range of advantages, especially around flexible addition and removal of specific samples – as well as better attribution during decision making which improves interpretability.
Questions for Authors
- Many other datastore-based approaches use approximate kNN search; I’d like the authors to comment on whether they think their findings would roughly translate to approximate retrieval (e.g. at recall 0.85+), or whether this could severely compromise results;
- Figure 3 starts at 1K memory, showing the performance for 1 sample per class; This is highly dependent on the sample, so I’d like to know if the authors have tried to use one or a few prototypes instead; and if so, how this would affect the performance?
- Do the authors have any insights/hypotheses which other visual tasks might benefit most from a datastore, beyond image classification and generation? Could there be ways to leverage this e.g. for applications like segmentation, detection, and the like?
More on the ‘experience’ side:
- Given that encoders trained on large-scale data in a self-supervised manner capture a vast variety of data characteristics, I’d like to know whether the authors encountered any degradation when moving to slightly more specialised datasets like iNaturalist – or potentially even more specialised ones like Medical Applications, Satellite Images, etc;
Do the authors have an intuition how translatable the features still are, and whether an automatic ‘uncertainty’ threshold (e.g. based on distance or distribution of neighbours) would be useful to detect issues in generalisation?
TLDR; The paper does present a number of valuable insights (especially the rank-voting strategy and its associated behaviour), hence my rating – however, as previously mentioned throughout the other parts: I do perceive many of the findings as somewhat unsurprising given previous works and the related success of retrieval-based methods in the language domain, as well as the history in 'classical' computer vision (based on simpler features) -- making the 'novelty' part in terms of contributed insights rather limited;
This is obviously a subjective experience, but: As mentioned previously, some quite interesting analyses and (more) surprising findings are placed/hidden in the appendix – and might deserve a bit more ‘spotlight’, or at least a reference/hint in the main paper.
Update post-rebuttal:
Main questions have been addressed; Raising my score from 3 to 4
Claims and Evidence
The major claims around flexibility of data addition & removal, as well as improved interpretability are justified and substantiated by evidence/insights throughout the experiments;
Only slight criticism would be that depending on the kind of datastore used, adding and/or removing samples might require a rebuild of the index/tree/graph – a caveat which could be worth mentioning, although it is likely amortized when compared to any other approach that requires retraining.
All claims around simplicity combined with performance are certainly well justified and substantiated.
Regarding the title: "Perception" might be a bit overclaiming, as the only task which is demonstrated here is 'image classification'; and it is debatable if this alone justifies 'perception'.
Methods and Evaluation Criteria
While the selection of methods as well as alternatives for ablation purposes is well chosen to foster simplicity, the evaluation is exclusively concentrated on image classification.
This is a valid choice, but somewhat limits the insights that can be gained: the ability to classify images based on their nearest neighbours is pretty well known, and has been extensively used across many tasks (e.g. few-shot learning via prototypical networks), even going back to 'classical' computer vision problems based on SIFT and other (simple) features; so the insights how having access to this much larger datastore could impact other (more complex) applications like generation would have been desirable.
Theoretical Claims
No specific theoretical claims present in this work.
Experimental Design and Analysis
As previously mentioned, the limitation to image classification is somewhat understandable to provide more detail on ablations of individual components like datastore size and voting metric, but unfortunately also limits the insights gained.
Given the previous use of kNN for classification in the literature (albeit with smaller sets of reference samples), it is not extremely surprising that this works well; especially since a powerful encoder like DinoV2 is used – which is explicitly trained to capture various attributes in images based on similarity, and known to provide good and expressive feature representations.
Personally, I found the appendix of the paper much more insightful:
Appendix G demonstrates the dependencies between attributes of a related but novel class/species and other classes in the dataset – and how step-wise addition of other exemplars of the same species influences the classification across all levels of the taxonomic hierarchy.
Appendix P shows a compositionality analyses, which might also spark new ideas for the use of nearest neighbours for more complex multi-object visual settings.
The ablations are, however, well chosen and provide sufficient insights into the crucial components of the approach.
Supplementary Material
The supplementary material in the form of the appendix nicely complements the manuscript and provides not only additional results but, as previously mentioned, entirely new insights that I think would deserve more visibility – especially Appendix G (and to some extent O & P).
I have not checked the code.
Relation to Broader Literature
Relation to broader literature is established, both in the introduction as well as related works section – however, the authors could improve in actually expressing what is different in their own work, as the related works section is currently mainly listing other works w/o contrast to this manuscript’s proposed approach.
Essential References Not Discussed
None that come to mind, the related works section provides a top-level but sufficiently broad list of related areas & works;
Other Strengths and Weaknesses
Strengths:
Originality & Significance:
- Replacing the classic parameter-based neural memory through similarity-based search is an important area that has shown promise in the past as well as more recently in other fields like language; so this work provides a timely analysis in the vision space
- Analyses across aggregation methods as well as influence of #neighbours provides helpful insights for future methods building on similar structures (e.g. consistency in performance of aggregation schemes demonstrated in Tables 4-8)
- Fig 3. / Section 3.2 supports the findings obtained in the language domain, i.e., that a small model with a larger memory can be competitive with a bigger model
Clarity:
- The paper is well written and easy to read and follow; many additional supporting analyses are moved to the appendix, so the paper provides a good level of depth while remaining easy to follow
Weaknesses:
- One key weakness that, however, seems unavoidable is the reliance on a pre-trained encoder to compress the raw images for efficient similarity search; This always raises the question about the applicability in situations with large domain gaps and actual open-world settings where genuinely ‘novel’ designs and/or materials are encountered.
- Analysis in terms of kNN is mainly focused on the ‘ideal’ recall setting; However, many other popular kNN retrieval algorithms perform approximate retrieval; See questions
- Although 'perception' is claimed in the title, the only task demonstrated in the paper is image classification.
- Main difference to other 'classical' kNN-based methods is the use of a large datastore and powerful encoder -- both, however, are known to work well in other closely related areas like e.g. NLP; so the number of 'novel surprising' insights is quite limited given the 'classification-only' setup of the experiments
Other Comments or Suggestions
Comment: I personally really enjoyed the analysis on iNaturalist presented in Appendix G as it provides genuinely ‘new’ and interesting insights; as well as the compositionality analysis (App. P).
Dear Reviewer 4C32,
Thank you for your helpful comments. We’re happy to hear you found our work “well justified”, appreciated the “insightful experiments & valuable insights” (even though some of them were admittedly a bit buried in the appendix), and described it as “a timely analysis in the vision space”.
Beyond classification, which other visual tasks might benefit from a datastore? Ways to leverage for segmentation, detection?
It’s possible to extend the approach to other tasks. One can pool features into multiple embedding clusters instead of a single cluster using classical or learned clustering methods. As a proof of concept, we tested object segmentation based on a visual memory of DinoV2 features; visualized here: https://ibb.co/92dM0B2. This puts features from a single image into memory (a car in the example) based on 8 feature clusters and uses these to identify similar features in the second (test) image, thereby creating a segmentation mask. Such multi-vector representations of images (which can be expanded to image + text) are a natural extension to our work, enabling tasks such as object retrieval or detection from images containing multiple objects, or providing coarse patch-level semantic segmentation. Finer-grained segmentation masks can be obtained with further training of the pooling methods. We hope this provides an intuition for how other tasks can be approached, and we will add the proof of concept example to the appendix. If other tasks are tackled, our main argument about the increased flexibility of a memory-based approach (and the benefits / capabilities enabled by this flexibility) directly translates to those tasks as well (and could be plugged into canonical detection architectures like Fast-RCNN).
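A rough sketch of this proof-of-concept pipeline (the cluster count, threshold, and names are illustrative choices, not the exact code behind the linked visualization):

```python
# Illustrative sketch: coarse segmentation by matching patch features of a test
# image against clustered patch features stored in memory.
import numpy as np
from sklearn.cluster import KMeans

def build_patch_memory(ref_patch_feats, n_clusters=8):
    """ref_patch_feats: (num_patches, dim) patch embeddings of a reference image."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(ref_patch_feats)
    centers = km.cluster_centers_
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)

def coarse_mask(test_patch_feats, centers, grid_hw, threshold=0.6):
    """Mark test patches whose cosine similarity to any memory cluster exceeds the threshold."""
    feats = test_patch_feats / np.linalg.norm(test_patch_feats, axis=1, keepdims=True)
    sims = feats @ centers.T                 # (num_patches, n_clusters)
    mask = sims.max(axis=1) > threshold      # (num_patches,)
    return mask.reshape(grid_hw)             # e.g. a (H/14, W/14) patch grid for a ViT
```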
Appendix nicely complements the manuscript but provides new insights that would deserve more visibility – esp. Appendix G (& to some extent O & P):
Thanks for the comment, that’s great to know. Since ICML's camera ready version allows for an additional page, we’re able to move the iNaturalist experiment (app. G) to the main paper as well as reference and highlight the other two sections more than we currently do.
Changing samples might require a rebuild of the index:
Agreed, we will add this to the limitations section (page 8). Regarding the two search approaches described on page 3, adding/removing samples is trivial for approach #1 (GPU/TPU matmul e.g. on ImageNet-scale data; here one can add or drop a row from the num_images x num_features matrix) but for approach #2 (scalable nearest neighbor search index for JFT) this would indeed require adapting the index, though the amortized cost is low.
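To illustrate why approach #1 makes editing trivial (variable names are ours, for illustration only):

```python
# Minimal sketch: for the brute-force matmul variant, the memory is a
# num_images x num_features matrix, and editing it is just a row operation,
# with no retraining and no index rebuild.
import numpy as np

def add_entry(memory_feats, memory_labels, new_feat, new_label):
    """Append one feature row and its label."""
    feats = np.vstack([memory_feats, new_feat[None, :]])
    labels = np.append(memory_labels, new_label)
    return feats, labels

def remove_entry(memory_feats, memory_labels, idx):
    """Drop one row and its label."""
    return np.delete(memory_feats, idx, axis=0), np.delete(memory_labels, idx)
```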
Related work: suggesting to also express what’s different.
Valid point, we’re happy to incorporate this.
Reliance on a pre-trained encoder:
Indeed, current encoders are limited. Since our approach is modular, any encoder can be used: as soon as more general, open-world encoders are developed they can simply be plugged in; even encoders based on generative models could become a possibility (given that generative models are a lot more open-world, open-vocab). Thus our approach is orthogonal to encoder choice. That said, even with current encoders we see promising success based on adding novel classes (NINCO experiment in Section 3.1).
Innovation:
While we provide technical improvements like RankVoting, we fully agree that we build on well-established methods with a long history in ML and seek to be very transparent about this throughout the paper. Instead, our focus is a broad evaluation of flexible capabilities (including attribution, flexible adding/removal, flexibly increasing granularity like on iNaturalist, …). We believe there is community interest in seeing solid evaluations.
Approximate retrieval:
Great question! Approximate retrieval increases retrieval speed at the cost of recall errors. Based on data from Appendix D, approximate retrieval would not significantly degrade results. Let’s assume approximate retrieval leads to different neighbors (compared to NNs from exact retrieval). In the best case, those neighbors are still from the same class; thus nothing changes. Worst case, they’re from a different class i.e. their label is now misleading. Based on Figure 9, we know that our approach can handle up to 60% label corruption (!) without degradation of performance; thus as long as k>>1 our approach is very robust to approximate retrieval. We’ll add this discussion.
Would an automatic ‘uncertainty’ threshold (e.g. based on distance) be useful to detect generalisation issues?
Based on the analysis from Appendix H, OOD data does indeed lead to higher mean+median nearest neighbor distances; thus a distance threshold would work well. There are a few works on kNN-based outlier detection (e.g. https://proceedings.mlr.press/v162/sun22d/sun22d.pdf).
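A hypothetical sketch of such a distance-based check (the mean-distance statistic and the threshold are illustrative choices, not a method from the paper):

```python
# Hypothetical sketch: flag a query as OOD when its mean cosine distance to the
# k nearest memory entries exceeds a threshold calibrated on in-distribution data.
import numpy as np

def mean_knn_distance(query_emb, memory_feats, k=10):
    q = query_emb / np.linalg.norm(query_emb)
    sims = memory_feats @ q
    top_sims = np.sort(sims)[-k:]            # similarities of the k nearest entries
    return float(np.mean(1.0 - top_sims))

def is_ood(query_emb, memory_feats, threshold, k=10):
    return mean_knn_distance(query_emb, memory_feats, k) > threshold
```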
Kindly let us know if you have any further questions, and thanks again for the great suggestions.
I'd like to thank the authors for their detailed and well-structured response.
My main questions have been addressed.
While the concern in terms of novelty of the underlying method I expressed in my review still remains, the authors do a great job in providing detailed insights into a variety of aspects -- hence, I do think this paper is a valuable addition to the conference, and I've updated my rating accordingly.
Thanks for letting us know and for increasing your score, we appreciate it!
The authors observe that it is hard to edit knowledge acquired by deep models during training, because this knowledge is encoded in a vast number of interconnected weights. To address this issue, they suggest keeping a pre-trained model frozen, and enhancing it with a visual memory and a simple KNN algorithm to make classification decisions; the visual memory essentially corresponds to a database of feature vectors from the frozen pre-trained model. Given such a visual memory, the knowledge used by the model for classification decisions can be edited as easily as entries can be added or removed from a database. Also, the authors explore different ways to aggregate labels from the k nearest neighbors of a query, proposing a ranking-based weighing strategy.
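Roughly, the ranking-based weighting amounts to something like the sketch below; the 1/rank decay is my own stand-in for illustration, and the paper gives the exact RankVoting formula.

```python
# Illustrative rank-weighted label aggregation (not necessarily the paper's formula).
import numpy as np

def rank_vote(sorted_neighbor_labels, num_classes):
    """sorted_neighbor_labels: integer labels of the k neighbors, nearest first."""
    scores = np.zeros(num_classes)
    for rank, label in enumerate(sorted_neighbor_labels, start=1):
        scores[label] += 1.0 / rank          # closer neighbors contribute larger weights
    return int(scores.argmax())
```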
The authors acknowledge that similar systems already exist in the literature, however, they conduct multiple new experiments to highlight a number of capabilities, aiming to show the relevance of such a system in modern applications, and inspire further research in this direction. In the experimental evaluation of the method, the authors use ImageNet-1K (IN-1K) as the default visual memory, and show that visual memory can handle out-of-distribution (OOD) samples (they use NINCO dataset) without harming performance on existing classes, while memory can efficiently increase to billion-scale data. They also show that the influence of memory entries can be controlled through hard or soft pruning, which corresponds to weighing memory entries based on offline estimation of their impact in classification decisions. In addition, the authors show that classification decisions can be interpreted by inspecting the k nearest neighbors in the visual memory, along with their aggregation weights. Finally, the authors provide a number of additional experiments in the Appendix, where they further explore the behavior of their system, e.g., hierarchical classification, and prediction calibration.
update after rebuttal
The authors addressed most of my review comments, so I increased my score from 2 to 4.
Questions for Authors
When soft pruning is used, if new samples are added to the database, should pruning weights be calculated again?
Claims and Evidence
- The authors motivate their research by mentioning in the Abstract, “Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is nearly impossible, since all information is distributed across the network’s weights.” I think there is truth in this statement, but at the same time the phrase “nearly impossible” is pretty strong, because there are plenty of PEFT methods (e.g., LoRA) to address domain shifts and diverse downstream tasks, and multiple alignment strategies (e.g., DPO) of variable complexity.
- In Section 3, the authors essentially dedicate one subsection for each of their claims, which makes for a very well organized presentation. In general, the experiments are well designed, and offer evidence for the corresponding claims. However, I think there is an issue with novelty. The idea of visual memory combined with a KNN classifier exists in many works that the authors cite, e.g., [1], so, I am not denying that there is value in additional elaborate experiments, but it doesn’t seem a major contribution, especially since KNN classifiers usually underperform compared to linear probes [2], and don’t generalize to diverse downstream tasks, e.g., segmentation and object detection. At the same time, the authors introduce a novel idea with Gemini re-ranking, but it is not really explored in the paper, even if it gives promising results in Table 2.
[1] Nakata, Kengo, et al. "Revisiting a knn-based image classification system with high-capacity storage." European conference on computer vision. Cham: Springer Nature Switzerland, 2022.
[2] Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).
Methods and Evaluation Criteria
In most experiments, the authors evaluate different configurations of their own model. For example, in Table 1, they evaluate the relative performance of different aggregation methods, and the in- and out-of-distribution performance before and after expanding the visual memory. I think this is fine to demonstrate a capability, but to demonstrate its impact in the broader literature, I think it's important to have baselines, like a zero-shot classifier similar to the one from CLIP or a linear probe, especially since linear probes tend to perform better than KNN classifiers, as can be seen in Table 2, where the linear probe baseline outperforms the KNN classifiers without the Gemini re-ranking. Of course, a baseline like a linear probe will come with the extra cost of training compute, but compute measurements can be included as well, since it is good to know the performance-compute trade-off between different relevant methods.
Theoretical Claims
There aren't any proofs or theoretical claims.
Experimental Design and Analysis
In general, the authors make a good effort to isolate the effect of different factors in order to reach stable conclusions. For example, in the pruning experiments (Section 3.5), as explained in Section J in the Appendix, they try to mitigate confounding effects so they can attribute differences in the behavior to pruning.
Supplementary Material
The supplementary material offers valuable additional experiments, analyses, and details. Some comments:
- In Algorithm 1, all rows have index 0. Also, it’s not hard to understand the gist of the algorithm based on the pseudocode, but I think it would be useful to have a text description walking through it. For example, the algorithm returns “label_at_level = 0”, where I guess the “= 0” is used to emphasize that the returned label corresponds to the species level (last level), but why not just return “label_at_level”? Also, shouldn’t the algorithm return a list with labels from all levels (in Fig. 11 classification is made at all levels)?
- In addition, about Algorithm 1: why not use KNN the same way it is used in the rest of the paper, i.e., finding the k nearest neighbors and aggregating labels for each label level based on these neighbors? Doesn’t the need for such an algorithm harm the generalization capability and the simplicity of the proposed KNN classification approach?
- Section J: How is the reliability factor selected?
- Figure 11: It’s not entirely clear to me what the dotted line is. The caption mentions, “The black dotted line indicates baseline accuracy from predicting the majority class”; does this mean that it corresponds to plurality voting instead of ranking? In general, in some experiments (e.g., Section G) the aggregation method used is not mentioned; is it correct to assume that the default method is rank voting?
Relation to Broader Literature
The authors cite a number of highly related works, e.g., ln 25-28, col 2, which indicate that the combination of a visual memory with a KNN classifier already exists in the literature. I think the main contribution is a set of new experiments to highlight the capabilities of the proposed system, and to show that it is relevant to modern applications.
Essential References Not Discussed
Nothing to add.
Other Strengths and Weaknesses
The manuscript is very well written, with clear Figures, Tables and captions. The authors sometimes have a playful tone, which I personally don’t mind, and I even find refreshing.
Other Comments or Suggestions
- ln 37, col 2: “ it has seven desirable capabilities”, Section 3 discusses 6 capabilities.
- ln 94, col 2: In the memory definition, I think the feature and its label should be written as a tuple.
- In Section 2.1., the authors describe visual memory entries as feature maps, and in Section 2.2., as feature vectors (ln 149, col 1); I don’t think these terms should be interchanged, especially since the only distance metric used is cosine similarity, which requires feature vectors.
- ln 150, col 1: there is “((” where a “)” should be.
- ln 242, col 2: “test whether smaller models larger memory”, I think “with” is missing; similarly, in ln 273, col 1, something seems off with the phrase “increasing memory model size”.
Dear Reviewer Em8N,
Thank you very much for your detailed review. We’re glad to hear you appreciated the “well-designed experiments”, “very well organized presentation / clarity” and “very well written manuscript”.
Abstract: “nearly impossible” is pretty strong
Noted - we’ll change this to “editing the knowledge in a network is hard”.
Generalization to other tasks like segmentation / detection:
It’s possible to extend the approach to other tasks. One can pool features into multiple embedding clusters instead of a single cluster using classical or learned clustering methods. As a proof of concept, we tested object segmentation based on a visual memory of DinoV2 features; visualized here: https://ibb.co/92dM0B2. This puts features from a single image into memory (a car in the example) based on 8 feature clusters and uses these to identify similar features in the second (test) image, thereby creating a segmentation mask. Such multi-vector representations of images (which can be expanded to image + text) are a natural extension to our work, enabling tasks such as object retrieval or detection from images containing multiple objects, or providing coarse patch-level semantic segmentation. Finer-grained segmentation masks can be obtained with further training of the pooling methods. We hope this provides an intuition for how other tasks can be approached, and we will add the proof of concept example to the appendix. If other tasks are tackled, our main argument about the increased flexibility of a memory-based approach (and the benefits / capabilities enabled by this flexibility) directly translates to those tasks as well (and could be plugged into canonical architectures like Fast-RCNN).
Baselines, e.g. Table 1:
We fully agree on the importance of baselines. Due to space limitations, some comparisons were moved to the appendix, like Table 9 (RankVoting 79.9%, CLIP zero-shot 75.3%). We’d be happy to mention and display baselines more prominently in the main paper. For Table 1, we did not include a comparison to linear probes since the goal of the table is to show a lifelong learning evaluation, i.e. understand which performance can be reached on OOD data without re-training anything. If NINCO classes are evaluated with a DinoV2 model and its default linear classifier, the performance would be 0.00% because the classifier is static and cannot transfer what it has learned without further fine-tuning and change of architecture/layer.
Novelty:
Agreed, while we provide technical improvements like RankVoting, we build on well-established methods with a long history in ML and seek to be very transparent about this throughout. Instead, our main focus is a broad evaluation of flexible capabilities as mentioned in your review. We believe there is community interest in seeing solid capability evaluations (as e.g. evidenced by xHyy describing it as “one of the most exciting papers I’ve seen this year in this space”).
How is the reliability factor selected?
It’s directly related to the number of times the training image contributed to a wrong decision on ImageNet-train; see https://ibb.co/2320r1Ld.
What’s the dotted line in Fig 11?
It highlights a chance accuracy baseline (no aggregation, just constantly predicting a single class). For balanced datasets, baseline guessing accuracy can be calculated as 1 / num_classes. Since iNaturalist is unbalanced, this could be misleading. E.g. if a dataset has just two classes, but one of them accounts for 70% of samples, then constantly predicting this class would lead to 70% accuracy. Thus a commonly used “strong guessing” baseline for unbalanced datasets is to always predict the largest class (with the most samples over the entire dataset), without encoder/aggregation. We’ll update the description to make this clear. We could add DinoV2 linear probing as well, though it might struggle to train properly given just a handful of training samples.
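For illustration, this baseline is simply the relative frequency of the largest class (the example numbers below are made up):

```python
# Tiny illustration of the "strong guessing" baseline for an unbalanced dataset:
# always predict the most frequent class; no encoder or aggregation involved.
from collections import Counter

def majority_baseline_accuracy(labels):
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# With two classes where one holds 70% of the samples, the baseline is 0.7:
# majority_baseline_accuracy(["a"] * 7 + ["b"] * 3) == 0.7
```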
Section G aggregation method?
Since we’re adding neighbors to memory starting from 0 exemplars, no aggregation is used here (k=1); we’ll make sure to mention this.
Algorithm 1:
We made a mistake here - Alg. 1 corresponds to an algorithm that we tried, but it didn’t work better than the much simpler kNN-based classifier with k=1. We should have removed the algorithm, apologies for the oversight. Fig. 11 indeed already corresponds to the simple kNN classification approach that you suggested we try instead.
If new samples are added, should soft pruning weights be calculated again?
Since those new samples don’t have reliability weights, their reliability would indeed need to be estimated, unless a “default reliability weight” (e.g. mean reliability of existing samples) is used as a proxy. If the new samples are IID, existing weights for existing samples wouldn’t change systematically, thus those can be kept without recalculating.
Other comments / suggestions:
Excellent points, thank you!
I would like to thank the authors for their detailed reply. The concern I expressed about novelty remains, but I think all other points from my review are addressed, so, I will increase my score, recommending this work for publication.
Dear Reviewer Em8N, thanks for getting back to us - we're glad to hear we were able to address important points from your review, and we believe the manuscript improved as a result of your helpful feedback.
Instead of fine-tuning to adapt your image classifier to a new domain, use a kNN on your new images with distances provided by the network.
As reviewers point out, the idea is not new. At the same time, it is also not popular, with fine-tuning dominating the field and this kind of method hardly even being used as a baseline. This is despite the fact that, as the authors point out, the method works well. Perhaps the tagline of "visual memory" will help it become more popular.
One could imagine many practical applications that would benefit from adopting such a method. For example, ones where memory editing is crucial. Popularizing it can only help the field along.
My two main suggestions to the authors would be to 1) highlight the other advantages of the method which are only shown in the appendix, and 2) more directly describe this in the abstract and the introduction as a revival or repositioning of something that already exists and can be very useful to readers. At the moment, both overclaim by failing to note that this method is already well known, and, by doing so, cut the reader off from a rich literature about such methods.