We sincerely thank the reviewer aJ9B for the recognition and appreciation of our work, and for the valuable questions as well as comments. For the comments and questions, here are our responses:

Q1: The Lipschitz constant may be difficult to compute rigorously and large in practice.

A1:Thanks for the good question. Previously we use the theorem 3.1 to support our introduction of locality, but it is true that the Lipschitz constant is not tight enough to be used in many real settings. We here discuss a tighter bound and the overall reasonability.

For a single linear layer () with W () and b, its Lipschitz constant is bounded by the spectral norm of W, which is . However, for a Gaussian weight with , it could still have a bound of . In practice, this is still quite big, and it only serves as a worst-case bound.

There is a tighter bound using the local gradient assuming smoothness (which is true for most type of ):

Where .

Empirically, the linearity mainly takes effect when the predictions are reasonably good. It is because

On one hand, when the prediction is poor, the neighbour could be more random and the predicted knn-prediction will give a high entropy itself as well as the model, which will not contradict our purpose of the algorithm (the corresponding sample would very likely have a high entropy and gain, and the theoretical guarantee of knn pred is from the k value);
On the other hand, when the prediction is good (e.g. p>0.7 or higher) in the region, the local gradient will be smaller, and the softmax backward propagation further suppresses the bound with .

So in the cases where the local gradient might fail to give a reasonable bound, the result itself would very likely have high entropy and be correctly estimated as high entropy by both model and knn, and the knn's estimation error doesn't matter too much. The actual algorithm reliance on this linearity is relaxed in bad cases due to the algorithm design. And for regions with good predictions we can empirically evaluate the value.

Q2: The method claims to be able to address data distribution shifts however it was not shown in the benchmarks

A2: The data distribution shifts in the article refer to the problem of model-based active learning: it only emphasizes on samples learned poorly by the model, and the selection causes a distribution problem and worse performance. This is shown in the ablation as dynamic-rechecking/KNN-only are all without this performance drop. We will revise the corresponding parts to make this clear.

Q3: The set of experiments does not demonstrate the generalizability of the method to other data modalities or models, and expanding that evaluation would make the analysis stronger.

A3: Thanks for the advice. We further add semantic segmentation experiment on ADE20k with UperNet(BEiT-V2 backbone). See A2 to Reviewer X845. For modalities other than vision, if there is a sample-level feature, it is also possible to extend the framework to them. Will add more discussion in the revision of future work.

Q4: Typo in the abstract

A4: Thanks, we have fixed it in local revision. Should be "… Info-Coevolution reduces annotation and training costs by 32% without performance loss".

Q5: Hyperparameters used in the framework (such as the distance and similarity thresholds). Could the authors give more insight into finding the optimal values, and how adjusting them might affect the method’s performance.

A5: The hyperparameters should depend on the feature space itself (how dense is the data in the space), and can use empirical methods to evaluate(retrieve some samples to estimate the knn distance distribution and knn-prediction correlation, and an adequate k).

Generally, k = 8, cosine similarity 0.9 is good enough, and cosine similarity 0.85 also works in the setting; 0.95 will be too big as there will be very few near neighbour pairs.

Too small distance (too large similarity threshold) may fail to reduce cost and become a model-based method (for dynamic rechecking. Actually using 0.9 for knn and 0.85 for dynamic rechecking is also a good choice), while too large distance threshold may affect sparse-region samples' knn-prediction (their k-near-nighbour could be further and less linear correlated, and should be kept for the sake of generalization).

In our actual use cases, the threshold value is not sensitive, cosine similarity 0.9 directly works without tuning, and cosine similarity 0.85 doesn't make a statistically significant difference.