PaperHub
Overall rating: 6.8/10 · Poster · 4 reviewers (min 6, max 7, std 0.4)
Individual ratings: 7, 6, 7, 7
Confidence: 4.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.3
NeurIPS 2024

Interactive Deep Clustering via Value Mining

Submitted: 2024-05-08 · Updated: 2024-11-06
TL;DR

We propose incorporating external user interaction to tackle hard samples at cluster boundaries, which existing deep clustering methods fail to discriminate internally from the data itself.

Abstract

Keywords
Deep Clustering; Interactive Clustering

Reviews and Discussion

Review (Rating: 7)

This paper proposes to incorporate user interactions to deal with the hard samples in deep clustering. By asking the user for the cluster affiliations of boundary samples, the proposed framework improves the performance of existing clustering methods.

Strengths

  1. The hard boundary samples are indeed the performance bottleneck of most existing clustering methods, since they cannot be easily distinguished in a fully unsupervised manner. This work proposes to incorporate user interaction to guide the clustering process, which is a natural thought and a practical solution.
  2. Experiments demonstrate the effectiveness of the proposed IDC framework in boosting clustering performance with reasonable interaction costs.
  3. The paper is well-structured and easy to follow in general.

Weaknesses

  1. I noticed that the hyper-parameters are different when applying the proposed IDC to TCL and ProPos. However, it is daunting to tune hyper-parameters for unsupervised tasks. The authors need to provide detailed instructions on how these two parameters should be tuned in practice. Besides, parameter analyses on $\lambda_1$ and $\lambda_2$ are expected to investigate the robustness of IDC.
  2. As shown in Table 2, selecting samples based on "Hard+Rep" leads to performance even inferior to the pre-trained baseline. More explanation of this unusual phenomenon is needed.
  3. The current utilization of user feedback is relatively simple. More advanced and thorough utilization strategies are worth exploring to reduce the interaction cost.

Questions

See weaknesses.

Limitations

The authors do not discuss the limitations of this work. I think that the cost of selecting 500 images for interaction is non-negligible, which should be included in the manuscript as a potential limitation.

Author Response

We appreciate your positive feedback and suggestions. Below we provide the point-by-point response to your concerns.

Weakness 1: Inconsistent $\lambda_1, \lambda_2$ for different clustering models

We agree that it is daunting to tune the hyper-parameters for unsupervised tasks. In the previous version, we roughly tuned $\lambda_1, \lambda_2$ to produce relatively cluster-balanced inquiry samples. According to your suggestion, we applied our IDC to the pre-trained TCL clustering model with $\lambda_1=1, \lambda_2=1$ (the same configuration used for ProPos). It turns out that IDC works well with the default hyper-parameters.

| $\lambda_1$ | $\lambda_2$ | CIFAR-10 (NMI/ACC/ARI) | CIFAR-20 (NMI/ACC/ARI) | STL-10 (NMI/ACC/ARI) | ImageNet-10 (NMI/ACC/ARI) | ImageNet-Dogs (NMI/ACC/ARI) |
|---|---|---|---|---|---|---|
| 2.5 | 0.5 | 83.9 / 91.6 / 82.8 | 59.4 / 69.3 / 49.5 | 85.8 / 92.7 / 84.7 | 92.9 / 97.0 / 93.5 | 67.8 / 78.0 / 62.3 |
| 1.0 | 1.0 | 84.4 / 92.7 / 84.8 | 58.1 / 69.4 / 48.7 | 85.3 / 92.7 / 84.6 | 93.2 / 97.2 / 93.9 | 69.1 / 78.8 / 63.6 |

As a result, we have decided to remove the two tunable parameters $\lambda_1, \lambda_2$ in the next version.

Weakness 2: Explanations of ablation study results

The representativeness metric encourages the model to select samples in high-density regions. As shown in Figure 3(b), with the "Hard+Rep" criterion, the selected samples are heavily imbalanced across clusters. As a result, during model finetuning, the hard samples would collapse into a few clusters (easy samples are restricted by the regularization loss), leading to performance even worse than the pre-trained baseline. Such a result is exactly what motivated us to design the diversity metric for sample selection.

We will supply the above explanations in the revised manuscript.

Weakness 3: More advanced utilization strategies

Thanks for the suggestion. We agree that more advanced strategies could utilize user feedback more thoroughly. Some potential directions would be introducing mixup augmentations, supervised contrastive loss, or focal loss to improve the model finetuning. We will explore these improvements in future works.

Limitation: Interaction cost

According to your comments, we asked three people to answer the 500 queries on CIFAR-20 and ImageNet-Dogs. On average, it took about 6 seconds to decide each sample's cluster affiliation among its nearest cluster centers, so querying 500 samples requires about 50 minutes. Nevertheless, querying 100 samples, which takes only about 10 minutes, already achieves half the performance improvement brought by 500 samples, as shown in Fig. 4. In practice, the user could flexibly decide the number of queries based on demand, trading off efficiency against performance.

We will add the above discussions to the revised manuscript.

Comment

The authors have addressed my concerns well, and I decide to raise my score to Accept.

Comment

We appreciate your feedback on our response and raising the score. We are pleased to learn that our response addressed your concerns and will revise the paper based on your comments.

Review (Rating: 6)

This paper proposes a way to integrate user feedback to fine-tune clustering models and improve their performance on hard samples, i.e., samples near cluster boundaries. In particular, the authors propose a score to value and select samples for user interaction. This score is a combination of hardness, representativeness, and diversity, each defined with a mathematical formulation. Samples are ranked based on their value, and the top M samples are selected for user interaction. A user then gives feedback on each selected sample by indicating whether it belongs to one of the nearest cluster centers (positive feedback) or to none of them (negative feedback). Once this user feedback is gathered, the model is fine-tuned on this information, and a regularization term is included to ensure the model retains its predictions on high-confidence samples. While the hardness term in the score encourages hard samples to be selected, representativeness encourages samples from dense regions, and diversity ensures samples are not selected only from a handful of small regions. The authors apply their method to two recent pre-trained clustering models and compare their results with alternative methods on five image datasets.
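To make the described interaction form concrete, a minimal sketch is shown below. It is illustrative only; the `oracle` callable stands in for the human user and is a hypothetical placeholder, not part of the paper's implementation.

```python
import numpy as np

def query_user(features, centers, selected_idx, n_nearest=3, oracle=None):
    """Sketch of the query form: for each selected sample, show its
    n_nearest cluster centers; the user either picks one of them
    (positive feedback) or rejects all of them (negative feedback)."""
    feedback = {}
    for i in selected_idx:
        # cosine similarity between the sample and every cluster center
        sims = centers @ features[i] / (
            np.linalg.norm(centers, axis=1) * np.linalg.norm(features[i]) + 1e-12)
        candidates = np.argsort(-sims)[:n_nearest]  # nearest centers shown to the user
        choice = oracle(i, candidates) if oracle is not None else None
        if choice is not None:
            feedback[i] = ("positive", int(choice))          # belongs to this cluster
        else:
            feedback[i] = ("negative", candidates.tolist())  # belongs to none of them
    return feedback
```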

Strengths

  • The paper deals with the important problem of improving model clustering performance on hard samples.
  • The idea is simple and well-presented, and looking at the reported metrics it leads to improved results. This however might not be an extensive evaluation (see Weaknesses below).
  • On the experimental settings considered by the authors, many state-of-the-art methods are reported for comparison.

Weaknesses

  • One intrinsic and crucial weakness of this approach is the assumption that a user is able to give meaningful and accurate feedback on the clustering. In many real world settings and with complex data types (e.g. genomics) this is an unreasonable assumption to make, hence the method cannot be applied. In other settings (e.g. medical applications) such a feedback might require domain experts and hence be very costly. This intrinsic limitation significantly impacts the practical relevance of the method.
  • While the results for the presented method look positive compared to state-of-the-art, important information is missing to really assess if the clustering performance is enhanced, and specifically if it is actually enhanced on hard samples over which the model does not receive user-feedback. This is particularly important since the authors claim "Different from existing studies that pursue overall performance improvements, this work aims to address the specific hard sample problem." In detail, how is the performance after fine-tuning on user-feedback calculated? Is there a hold-out test set? If the same data used for user-feedback and fine-tuning is then used to test model performance (e.g. when computing results in Table 1), then I think the metrics reported are not strong evidence. In fact the improved performance might be due mainly to the model clustering better the samples over which it receives user-feedback. What would be important evidence for the value of the proposed method is showing that after fine-tuning the model performs better specifically on hard samples over which it did not receive user-feedback nor was fine-tuned on. Unfortunately this seems to be missing.
  • There is no Appendix to the paper, and I think it is a pity, since important details regarding e.g. datasets or training/test splits and experiments are missing. How big is the number of samples M=500 selected for user-feedback relative to the total size of each dataset? Also I think it would be important to give more details on user-interaction (e.g. mentioning how many users participated, how were they selected, describing the interface used to give feedback).
  • Figure 4 shows that for the number of samples M=500 selected for user-feedback in the experiments the impact of the proposed sample selection strategy is marginal, especially for ImageNet-Dogs. In particular, since the accuracies of the proposed strategy and the random one are close, it would be good to show also the same ablation results for ARI and NMI to see if the selection strategy yields a concrete improvement on these metrics.

Questions

Some questions for the authors and suggestions for additional results to show are in the Weaknesses section. Below are suggested minor changes for quality of the text.

  • Line 53: change "model-irrelevant" to "model-agnostic"
  • Line 217: change "state of the arts" to "state-of-the-art"

Limitations

While in the checklist the authors state that they discuss the limitations in the conclusion section, I do not find that this is the case, and I think it would be important to explicitly mention important limitations of the proposed approach. Examples of such limitations are the fact that this method relies on the user being able to give accurate feedback on the clustering, which in many real-world settings is not reasonable to assume, and the fact that the method is only tested on image data (and not other modalities where user feedback might be less effective or more costly to get).

Author Response

Thanks for your recognition of the problem this work aims to address, as well as our simple yet effective idea. We sincerely appreciate your insightful and constructive comments. Below are the point-by-point responses to your concerns raised in weaknesses, questions, and limitations.

Weakness 1 & Limitation: Imperfect user feedback

We agree with you that providing accurate user feedback would be more costly when handling more complex data types. However, we would like to highlight the motivation of our work, namely, introducing user interaction to address the hard sample problem suffered by fully unsupervised clustering methods. If a data sample is so hard that even humans cannot distinguish it, then machine learning methods are even less likely to correctly cluster it solely based on the data itself. In other words, by selecting the most valuable samples for user queries, our method provides a chance of grouping samples indistinguishable by clustering methods, with additional user interaction costs kept as low as possible. We believe such a solution caters to practical needs.

Moreover, in real-world genomic and medical applications, the users are very likely to be domain experts since not everyone has the demand for clustering genomic data. Besides, as we later show in the response to Weakness 3, our method is robust to the potential mistakes in the user feedback, which suits real-world applications.

Weakness 2: More evidence of performance improvement

According to your comments, we further assessed the experimental results and concluded that the performance improvement of our IDC on hard samples is substantial rather than incremental. Specifically, following common practice, we treated samples with cluster assignment confidence lower than 0.9 as hard samples (we did not use the hardness metric defined in the paper as its logarithmic value is less intuitive). Note that the sample confidence for ProPos was computed through a cluster head initialized with K-means cluster centers. Then, to better understand the performance gain brought by IDC on hard samples, we partitioned the hard samples into two subgroups: i) hard-in, consisting of hard samples selected for query, and ii) hard-out, consisting of hard samples not selected for query. We report the proportion of hard-in samples among all hard samples, and the clustering accuracy of hard-in/out samples before/after model finetuning. Below are the results of IDC applied to the TCL and ProPos models, respectively.

| IDC$_{\textrm{TCL}}$ | CIFAR-10 | CIFAR-20 | STL-10 | ImageNet-10 | ImageNet-Dogs |
|---|---|---|---|---|---|
| hard-in ratio | 4.6% | 2.3% | 35.1% | 22.9% | 7.9% |
| hard-in ACC (before) | 47.9 | 28.2 | 38.8 | 50.0 | 29.6 |
| hard-in ACC (after) | 99.4 (↑51.5) | 81.7 (↑53.5) | 98.9 (↑60.1) | 94.3 (↑44.3) | 89.8 (↑60.2) |
| hard-out ACC (before) | 48.5 | 23.9 | 43.5 | 49.7 | 27.1 |
| hard-out ACC (after) | 60.4 (↑11.9) | 37.4 (↑13.5) | 62.5 (↑19.0) | 70.5 (↑20.8) | 55.1 (↑20.0) |

| IDC$_{\textrm{ProPos}}$ | CIFAR-10 | CIFAR-20 | ImageNet-10 | ImageNet-Dogs |
|---|---|---|---|---|
| hard-in ratio | 3.0% | 1.5% | 8.0% | 7.4% |
| hard-in ACC (before) | 58.6 | 40.2 | 48.6 | 38.9 |
| hard-in ACC (after) | 99.1 (↑40.5) | 96.0 (↑55.8) | 97.2 (↑48.6) | 82.2 (↑43.3) |
| hard-out ACC (before) | 58.5 | 29.2 | 53.1 | 38.3 |
| hard-out ACC (after) | 81.5 (↑23.0) | 63.0 (↑33.8) | 77.8 (↑24.7) | 58.6 (↑20.3) |

From the above results, one could arrive at three conclusions: i) the selected query samples account for only a small percentage of all hard samples; in other words, simply correcting the cluster assignments of the selected samples could only bring small improvements on hard samples; ii) the clustering accuracy of the selected hard samples improves significantly, which is natural since these samples are used for finetuning; and iii) the clustering accuracy of hard samples not selected for query also increases greatly after finetuning, which demonstrates that the user feedback benefits the held-out data as well.
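A minimal sketch of the above evaluation protocol, assuming hypothetical NumPy arrays for confidences, per-sample correctness (after the usual Hungarian matching of clusters to classes), and query membership:

```python
def hard_sample_report(conf, correct_before, correct_after, queried, thr=0.9):
    """Partition hard samples (confidence below `thr`) into hard-in
    (selected for query) and hard-out (not selected), and report their
    accuracy before and after finetuning. All inputs are NumPy arrays;
    `correct_*` and `queried` are boolean masks over the dataset."""
    hard = conf < thr              # hard samples: confidence below 0.9
    hard_in = hard & queried       # hard samples selected for query
    hard_out = hard & ~queried     # hard samples never shown to the user
    return {
        "hard-in ratio": hard_in.sum() / hard.sum(),
        "hard-in ACC (before)": correct_before[hard_in].mean(),
        "hard-in ACC (after)": correct_after[hard_in].mean(),
        "hard-out ACC (before)": correct_before[hard_out].mean(),
        "hard-out ACC (after)": correct_after[hard_out].mean(),
    }
```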

To further show that the user feedback also boosts the clustering performance of samples not selected for interaction, we added a baseline by manually correcting the cluster assignments of the 500 query samples, denoted by TCL$^{\dagger}$ and ProPos$^{\dagger}$ in the table below.

| Method | CIFAR-10 (NMI/ACC/ARI) | CIFAR-20 (NMI/ACC/ARI) | STL-10 (NMI/ACC/ARI) | ImageNet-10 (NMI/ACC/ARI) | ImageNet-Dogs (NMI/ACC/ARI) |
|---|---|---|---|---|---|
| TCL | 81.9 / 88.7 / 78.0 | 52.9 / 53.1 / 35.7 | 79.9 / 86.8 / 75.7 | 87.5 / 89.5 / 83.7 | 62.3 / 64.4 / 51.6 |
| TCL$^{\dagger}$ | 82.2 / 88.9 / 78.4 | 53.2 / 53.5 / 36.1 | 82.0 / 88.6 / 78.5 | 88.6 / 90.4 / 85.0 | 62.8 / 65.6 / 52.3 |
| IDC$_{\textrm{TCL}}$ | 83.9 / 91.6 / 82.8 | 59.4 / 69.3 / 49.5 | 85.8 / 92.7 / 84.7 | 92.9 / 97.0 / 93.5 | 67.8 / 78.0 / 62.3 |
| ProPos | 87.8 / 93.6 / 87.1 | 59.1 / 59.1 / 43.6 | - | 88.9 / 95.2 / 89.6 | 73.0 / 76.9 / 66.9 |
| ProPos$^{\dagger}$ | 87.9 / 93.7 / 87.3 | 59.3 / 59.4 / 43.8 | - | 89.6 / 95.5 / 90.3 | 73.8 / 77.8 / 67.8 |
| IDC$_{\textrm{ProPos}}$ | 90.5 / 95.7 / 90.9 | 69.2 / 78.3 / 61.4 | - | 93.2 / 97.3 / 94.1 | 77.6 / 86.1 / 74.8 |

The results show that solely correcting the cluster assignments of 500 samples brings marginal performance improvement since they are only a small portion of the data. On the contrary, our IDC effectively utilizes user feedback on 500 query samples, which leads to much better performance.

To sum up, the above results demonstrate that our user interaction strategy could greatly improve the clustering performance of samples regardless of whether they are selected for the user query. We will add the above results and discussions to the revised manuscript.

Due to the space limitation, we put the remaining part of the responses in the global author response above.

Comment

We wanted to thank you again for your meticulous review and valuable comments. As the discussion period is coming to a close, please let us know if you have any concerns that need to be further addressed. We truly appreciate a re-evaluation accordingly.

Comment

I appreciate the effort that the authors put into the rebuttal, which helped address some of my concerns. After also reading the other reviews and replies, I raise my score accordingly.

Comment

Thank you for taking the time to read our rebuttal and for raising the score. We will further improve our work by incorporating your constructive comments.

Review (Rating: 7)

This work proposes an interactive deep clustering framework IDC, which could be integrated with existing deep clustering methods. The key idea is to adjust the decision boundary by querying the cluster affiliations of high-value samples. The authors applied IDC to two pre-trained clustering models on five datasets. Experimental results prove the effectiveness of the proposed method.

Strengths

  • It is interesting to investigate whether introducing weak supervision to unsupervised clustering would be helpful. In real-world applications, I think most people are willing to improve the clustering performance through low-cost interaction with the clustering algorithm.
  • The proposed IDC is technically sound. The authors prove that the progressive sample selection strategy could select the most valuable samples to reduce the interaction cost.
  • The authors show that IDC outperforms semi-supervised classification and clustering baselines, which proves the effectiveness and superiority of IDC.

Weaknesses

  • In the evaluation, the authors assume the user feedback is always correct, which may not hold in real-world applications. It is worth exploring how IDC behaves when there are mistakes in the user feedback. Alternatively, some examples could be supplied to show that the interaction task is easy enough for the user to correctly predict the cluster affiliations of those hardest samples.
  • Table 1 shows that the previous semi-supervised clustering method Cop-Kmeans achieves even worse results than applying K-means to the original features. Explanations of these results need to be supplied.
  • Figure 4 shows that the superiority of the proposed sample selection strategy against random selection is marginal on ImageNet-Dogs.

Questions

Please refer to the weakness section.

Limitations

In this work, the authors assume that the user feedback is always correct. It is unknown how IDC performs when there are mistakes in the user feedback.

Author Response

We appreciate your positive feedback and kind suggestions for our work. Below are the point-by-point responses to your concerns mentioned in the weaknesses and limitations sections.

Weakness 1 & Limitation 1: Imperfect user feedback

According to your advice, we asked three people to answer the 500 queries on CIFAR-20 and ImageNet-Dogs. We counted the number of correct user feedback responses and used the feedback to finetune the TCL clustering model. The results are shown in the table below. As requested, we also supplied more examples of user queries in Figure 3 in the attached PDF file.

| Feedback | CIFAR-20 (NMI/ACC/ARI) | Correct feedback | ImageNet-Dogs (NMI/ACC/ARI) | Correct feedback |
|---|---|---|---|---|
| Pre-trained TCL | 52.9 / 53.1 / 35.7 | - | 62.3 / 64.4 / 51.6 | - |
| Perfect feedback | 59.4 / 69.3 / 49.5 | 500 | 67.8 / 78.0 / 62.3 | 500 |
| User 1 feedback | 57.5 / 66.4 / 46.4 | 452 | 66.7 / 75.3 / 59.9 | 460 |
| User 2 feedback | 58.0 / 68.0 / 48.2 | 453 | 66.4 / 74.6 / 59.2 | 458 |
| User 3 feedback | 55.7 / 64.5 / 44.1 | 431 | 65.5 / 74.1 / 58.9 | 447 |

The above results demonstrate that: i) it is not difficult for the user to correctly predict the cluster affiliations of the query samples, and ii) IDC is robust to the mistakes in the user feedback, which suits real-world applications.

Weakness 2: Explanations on Cop-Kmeans results

Cop-Kmeans is a semi-supervised clustering method built upon vanilla K-means. Given the pairwise constraints (i.e., must-link and cannot-link), Cop-Kmeans forces sample pairs with must-link constraints to be assigned to the same cluster and vice versa. In practice, the pairwise constraints often conflict with the cluster assignments made by vanilla K-means. As a result, the cluster centers would deviate from the underlying data distribution due to the "outliers" introduced by the pairwise constraints, leading to inferior clustering performance. Such a limitation of Cop-Kmeans can be attributed to the features being fixed during the K-means process. In other words, Cop-Kmeans fails when the input features are intrinsically inseparable by assigning each sample to the nearest cluster center. In comparison, the proposed IDC could adjust the features by optimizing the cluster head, overcoming this intrinsic limitation of Cop-Kmeans.
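For concreteness, a simplified sketch of the constrained assignment step described above (not the original Cop-Kmeans implementation; variable names are illustrative):

```python
import numpy as np

def cop_kmeans_assign(i, features, centers, labels, must_link, cannot_link):
    """Assign sample i to the nearest center that violates no constraint,
    with features held fixed throughout. `labels` holds current assignments
    (None if not yet assigned); `must_link` / `cannot_link` map a sample
    index to the indices it is constrained with."""
    order = np.argsort(np.linalg.norm(features[i] - centers, axis=1))
    for c in order:
        ok_must = all(labels[j] in (None, c) for j in must_link.get(i, []))
        ok_cannot = all(labels[j] != c for j in cannot_link.get(i, [])
                        if labels[j] is not None)
        if ok_must and ok_cannot:
            return int(c)
    return None  # no feasible center: Cop-Kmeans reports a constraint violation
```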

We will add the above discussion to the revised manuscript.

Weakness 3: Marginal improvements against random selection on ImageNet-Dogs

As you pointed out, the superiority of our sample selection strategy against the random selection baseline on ImageNet-Dogs is less significant than that on CIFAR-20. Such a result is in fact reasonable since ImageNet-Dogs contains only about 1/3 as many samples as CIFAR-20 (see the dataset summary table below).

| Dataset | Split | Samples | Classes |
|---|---|---|---|
| CIFAR-10 | Train+Test | 60000 | 10 |
| CIFAR-20 | Train+Test | 60000 | 20 |
| STL-10 | Train+Test | 13000 | 10 |
| ImageNet-10 | Train | 13000 | 10 |
| ImageNet-Dogs | Train | 19500 | 15 |

In other words, for the same number of query images, their proportion relative to the entire dataset is larger on ImageNet-Dogs than on CIFAR-20. As the proportion increases, the performance gap between different sample selection strategies becomes less significant, which explains the relatively marginal performance improvement against random selection on ImageNet-Dogs. For a more reasonable validation that keeps the proportion consistent, we investigated the selected sample number $M$ in the range 0-200 with an interval of 25 for ImageNet-Dogs. The results are provided in Figure 1 in the attached PDF file. As can be seen, when querying 50 samples, our sample selection strategy outperforms random selection by 3.88 in terms of clustering ACC, larger than the gap of 2.3/0.42 ACC when querying 200/700 samples. The above results demonstrate the superiority of our sample selection strategy, especially when only a small portion of samples is queried.

We will add the above results and discussions in the Appendix in the next version.

Comment

I appreciate the authors' response which addressed my concerns.

Comment

Thanks for going through our response and raising the score. We will refine our paper following your suggestions.

Review (Rating: 7)

In this work, the authors propose a plug-and-play module to boost the clustering performance of existing methods through user interaction. A sample value evaluation criterion is designed to propose valuable user queries with a high performance-to-cost ratio. Experiments show that the proposed module could significantly improve the clustering performance with low user interaction costs.

Strengths

While most existing deep clustering works focus on improving feature discriminability, this work improves the clustering performance from a new perspective. Though introducing user interaction seems to violate the unsupervised setting of clustering, it is meaningful in real-world applications. In addition to the interesting topic, the proposed method itself is also technically sound and solid. The authors propose to reduce the interaction costs by selecting high-value samples and designing an easy interaction form, which improves the application prospect of the method.

Weaknesses

  1. In Table 1, I suggest the authors include a performance baseline that assumes the 500 query samples are correctly clustered. As a result, the improvement brought by the proposed method would be clearer.
  2. The hyper-parameters are tuned on different datasets, and no parameter analyses are provided.
  3. Since this work is more application-oriented, I encourage the authors to carefully design the interaction interface to further improve the interaction efficiency.

Questions

  1. What is the time complexity of the sample selection process? Does it scale to large datasets?
  2. Why are the choices of $\lambda_1$ and $\lambda_2$ different for the two pre-trained clustering models?
  3. Querying about 500 samples is non-negligible. Have you evaluated how much time it takes for a person to give feedback?
  4. In implementation details, the authors write "For ProPos, we warm up the cluster head with the positive and negative loss in Eq. 12 and 13 in the first 50 epochs, since the prediction confidences are unreliable initially". A recent clustering work (Towards Calibrated Deep Clustering Network, arXiv 2024) shows that initializing the two-layer cluster head with k-means is effective. I wonder if the initialization strategy works in the IDC framework.

Limitations

No significant limitations are found.

Author Response

Thanks for your positive feedback and detailed suggestions for our work. Below are the point-by-point responses to your concerns.

Weakness 1: The relative performance improvement

According to your comments, we added a baseline by manually correcting the cluster assignments of the 500 query samples, denoted by TCL$^{\dagger}$ and ProPos$^{\dagger}$ in the table below.

| Method | CIFAR-10 (NMI/ACC/ARI) | CIFAR-20 (NMI/ACC/ARI) | STL-10 (NMI/ACC/ARI) | ImageNet-10 (NMI/ACC/ARI) | ImageNet-Dogs (NMI/ACC/ARI) |
|---|---|---|---|---|---|
| TCL | 81.9 / 88.7 / 78.0 | 52.9 / 53.1 / 35.7 | 79.9 / 86.8 / 75.7 | 87.5 / 89.5 / 83.7 | 62.3 / 64.4 / 51.6 |
| TCL$^{\dagger}$ | 82.2 / 88.9 / 78.4 | 53.2 / 53.5 / 36.1 | 82.0 / 88.6 / 78.5 | 88.6 / 90.4 / 85.0 | 62.8 / 65.6 / 52.3 |
| IDC$_{\textrm{TCL}}$ | 83.9 / 91.6 / 82.8 | 59.4 / 69.3 / 49.5 | 85.8 / 92.7 / 84.7 | 92.9 / 97.0 / 93.5 | 67.8 / 78.0 / 62.3 |
| ProPos | 87.8 / 93.6 / 87.1 | 59.1 / 59.1 / 43.6 | - | 88.9 / 95.2 / 89.6 | 73.0 / 76.9 / 66.9 |
| ProPos$^{\dagger}$ | 87.9 / 93.7 / 87.3 | 59.3 / 59.4 / 43.8 | - | 89.6 / 95.5 / 90.3 | 73.8 / 77.8 / 67.8 |
| IDC$_{\textrm{ProPos}}$ | 90.5 / 95.7 / 90.9 | 69.2 / 78.3 / 61.4 | - | 93.2 / 97.3 / 94.1 | 77.6 / 86.1 / 74.8 |

The results show that solely correcting the cluster assignments of 500 samples brings marginal performance improvement since they are only a small portion of the data. On the contrary, our IDC effectively utilizes user feedback on 500 query samples, which leads to much better performance. We will add the baseline to the revised manuscript.

Weakness 2 & Question 2: Inconsistent $\lambda_1, \lambda_2$ for different clustering models

In the previous version, we roughly tuned $\lambda_1, \lambda_2$ to produce relatively cluster-balanced inquiry samples. According to your suggestion, we applied our IDC to the pre-trained TCL clustering model with $\lambda_1=1, \lambda_2=1$ (the same configuration used for ProPos). It turns out that IDC works well with the default hyper-parameters.

| $\lambda_1$ | $\lambda_2$ | CIFAR-10 (NMI/ACC/ARI) | CIFAR-20 (NMI/ACC/ARI) | STL-10 (NMI/ACC/ARI) | ImageNet-10 (NMI/ACC/ARI) | ImageNet-Dogs (NMI/ACC/ARI) |
|---|---|---|---|---|---|---|
| 2.5 | 0.5 | 83.9 / 91.6 / 82.8 | 59.4 / 69.3 / 49.5 | 85.8 / 92.7 / 84.7 | 92.9 / 97.0 / 93.5 | 67.8 / 78.0 / 62.3 |
| 1.0 | 1.0 | 84.4 / 92.7 / 84.8 | 58.1 / 69.4 / 48.7 | 85.3 / 92.7 / 84.6 | 93.2 / 97.2 / 93.9 | 69.1 / 78.8 / 63.6 |

As a result, we have decided to remove the two tunable parameters $\lambda_1, \lambda_2$ in the next version.

Weakness 3: Design of interaction interface

Thanks for the suggestion. We will provide a user-friendly interaction interface to aid real-world applications.

Question 1: Time complexity and scalability

The proposed sample selection process consists of computing three metrics (hardness, representativeness, and diversity) and selecting the most valuable samples. Denoting the number of samples as $N$, the number of clusters as $C$, and the number of query samples as $M$, the time complexity of these steps is analyzed as follows.

For the hardness metric, we compute the cosine distance to the two nearest centers for each sample, which is of $O(NC)$ time complexity.

For the representativeness metric, we compute the sum of Euclidean distances to the $K$ nearest neighbors for each sample, which is of $O(NK)$ time complexity. Note that the $K$-nearest-neighbor search is of $O(N + K\log K)$ time complexity, which could be omitted.

For the diversity metric, we compute the cosine distance from each sample to the previously selected samples, which is of $O(NM)$ time complexity.

Finally, selecting the most valuable samples requires $O(M)$ time complexity.

In summary, as the number of query samples $M$ is usually larger than the number of clusters $C$ and the number of nearest neighbors $K$, the overall time complexity of the sample selection process is $O(NM)$. Since only a small portion of samples is selected for query (i.e., $M \ll N$), our sample selection strategy scales to large datasets.
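As an illustration of how the three metrics and the progressive selection can be combined, a minimal sketch is given below; the concrete formulas and the weights `lam1`, `lam2` are illustrative stand-ins rather than the exact value function of the paper.

```python
import numpy as np

def select_query_samples(features, centers, M, K=20, lam1=1.0, lam2=1.0):
    """Greedily pick M query samples by combining hardness,
    representativeness, and diversity. features: (N, d) L2-normalized
    embeddings, centers: (C, d) cluster centers."""
    N = features.shape[0]

    # Hardness: samples whose two nearest centers are almost equally close
    # are ambiguous boundary samples. O(NC).
    sims = features @ centers.T
    top2 = np.sort(sims, axis=1)[:, -2:]
    hardness = 1.0 - (top2[:, 1] - top2[:, 0])

    # Representativeness: a small summed distance to the K nearest neighbors
    # means the sample lies in a dense region. O(NK) given the neighbors;
    # the full pairwise matrix below is only for brevity.
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    knn_dist = np.sort(dists, axis=1)[:, 1:K + 1]
    representativeness = -knn_dist.sum(axis=1)

    # Progressive selection with a diversity term: distance to the samples
    # already selected, recomputed each round. O(NM) overall.
    selected = []
    for _ in range(M):
        diversity = dists[:, selected].min(axis=1) if selected else np.zeros(N)
        value = hardness + lam1 * representativeness + lam2 * diversity
        value[selected] = -np.inf  # never re-select a sample
        selected.append(int(np.argmax(value)))
    return selected
```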

Question 3: Interaction time cost

We asked three people to answer the 500 queries on CIFAR-20 and ImageNet-Dogs. On average, it took about 6 seconds to decide each sample's cluster affiliation among its nearest cluster centers, so querying 500 samples requires about 50 minutes. Nevertheless, querying 100 samples, which takes only about 10 minutes, already achieves half the performance improvement brought by 500 samples, as shown in Fig. 4. In practice, the user could flexibly decide the number of queries based on demand, trading off efficiency against performance.

Question 4: Cluster head initialization

Thanks for pointing out the cluster head initialization strategy proposed in the mentioned work. However, the authors did not release the code. We tried initializing the cluster head following the instructions in the paper but failed to reproduce the results. We will keep paying attention to the mentioned work and update the cluster head initialization strategy when applicable.

Comment

The authors have addressed my concerns.

Comment

Thanks again for your positive feedback and recognition of our work. We sincerely appreciate your support and will improve our paper following your advice.

Author Response

We appreciate the reviewers for their insightful and constructive comments. We supply a PDF containing the supplementary figures and provide the point-by-point responses below.


Due to the space limitation, we put part of the responses to Reviewer 2n5n in this section:

Weakness 3: Dataset and user interaction details

The experimental configurations and details are mentioned in Section 4.2. According to your comments, we supply a summary of the datasets used in our experiments in the following table. The 500 selected samples account for 0.8%-3.8% of the total size of each dataset.

| Dataset | Split | Samples | Classes |
|---|---|---|---|
| CIFAR-10 | Train+Test | 60000 | 10 |
| CIFAR-20 | Train+Test | 60000 | 20 |
| STL-10 | Train+Test | 13000 | 10 |
| ImageNet-10 | Train | 13000 | 10 |
| ImageNet-Dogs | Train | 19500 | 15 |

For user interaction, in the current version, we assumed that the user gives perfect responses to 500 queries. According to your comments, we asked three colleagues to answer the 500 queries on CIFAR-20 and ImageNet-Dogs. We counted the number of correct user feedback and used the feedback to finetune the TCL clustering model. The results are shown in the table below. We also supplied more examples of the interaction interface in Figure 3 in the attached PDF file.

| Feedback | CIFAR-20 (NMI/ACC/ARI) | Correct feedback | ImageNet-Dogs (NMI/ACC/ARI) | Correct feedback |
|---|---|---|---|---|
| Pre-trained TCL | 52.9 / 53.1 / 35.7 | - | 62.3 / 64.4 / 51.6 | - |
| Perfect feedback | 59.4 / 69.3 / 49.5 | 500 | 67.8 / 78.0 / 62.3 | 500 |
| User 1 feedback | 57.5 / 66.4 / 46.4 | 452 | 66.7 / 75.3 / 59.9 | 460 |
| User 2 feedback | 58.0 / 68.0 / 48.2 | 453 | 66.4 / 74.6 / 59.2 | 458 |
| User 3 feedback | 55.7 / 64.5 / 44.1 | 431 | 65.5 / 74.1 / 58.9 | 447 |

The above results demonstrate that: i) it is not difficult for the user to correctly predict the cluster affiliations of the query samples, and ii) IDC is robust to the mistakes in the user feedback, which suits real-world applications.

We will add the above additional information and results to the Appendix in the next version.

Weakness 4: Marginal improvements against random selection on ImageNet-Dogs

As you pointed out, the superiority of our sample selection strategy against the random selection baseline on ImageNet-Dogs is less significant than that on CIFAR-20, especially when more images are queried. Such a result is in fact reasonable since ImageNet-Dogs contains only about 1/3 as many samples as CIFAR-20 (see the dataset summary table above). In other words, for the same number of query images, their proportion relative to the entire dataset is larger on ImageNet-Dogs than on CIFAR-20. As the proportion increases, the performance gap between different sample selection strategies becomes less significant, which explains the relatively marginal performance improvement against random selection on ImageNet-Dogs. For a more reasonable validation that keeps the proportion consistent, we investigated the selected sample number $M$ in the range 0-200 with an interval of 25 for ImageNet-Dogs. The results are provided in Figure 1 in the attached PDF file. As can be seen, when querying 50 samples, our sample selection strategy outperforms random selection by 3.88 in terms of clustering ACC, larger than the gap of 2.3/0.42 ACC when querying 200/700 samples. According to your comments, we supplied figures of the ARI and NMI metrics, which show a consistent tendency. The above results demonstrate the superiority of our sample selection strategy, especially when only a small portion of samples is queried.

We will add the above results and discussions to the Appendix in the next version.

Questions: Typos

We sincerely appreciate your meticulous review. We will correct these typos in the next version.

Final Decision

This paper introduces an interactive deep clustering framework (IDC) designed to enhance existing clustering methods through minimal user interaction.

All reviewers agree that this work is highly interesting, that the proposed method is technically sound and solid, and that it leads to improved results. User interaction to guide the clustering process can be important in real-world applications. I vote for accepting this work, as this paper looks at clustering performance from a new perspective, and I think this work is interesting to the community. For a clear positioning of the work, it would be good if the authors could discuss their work in light of https://arxiv.org/abs/1805.11571 in the final version.