Improved Guarantees for Fully Dynamic $k$-Center Clustering with Outliers in General Metric Spaces
We propose a novel fully dynamic algorithm that maintains a $(4+\epsilon)$-approximate solution to the $(k,z)$-center clustering problem that covers all but at most $(1+\epsilon)z$ points at any time in the update sequence.
Abstract
Reviews and Discussion
This paper proposes a simple but effective method for the fully dynamic $k$-center problem with outliers in general metric spaces. The fully dynamic setting requires the algorithm to adjust its output efficiently whenever a deletion or insertion operation occurs. The paper uses a "ball cover" strategy to obtain a $(4+\epsilon)$-approximate solution to the $k$-center problem with outliers. The ball radius is guessed, and one of the guesses is guaranteed to be close enough to the optimal $k$-center radius after a bounded number of guesses. When a deletion or insertion occurs, the algorithm re-clusters (using the offline algorithm) if necessary. However, the paper proves that the re-clustering operation does not happen frequently. As a result, even though the time complexity of a single re-clustering may be high in the worst case, the amortized time for handling the insertion or deletion of a single point is small.
Strengths
- The approximation ratio of the proposed algorithm is $(4+\epsilon)$, far better than that of previous papers such as [5].
- The algorithm of this paper does not need extra assumptions on the metric space (e.g., the low doubling dimension assumption in [2]), which gives it broad applicability.
- The presentation is nice.
Weaknesses
- Time complexity. The amortized update time is worse than that of the method in [5] when $k$ is large; the $k$-dependence of their update time is a factor of $k^4$ smaller.
- High update time complexity in the worst case. The proposed algorithm may still need to re-cluster from scratch to handle the insertion or deletion of a single point in the worst case.
- No experimental results. Although this work is mostly theoretical, that does not mean experiments are unnecessary (especially for NeurIPS, which is not a purely theoretical venue). The previous work [5] provides experimental results, so I strongly encourage the authors to compare both the running time and the cost of their method against [5].
Questions
- About the time complexity of the binary search over the radius guess $r$: you have mentioned that the set $R$ of all possible values of $r$ is bounded. So if you binary search over the set $R$, it seems that you only need $O(\log|R|)$ steps. Can your time complexity be improved accordingly?
- Suppose I want to delete a point that is also a center of the solution. I think this center should also be removed, since the point no longer belongs to the metric space. But it seems that Procedure 2 in the paper does not consider this situation. How does your algorithm handle the deletion of center points?
- What is the memory usage of your algorithm?
Limitations
Please see the weaknesses above.
Comment: Time complexity. The amortized update time is worse than that of the method in [5] when $k$ is large; the $k$-dependence of their update time is a factor of $k^4$ smaller.
Comment: High update time complexity in the worst case. The proposed algorithm may still need to re-cluster from scratch to handle the insertion or deletion of a single point in the worst case.
Response: Besides the improvement in the approximation factor with respect to the algorithm in [5], another interesting aspect of our method is that it is adversarially robust, meaning it can handle an adversary who observes the computed solution after each update and adjusts the subsequent updates accordingly. In contrast, the dynamic $(14+\epsilon)$-approximation algorithm developed by Chan et al. is not adversarially robust. In particular, if their solution is revealed at any time, the adversary can impose an expensive update that requires a re-clustering from scratch.
Regarding the question about the worst-case time to handle an insertion or deletion of a single point, we believe this can be managed with some adjustments to our algorithm. Specifically, we propose running two parallel instances of our algorithm. The first instance computes and outputs the solution, while the second instance divides the expensive re-clustering operations into smaller chunks and processes one chunk after each update. When the solution maintained by the first run becomes invalid (potentially only after many updates), we switch to the second run, obtain the solution from there, and continue the process. It is important to note that when switching between runs, we obtain a new solution and maintain that updated solution.
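To illustrate the chunking idea in isolation, here is a generic de-amortization pattern where an expensive full rebuild is spread over updates in fixed-size chunks. This is only a sketch with invented names, and the trivial "rebuild" (re-collecting the live points) stands in for the paper's actual re-clustering procedure:

```python
class DeamortizedRebuilder:
    """Toy de-amortization: one instance serves the current solution while
    a second instance rebuilds from scratch, doing at most chunk_size
    units of work per update; when it finishes, we switch to it."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.live = set()        # current point set
        self.solution = set()    # instance 1: the solution we report
        self.todo = None         # instance 2: remaining rebuild work
        self.fresh = set()       # instance 2: partially rebuilt solution

    def update(self, op, x):
        (self.live.add if op == "insert" else self.live.discard)(x)
        if self.todo is None:                 # start a fresh background rebuild
            self.todo = list(self.live)
            self.fresh = set()
        # advance the background rebuild by at most chunk_size units of work
        for _ in range(min(self.chunk_size, len(self.todo))):
            p = self.todo.pop()
            if p in self.live:                # "work" here: re-collect one live point
                self.fresh.add(p)
        if not self.todo:                     # rebuild finished: switch instances
            self.solution = self.fresh
            self.todo = None
```

In the paper's setting, one "unit of work" would be a step of the re-clustering procedure rather than copying a point, and switching instances corresponds to adopting the second run's solution.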
Comment: No experimental results. [...] I strongly encourage the authors to compare the performance of both running time and cost to [5].
Response: We agree with the reviewer that providing empirical results or practical evaluations would be valuable and interesting. Our main focus was to improve upon the $(14+\epsilon)$-approximation guarantee of [5] and determine how far the approximation factor can be refined, leading to the first adversarially robust dynamic $(4+\epsilon)$-approximation algorithm for the metric $k$-center problem with outliers. We leave conducting experiments and benchmarking our dynamic algorithm against the algorithm by Chan et al. on both real and synthetic data as future work.
Question: About the time complexity of the binary search over the radius guess $r$: you have mentioned that the set $R$ of all possible values of $r$ is bounded. So if you binary search over the set $R$, it seems that you only need $O(\log|R|)$ steps. Can your time complexity be improved accordingly?
Response: For each guess $r$, we run an instance of the algorithm for that specific radius. After each insertion or deletion, we update all of these instances, so a binary search over the guesses does not apply. As a result, the update time incurs a multiplicative factor equal to the number of guesses. We apologize for any confusion caused by the title of that paragraph and will revise it accordingly.
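The point can be illustrated by the standard pattern of maintaining one instance per radius guess on a geometric grid. This is a sketch with invented names; `make_instance` stands in for whatever per-radius data structure is maintained:

```python
def radius_guesses(d_min, d_max, eps):
    """Geometric grid of radius guesses covering [d_min, d_max]."""
    guesses, r = [], d_min
    while r <= d_max * (1 + eps):
        guesses.append(r)
        r *= 1 + eps
    return guesses

class PerGuessMaintainer:
    """Forwards every update to all per-radius instances. Each instance
    must stay up to date so it is usable whenever its radius becomes the
    right guess; this is why the number of guesses multiplies the update
    time and a binary search over guesses does not directly apply."""

    def __init__(self, d_min, d_max, eps, make_instance):
        self.instances = {r: make_instance(r)
                          for r in radius_guesses(d_min, d_max, eps)}

    def update(self, op, x):
        for inst in self.instances.values():  # every instance pays for every update
            inst.update(op, x)
```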
Question: Suppose I want to delete a point that is also a center of the solution. I think this center should also be removed, since the point no longer belongs to the metric space. But it seems that Procedure 2 in the paper does not consider this situation. How does your algorithm handle the deletion of center points?
Response: As mentioned in the paper, we study the non-discrete version of the problem, where centers can be any point of the metric space. Therefore, a center can be a deleted point. We would like to thank you for bringing to our attention the interesting question about the discrete version of the problem we study in the paper. Indeed, we have thought about this question during the short rebuttal period and have a proof that explains how our dynamic data structure can also provide a constant-factor approximate solution for the discrete metric $k$-center problem with outliers, where centers must be selected from the input set of points. You can find the proof of this claim in the response to reviewer CFX4. We also attached a figure in the global rebuttal for a visual representation. Note that any feasible solution for the discrete version is also feasible for the non-discrete version; therefore, the optimal radius for the discrete version is at least as large as that for the non-discrete version. This implies that, for some applications, the non-discrete version can offer more precise clustering. We believe both the discrete and non-discrete versions are important, each being better suited to different applications. Besides the improvement in the approximation factor, another interesting aspect of our method is that it is adversarially robust, meaning it can handle an adversary who observes the computed solution after each update and adjusts the subsequent updates accordingly. In contrast, the dynamic $(14+\epsilon)$-approximation algorithm developed by Chan et al. is not adversarially robust. In particular, if their solution is revealed at any time, the adversary can impose an expensive update that requires a re-clustering from scratch.
Question: What is the memory usage of your algorithm?
Response: The space complexity of our algorithm is linear in the number of points per radius guess. This is because we maintain one instance for each guess of the radius, and each instance contains up to $n$ points. The space required to temporarily store the sample sets used in the algorithm is also bounded by the size of the dataset, since they are subsets of it.
The paper studies the k-center clustering problem with outliers in the fully dynamic setting. Specifically, given a metric space (M,d), in the (k,z)-clustering problem, the goal is to find at most k balls minimizing the maximum ball radius while excluding up to z points from the clustering. In the fully dynamic setting, points are inserted into or deleted from the underlying metric space, and the goal is to maintain the clustering faster than recomputing it from scratch after each insertion or deletion.
The main contribution of this paper is a fully-dynamic, provable approximation algorithm for the (k,z) clustering problem that achieves a (4+\eps)-approximation ratio and supports point updates in O(\eps^{-2} k^6 \log k \log \Delta) time, with the caveat that the algorithm covers all but at most (1+\eps)z points (also known as a bi-criteria guarantee in the approximation algorithms literature). The algorithm is randomized and works against an oblivious, online adversary. It also improves upon the approximation ratio of (14+\eps) achieved in the recent work of Chan, Lattanzi, Sozio, and Wang [5] for the same problem, at the cost of increasing the running time by a factor of k^4.
From a technical point of view, the paper first introduces a static (4+\epsilon) approximation algorithm for the problem. The algorithm is itself based on the idea of successive random sampling, which proceeds as follows:
- guess an optimal radius r (this is standard in (clustering) algorithms, and can be achieved by binary searching at the cost of a (1+\eps) factor in the approximation ratio, and a factor logarithmic in the aspect ratio of the underlying metric space in the runtime);
- pick uniformly at random a subset S of the current point set, and pick a center c from S that covers a good fraction of the current points;
- grow a ball of radius 4r around the center c;
- remove the points belonging to the ball from the current point set, and continue with the next iteration (level);
- the number of levels is bounded by roughly k
- the centers computed in each level are output as the solution to the (k,z) clustering problem
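For concreteness, the sampling loop above can be sketched as follows. This is a minimal toy illustration, not the paper's actual pseudocode; the function name, the sample size, and the 2r coverage radius used to pick a center are placeholders of my own:

```python
import random

def successive_sampling_kcenter(points, dist, k, z, r, sample_size=20):
    """Toy sketch of successive sampling: sample candidates, keep one whose
    ball covers many remaining points, carve out a ball of radius 4r
    around it, and recurse on the rest for up to k levels."""
    remaining = list(points)
    centers = []
    for _ in range(k):                       # number of levels is roughly k
        if len(remaining) <= z:              # leftovers are reported as outliers
            break
        sample = random.sample(remaining, min(sample_size, len(remaining)))
        # pick the sampled point covering the most remaining points
        # (2r is an assumed coverage radius for this sketch)
        c = max(sample,
                key=lambda s: sum(1 for p in remaining if dist(s, p) <= 2 * r))
        centers.append(c)
        # remove the points of the 4r-ball from the current point set
        remaining = [p for p in remaining if dist(c, p) > 4 * r]
    return centers, remaining
```

On a toy 1-D instance with two tight clusters and one far-away point, the procedure returns one center per cluster and leaves the far point uncovered as the outlier.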
The bulk of the effort is then to carefully set up a data structure that maintains these levels (and the associated information) under point insertions and deletions. There are two main invariants: (i) the level invariant and (ii) the dense invariant (which roughly controls the number of points that remain unclustered in each level). The moral message of the paper is that when the invariants are broken at some level i, you basically re-cluster everything from that level up.
Strengths
- The k-center clustering problem is fundamental in the clustering/unsupervised learning literature. The outlier version of the problem makes perfect sense, and it's of equal importance. The dynamic setting is very natural and has received increasing attention in the clustering literature in the last 5 years or so, with many papers trying to nail down the dynamic complexity of basic clustering objectives such as k-center, k-median and k-means. This work can be thought of as a continuation of this line of work.
- The algorithm is simple and elegant (it should also be very implementable). Appropriate effort is put in explaining high level ideas. The paper reads well.
Weaknesses
- Personally, I wouldn't agree that this is an improvement over the work of Chan et al. [5]. Indeed, it does achieve a better approximation ratio, but at the cost of increasing the runtime. The paper indeed provides a new, interesting trade-off, but I believe the abstract should discuss this more carefully.
- Regarding techniques, while one can argue that the idea of successive sampling is now folklore, one influential work that employs a very similar algorithm is Mettu and Plaxton's "Optimal Time Bounds for Approximate Clustering". While the original algorithm is for k-median and k-means, there are follow-up works that employ the same algorithm for k-center (albeit with worse approximation guarantees than 2). I would encourage the authors to discuss differences/similarities between Mettu-Plaxton and their work. In my opinion, this doesn't diminish the contribution of the paper at hand.
Questions
Minor comments:
- Lines 17-29: you should back up the applications of clustering across many subfields of computer science and beyond with citations.
- Line 36: 'various complications' sounds a bit odd (maybe 'challenges' would be a better fit here).
- Line 196: n_i, \alpha never defined before in text.
- Line 219: should your j start from 1 and not from i? -- similar comment for Definition 3.2
Major comment/question: Line 146 -- can you please elaborate on whether the non-discrete version of the (k,z) clustering you study here is less challenging? I imagine this helps you a lot with deletions; even if you delete a point in P, you can still let the underlying point in M serve as a cluster center, which doesn't invoke a re-build. Does the work of Chan et al. [5] have the same restriction?
Limitations
Yes.
Comment: Personally, I wouldn't agree that this is an improvement over the work of Chan et al. [5]. Indeed, it does achieve a better approximation ratio, but at the cost of increasing the runtime. The paper indeed provides a new, interesting trade-off, but I believe the abstract should discuss this more carefully.
Response: Thank you for the comment. We will discuss this trade-off in more detail in the full version. Besides the improvement in the approximation factor, another interesting aspect of our method is that it is adversarially robust, meaning it can handle an adversary who observes the computed solution after each update and adjusts the subsequent updates accordingly. In contrast, the dynamic $(14+\epsilon)$-approximation algorithm developed by Chan et al. is not adversarially robust. In particular, if their solution is revealed at any time, the adversary can impose an expensive update that requires a re-clustering from scratch.
Comment: Regarding techniques, [...] I would encourage the authors to discuss differences/similarities between Mettu-Plaxton and their work. In my opinion, this doesn't diminish the contribution of the paper at hand.
Response: Thank you for highlighting the work of Mettu and Plaxton on 'Optimal Time Bounds for Approximate Clustering' and the subsequent research. We will be sure to discuss the differences and similarities between the work of Mettu and Plaxton and ours.
Question: Lines 17-29: you should back up the applications of clustering with citations. Line 36: 'various complications' sounds a bit odd (maybe 'challenges' would be a better fit here). Line 196: n_i, \alpha never defined before in text. Line 219: should your j start from 1 and not from i? -- similar comment for Definition 3.2
Response: Thank you for the valuable suggestions. We will make sure to add more references and the mentioned definitions at the correct positions in the full version.
The subset of points not covered in level $i$ is disjoint from the clusters of the levels below $i$. Hence, a re-clustering of these points does not need to modify those clusters. Similarly, a deletion of a point at level $i$ does not lead to a violation of the invariants at levels below $i$, but can potentially violate the invariants at level $i$ and higher. Therefore, we only need to recluster the levels from $i$ upwards. We will make sure to write this part more clearly in the full version.
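In code, this repair rule could look like the following sketch. It is illustrative only: `offline_cluster` stands in for the static clustering procedure, and the dictionary-based level representation is invented for the example:

```python
def repair_from_level(levels, points, i, offline_cluster):
    """When an invariant is violated at level i, keep levels below i
    untouched and recompute everything from level i upwards on the
    points not covered by the kept levels."""
    kept = levels[:i]
    covered = set()
    for level in kept:                      # points already clustered below i
        covered |= level["cluster"]
    rest = [p for p in points if p not in covered]
    return kept + offline_cluster(rest)     # re-cluster only the remainder
```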
Question: Major comment/question: Line 146 -- can you [...] Does the work of Chan et al. [5] have the same restriction?
Response: We would like to thank you for bringing to our attention the interesting question about the discrete version of the problem we study in the paper. Indeed, we have thought about this question during the short rebuttal period and determined how our dynamic data structure can also provide a constant-factor approximate solution for the discrete metric $k$-center problem with outliers, where centers must be selected from the input set of points. Observe that any feasible solution for the discrete version is also valid for the non-discrete version. Therefore, the optimal radius in the discrete version is at least as large as that in the non-discrete version. This suggests that, in certain applications, the non-discrete version may provide more precise clustering. We believe that both the discrete and non-discrete versions are valuable, each being particularly suited to different applications. Chan et al. [5] consider the discrete version; however, they need to reconstruct the leveling after the deletion of a center, making their algorithm not adversarially robust.
Here, we prove how our data structure can also offer an approximate solution for the discrete version. We also attached a figure in the global rebuttal for a visual representation.
Claim: Our data structure can report a constant-factor approximate solution for the discrete version of the $(k,z)$-center problem with outliers without increasing the time or space complexity.
Proof: Let the level $i$ and the radius guess $r$ be fixed, and let $c_i$ be the center of cluster $C_i$ (recall that $B(c_i, r)$ is the ball of radius $r$ centered at $c_i$). To provide a solution for the discrete version, we do the following.
As long as $c_i$ is not deleted, we report $c_i$ as the $i$-th center. After the deletion of the point $c_i$, we consider two cases, depending on whether $B(c_i, r) \cap P$ is empty. If it is empty, few enough points remain uncovered that we can report all of them as outliers and stop. Otherwise, the dense invariant guarantees that $B(c_i, r) \cap P$ contains sufficiently many points, so there exists a point $q \in B(c_i, r) \cap P$. Then we report the point $q$ after the deletion of $c_i$.
We next prove the approximation guarantee. To do this, we show that $B(c_i, 4r) \subseteq B(q, 5r)$. This implies that any feasible solution in our data structure with radius $4r$ for the non-discrete version can be used to report a feasible solution with radius $5r$ for the discrete version. Also, see the attached figure in the global rebuttal for a visual representation. Let $p \in B(c_i, 4r)$. We show that $p \in B(q, 5r)$. Since $q \in B(c_i, r)$, we have $d(c_i, q) \le r$, where $d$ is the distance function. Moreover, $d(c_i, p) \le 4r$ since $p \in B(c_i, 4r)$. Then by the triangle inequality we have $d(q, p) \le d(q, c_i) + d(c_i, p) \le r + 4r = 5r$.
It remains to discuss the space and time complexity. To report a solution for the discrete version, it is enough to additionally keep $B(c_i, r) \cap P$ for each level. Note that our data structure already stores these points, and since $B(c_i, r) \cap P \subseteq P$, the space complexity and time complexity remain the same.
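The replacement argument above rests on the triangle inequality; it can be checked numerically on a toy one-dimensional metric (all values below are invented for illustration):

```python
def ball(points, center, radius, dist):
    """Points of `points` within `radius` of `center`."""
    return {p for p in points if dist(center, p) <= radius}

dist = lambda a, b: abs(a - b)
r, c, q = 1.0, 0.0, 0.8                      # q is a surviving point in B(c, r)
points = [-3.9, -1.0, 0.5, 2.0, 3.9, 7.0]    # 7.0 lies outside both balls

cluster = ball(points, c, 4 * r, dist)       # non-discrete cluster B(c, 4r)
replacement = ball(points, q, 5 * r, dist)   # discrete replacement ball B(q, 5r)
assert cluster <= replacement                # B(c, 4r) is contained in B(q, 5r)
```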
This paper studies the fully-dynamic $k$-center with outliers problem in general metric spaces. In this setting, operations (insertions and deletions) arrive over time. An algorithm is evaluated by its approximation ratio and its (amortized) update time. However, previous research has shown that when outliers must be excluded exactly, any algorithm with a bounded approximation ratio incurs a prohibitive update time. It is therefore reasonable to allow a trade-off, permitting slightly more outliers in order to achieve efficient update times.
The proposed algorithm can be summarized as follows: it partitions the dataset into at most roughly $k$ levels, where each level $i$ maintains a cluster $C_i$ with center $c_i$. The algorithm dynamically maintains a data structure that adapts to the operations performed in real time. At all times, the data structure maintains the following two invariants:
- level invariant: each level $i$ has a cluster $C_i$ such that $C_i \subseteq B(c_i, 4r)$.
- dense invariant: for each center $c_i$, the ball $B(c_i, r)$ covers sufficiently many points of the corresponding level.
The authors prove that this is a $(4+\epsilon)$-approximation algorithm with expected amortized update time $O(\epsilon^{-2} k^6 \log k \log \Delta)$.
Strengths
- The paper is very well written and easy to follow.
- The algorithm provides a $(4+\epsilon)$-approximation for the fully-dynamic $k$-center with outliers problem, which is an improvement over previous work.
- The paper includes a thorough theoretical analysis, proving the approximation guarantee and update time of the algorithm.
Weaknesses
- The update time complexity is not competitive, particularly for large $k$.
- The paper does not provide empirical results or practical evaluations of the algorithm, which could demonstrate its performance in real-world scenarios.
Questions
- Can this method be slightly modified to support changes of $k$ and $z$? Alternatively, are there any negative results on handling dynamic $k$ and $z$?
- Does the analysis of the algorithm include a specific space complexity? Additionally, I am curious about the hardness of fully-dynamic clustering with outliers. Can you provide lower bounds on the space and the update time in the fully-dynamic setting?
Limitations
Theoretical paper.
Comment: The update time complexity is not competitive, particularly for large $k$.
Response: Our goal was to develop a dynamic algorithm with a low approximation guarantee and a simple, elegant data structure. We did not focus on optimizing the exponent of $k$ in our running time. An interesting aspect of our method is that it is adversarially robust, meaning it can handle an adversary who observes the computed solution after each update and adjusts the subsequent updates accordingly. In contrast, the dynamic $(14+\epsilon)$-approximation algorithm developed by Chan et al. is not adversarially robust. If their solution is revealed at any time, the adversary can impose an expensive update that requires a re-clustering from scratch.
Comment: The paper does not provide empirical results or practical evaluations of the algorithm, which could demonstrate its performance in real-world scenarios.
Response: We agree with the reviewer that providing empirical results or practical evaluations would be valuable and interesting. Our main focus was to improve upon the $(14+\epsilon)$-approximation guarantee of [5] and determine how far the approximation factor can be refined, leading to the first adversarially robust dynamic $(4+\epsilon)$-approximation algorithm for the metric $k$-center problem with outliers. We leave conducting experiments and benchmarking our dynamic algorithm against the algorithm by Chan et al. on both real and synthetic data as future work.
Question: Can this method be slightly modified to support changes of $k$ and $z$? Alternatively, are there any negative results on handling dynamic $k$ and $z$?
Response: Our data structure is designed with a bounded number of levels, based on the known values of $k$ and $z$. We are unsure how to modify it to handle changes in $k$ and $z$ during the execution of the algorithm. Investigating the development of such a dynamic algorithm would indeed be an interesting challenge.
Question: Does the analysis of the algorithm include a specific space complexity? Additionally, I am curious about the hardness of fully-dynamic clustering with outliers. Can you provide lower bounds on the space and the update time in the fully-dynamic setting?
Response: The space complexity of our algorithm is linear in the number of points per radius guess. This is because we maintain one instance for each guess of the radius, and each instance contains up to $n$ points. The sample sets needed by the algorithm also require at most linear space, because they form a subset of the dataset. Establishing a space lower bound for fully dynamic algorithms for $k$-center with outliers would be an intriguing question.
Thanks for the clarification. I will keep my original rating.
The paper gives a new algorithm for the dynamic version of k-center clustering with outliers. The algorithm works in the fully dynamic model with both point insertions and deletions allowed. The points can belong to an arbitrary metric space, compared to some previous algorithms addressing low-dimensional metric spaces.
Strengths
This is an important problem and the paper significantly improves the approximation factor.
Weaknesses
The algorithm is randomized and works only against oblivious adversaries. This is a shared characteristic with the previous paper on this topic.
The algorithm achieves a better approximation factor than the previous algorithms, but it comes at the cost of a significantly higher dependency of the update time on the number of clusters, k. In particular, the factor of k^6 is probably prohibitive in practice.
Overall, even though this is a solid contribution to the study of the problem, this is not a breakthrough.
I also wish the authors did a better job highlighting the technical ideas that lead to the improvement over the previous algorithm.
Questions
Is there a good reason why the algorithm is randomized? What are the obstacles to obtaining deterministic algorithms in this model?
The factor of k^6 in your upper bound is large. Is there a reason why in practice the algorithm would be significantly less costly?
Limitations
No specific societal limitations to address. This is a theoretical work concerning a traditional computational problem.
Comment: The algorithm is randomized and works only against oblivious adversaries. This is a shared characteristic with the previous paper on this topic.
Question: Is there a good reason why the algorithm is randomized? What are the obstacles to obtaining deterministic algorithms in this model?
Response: We address these two points together in our response. Interestingly, we have discovered that our dynamic algorithm not only can handle oblivious adversaries but is even adversarially robust. A dynamic algorithm is called adversarially robust if its performance guarantees (including the approximation factor and update time) are preserved even when the sequence of updates (i.e., insertions and deletions) is adaptively chosen by an adversary who observes the outputs of our algorithm throughout the sequence and adjusts them accordingly. An adversarially robust dynamic algorithm is more powerful than one designed to handle only an oblivious adversary, as an oblivious adversary cannot access the algorithm's outputs after each update. In particular, we only use random bits to sample from large clusters at each level and maintain a solution. (This is the only component of our algorithm that uses randomness.) The adversary, however, can observe the maintained solution after each update and adaptively modify future updates (insertions and deletions) accordingly. In contrast, the dynamic $(14+\epsilon)$-approximation algorithm developed by Chan et al. is designed only for oblivious adversaries and is therefore not adversarially robust. In particular, if their solution were revealed at any time, the adversary could impose a costly update requiring a re-clustering from scratch.
Question: The factor of k^6 in your upper bound is large. Is there a reason why in practice the algorithm would be significantly less costly?
Response: Thank you for raising this interesting question. The worst case occurs when the gap between the size of the ball chosen in the offline algorithm and the threshold is negligible, but we believe this is not always the case in practice. To increase this gap, we propose selecting the ball of maximum size in Line 8 of the OfflineCluster procedure. This adjustment does not change the theoretical bounds but can make the algorithm faster in practice.
Thank you for your response!
we have discovered that our dynamic algorithm not only can handle oblivious adversaries but is even adversarially robust
This is very interesting and if true, it definitely makes your algorithm more interesting. I'm very curious about this but I'm not sure I'll have time to reread your paper to see if I have any doubts about this. Whatever happens to your paper, it would of course be great to explicitly address this in any future version of the paper.
Thank you to the reviewers for their insightful comments and valuable feedback. We also appreciate the time they dedicated to reviewing our work.
Here, we also provide a file with a figure illustrating our responses to reviewers CFX4 and fWP4.
This paper gives new ratio-time trade-offs for k-center with outliers in a general metric space. The reviews are overall positive, and the main strength is the improved ratio (albeit the dependence on k in the running time is somewhat worse). The main weaknesses are that the paper does not seem to make significant progress in terms of techniques, and that there are no experiments.
During the discussion, the authors claim their result also works against an adaptive adversary (which was not mentioned in the submitted version). This feature is important and could be a major strength of the paper. The authors should include the relevant claim/proof in the final version.