Clustering with Non-adaptive Subset Queries
We provide efficient non-adaptive algorithms that query a near-linear (and near-optimal) number of subset-queries to infer clustering of a set of elements exactly.
摘要
评审与讨论
[Setting] This paper studies the problem of clustering items into clusters using an oracle that can tell how many ground-truth clusters are represented in any given subset of items. The goal is to develop non-adaptive algorithms (where all queries are chosen before the oracle answers anything) and study their sample complexity.
[Contributions]
- Randomized non-adaptive algorithms for constant k that make:
- queries when subsets of any size can be queried
- queries when
- Randomized non-adaptive algorithms for general k that make:
- queries of size at most
- An lower bound for any non-adaptive algorithm that is allowed to query subsets of size at most .
The developed algorithms run in polynomial time.
Additionally, the paper also presents results for the case when :
- Clusters are roughly balanced (i.e., their sizes are within constant factors of each other). Here, the algorithm makes queries when and queries for any .
- An adaptive round is allowed. This improves dependency on logarithmic factors.
For these results, the details are only in the appendices.
优点
- The paper is fairly comprehensive and presents query complexity results in a wide range of settings. The proposed algorithm are optimal up to logarithmic factors.
- The algorithms are well motivated. I especially appreciate the high-level intuition because of the clarity it adds to the paper.
- While I have some concerns about the problem setting (see Weaknesses), I believe the presented results are a very good first step towards filling an interesting research gap - non-adaptive algorithm for clustering with subset queries.
缺点
- My primary concern is with the feedback model. The oracle returns the number of clusters in the subset , which implies that it knows the correct grouping in . Why then would it only return the number of clusters in in practice and not the clusters themselves? Perhaps there are bandwidth constraints on response in some applications? But wouldn't one require way fewer queries if the oracle returns the clusters in , which alleviates the bandwidth concerns, especially in the non-adaptive setting?
- This is a relatively minor concern, but the paper defers a lot of details to the appendix without even including high-level ideas. I recommend including some details in the main paper, perhaps at the cost of a few technical details from Section 2. For example,
- What changes to the algorithm allow the query complexity to improve for the roughly balanced case?
- What is the "extra round of adaptivity"?
问题
- Please address point 1 under weaknesses.
- It seems to me that both Algorithm 1 and 2 work for any value of k. Can the authors elaborate on what they mean by constant and general ? Do you have two algorithms because one of them has a better dependence on ( instead of )? On a similar note, how should one go about choosing between Algorithm 1 and 2 in practice?
- Your lower bound is based on the idea of converting subset queries to pairwise queries. Unfortunately, this hides the dependence on . Are you aware of any lower bounds in the pairwise case that depend on as well?
局限性
While it is true that all theorems have their assumptions listed, it would help the readers if the authors include a dedicated paragraph pointing out some of the research gaps that are left open by their work.
Thanks for this constructive evaluation. Please see the global response for the question related to motivation. The rest of the responses are provided below. We will also add some high level ideas from the appendix to the main paper as suggested.
Can the authors elaborate on what they mean by constant and general
You are correct that both algorithms work for any value of . We have two algorithms because one of them has better query complexity when is small and the other has better query complexity when is large. Specifically, the algorithm of Theorem 1.2 makes queries while Theorem 1.1 gives . Thus, to be precise, Theorem 1.1 is better when and Theorem 1.2 is better otherwise. In practice, the algorithm should be chosen based on where falls with respect to this threshold. We wrote "constant " because we wanted to emphasize that Theorem 1.2 gives query complexity when is a constant and "general " to emphasize that Theorem 1.1 gives query complexity for all . We will add a few lines in the write up to make this more clear.
Are you aware of any lower bounds in the pairwise case that depend on as well?
For non-adaptive pairwise query algorithms there is an lower bound for which is as strong as possible (for any ) since is also a trivial upper bound for any . For , is possible non-adaptively and this is optimal since the algorithm will need at least one pairwise query involving every point. Thus, this is the complete picture for non-adaptive pairwise query algorithms.
Thank you taking time to respond to my questions. I appreciate the added motivation for subset queries and the clarification regarding constant-vs-general k. All the best for your submission !
The problem of cluster recovery via membership queries asks to assign each point in a dataset to its ground-truth cluster by using as few queries as possible to an oracle. A well-studied oracle is the same-cluster oracle that answers, given two points, whether the points are in the same or in different clusters. The complexity of adaptive algorithms for this problem is , but non-adaptive algorithms require for . The authors of this paper consider subset queries to surpass this barrier. Given a set of points , a subset query returns the number of clusters the points in belong to. It is known that the complexity of an adaptive algorithm is . The authors consider non-adaptive algorithm and obtain the following results.
For general subset queries, they obtain an algorithm that makes queries for unbalanced clusters, and queries for balanced clusters. For constant , they obtain an algorithm that makes queries. If the allowed subset size of the queries is bounded by , their algorithms need an essentially optimal number of queries for constant , and for arbitrary . Finally, they show that 2-round adaptivity can improve on the log factors in the complexity.
The algorithms follow two algorithmic approaches. For constant , the authors observe that one can use subset queries to identify an isolated point representing its cluster in a subset and explore the cluster from this point, or one can use subset queries that contain two points from the same cluster to sample intra-cluster edges. The two strategies have a reverse complexity with respect to the cluster size, and a combination of them give the final complexity trade-off. For general , the authors related subset queries to Boolean OR queries, and use these to recover clusters one by one.
优点
The paper provides a rather rich collection of results on foundational variants of the non-adaptive cluster recovery problem with near-linear complexity for unbounded subsets.
缺点
Unbounded subset size may be a rather strong assumption, and the results for bounded subset size of non-adaptive algorithms are not near-linear and almost trivial for constant subset size. One may be interested in few-round adaptive algorithms with bounded subset size.
问题
/
局限性
/
Thank you for the very thoughtful review.
Unbounded subset size
We agree that unbounded subset size is a strong assumption and we would like to better understand the query complexity for bounded size in follow-up work. We have shown that is possible non-adaptively with and this is essentially optimal in terms of and . It would be interesting to know if near-linear query complexity is possible with query sizes that are even smaller than , say . We know this is not possible non-adaptively, but your idea of studying bounded subset size algorithms with few rounds of adaptivity is a great idea to look at in follow-up work.
Thank you for your response! If I understand correctly, the details you provided agree with my understanding at large, and in particular, the bound is essentially optimal for "the largest possible" only (and not all ). If that's not the case, please follow up and feel free to let me know.
Thank you for seeking clarification. We apologize if there were some ambiguities in the previous response.
Our result is optimal for all values of within a factor.
In particular, our algorithm in Theorem A.1 (or the simplified version, 1.4) achieves non-adaptive queries for any . Thus for constant k (which is the regime for Theorem A.1 and 1.4), our bounds are optimal for all values of within the factor since is a lower bound for non-adaptive queries.
So to summarize, it is not correct that the bound is optimal only for "the largest possible" . Instead, the bound is optimal for all values of .
Sorry for the confusion, that was my bad, and thank you for the clarification!
Edit note for review: Taking the whole rebuttal into account, I think this paper should be accepted (change score 6 -> 7).
The paper studies the clustering problem with non-adaptive subset queries. The problem formulation is as follows. Suppose we are given an oracle such that for any query on a subset of vertices, the oracle returns how many clusters are in in the optimal clustering. The goal is to recover the optimal clustering with subset queries that are as small as possible. The problem setting is similar to the clustering with same-cluster query problem, where the oracle returns whether two vertices are in the same cluster in the optimal clustering. Both problems are motivated by scenarios where we could use semi-supervised learning to obtain such types of oracles. However, in that setting, the number of necessary queries becomes when is at least . As such, to circumvent this strong barrier, it is natural to look into the new model with extra power for the queries.
In the new model, we can easily obtain an algorithm queries by binary search. However, such an algorithm requires at least rounds of adaptivity. The paper focused on the non-adaptive setting, where the queries have to be made independent of the results of previous queries. The main results include an algorithm that makes queries when (linear dependency on ), and; an algorithm that makes queries for general . The paper also studied the case when the maximum size of is bounded and obtained tight bounds when .
Techniques. The two main algorithms used different ideas. For the algorithm with constant , the key observation is that after recovering the biggest cluster(s) in the queried set, there are non-trivial probability that the remaining of the queried set has exactly one or two points. For the two cases, we can recover clusters via subset queries for some specific sizes. As such, we can repeat enough times, and use a ‘valid’ set to recursively recover the clusters in the reconstruction phase. For the algorithm with general , the paper reduces the problem to solving the OR queries of boolean vectors. In this way, the paper provides some new results for OR queries and obtains the desired bound.
优点
I have a positive view of the paper. In my opinion, the problem setting is well-motivated, and it is a natural extension of the well-studied same-cluster query model. The paper is well-written, and the authors did a nice job explaining the ideas in the high-level overview. Therefore, although the techniques are non-trivial, I managed to follow the ideas well. Given the page limits of the conference, the paper still managed to do fairly well in terms of organization.
缺点
I don’t see any major issue in the paper. Some minor issues to address:
- It might be good to further clarify what ‘non-adaptive’ means in your paper. I think in your algorithm, you could still adaptively ‘add on’ to clusters after making the queries. For a moment I thought your algorithm could conduct rounds in parallel to recover all clusters, which will be even stronger in the ‘non-adaptive’ notion.
- The paper doesn’t contain experiments. I’m personally OK with it, but I think for NeurIPS you might want to justify why this is not an issue. I think AKB [NeurIPS’16] (‘clustering with same-cluster queries’) gives a nice motivation for the same-cluster oracle model, and maybe this paper should provide some similar justifications.
- The dependence on should be mentioned after theorem 1.2. I think is really not that bad – I thought it was something like .
问题
Most of the questions are asked in the ‘weakness’ part. Two MISC questions:
- Line 165: should it be ?
- If we just want the clustering of an fraction of the vertices, can we use an even smaller number of queries?
局限性
I do not see any potential negative societal impacts.
Thanks for this very encouraging message. Please see the responses to your evaluation below.
It might be good to further clarify what ‘non-adaptive’ means in your paper
Thank you, that is a good point. Our notion of non-adaptivity is only that the queries are made in one round. I.e. queries are not made based on responses from previous queries. Once the query responses are obtained there is no adaptivity-esque restriction on how the algorithm recovers the clusters. We agree that this should be clarified, and will do so in the future version.
Experiments
Being a theoretical paper, we thought simulations will not add a lot of value to the paper; however we can certainly add simulation of our algorithm.
The dependence on
Thank you, that is a great point. We will add clarification regarding the dependence on directly after the theorem statement.
Line 165: should it be ?
No, here is correct since could be as small as and we need to approximate within a constant factor for some . Note that in this paragraph we are only describing how this strategy leads to a query algorithm and the comes by combining the two strategies as described in the subsequent paragraphs.
If we just want the clustering of an fraction of the vertices, can we use an even smaller number of queries?
This is a great question. Identifying the cluster containing any one point requires bits and so determining the clustering of points requires bits. Since a query gives bits of information we require queries to cluster an -fraction of the points. In particular, for any constant , the lower bound still holds.
To answer your question directly, we do think there is a simple way to modify our algorithm to give improved query complexity in this case. For instance in "Strategy 1" described in line 161, we only need to handle the case of since we can ignore clusters that are smaller than (we are talking specifically about the case, but this could be generalized). Thus we only need to iterate over and this naturally leads to a query algorithm, which is for constant . We will add this to the paper.
It would be interesting to see if it is possible to get close to for general , and we don't immediately see how to achieve this using our current ideas. Thank you for the great question.
Thanks for the clarifications and the response. I do not have any further questions.
The paper gives results on clustering a dataset using "subset queries." This is a generalization of "same-cluster queries." The same cluster query asks whether two given elements belong to the same optimal cluster. The subset query asks how many different clusters the elements of a given subset span. There has been recent interest in clustering using same-cluster queries. This paper is an extension of this line of work. Clearly, O(n^2) queries are sufficient to cluster all the points. We would want to explore the possibility of making a much smaller number of queries. There are two sub-models to consider when discussing subset queries:
- Adaptive queries
- Non-adaptive queries. Adaptive queries allow the query algorithm to adapt new queries based on the response for the earlier queries, whereas, in the non-adaptive model, all subsets should be predefined before making any queries. Adaptive queries are more powerful since O(nk) adaptive same-cluster queries are sufficient, whereas O(n^2) are required in the non-adaptive setting. This paper gives results in the non-adaptive setting:
- An O(n log^2{n} log{k}) query algorithm for general values of k. This is using a reduction to a known problem in group testing. This is improved to O(n log{k}) for the balanced case (when cluster sizes are within a constant factor of each other).
- An O(n log log{n}) algorithm for constant values of k. This is using a simple randomized method.
优点
- From a purely theoretical point of view, the results are interesting. The discussion is elaborate, with various special cases discussed. Except for logarithmic factors, the bounds are tight.
- The writing is good. The intuition is clearly stated within the main write-up, and the proofs are in the appendix.
缺点
- The motivation for considering subset queries is not immediately clear. The motivation for same-cluster queries was easier to understand since one can set up a crowdsource platform and ask users to check whether two objects (say images) belong to the same class. Setting up subset queries for large subsets may be tricky since answering such queries is not simple.
问题
- Subset queries may not be easy to answer accurately, specifically when the subsets are large. However, answering such queries approximately (within some margin of error) may be possible. It may be interesting to explore whether some interesting clustering can be obtained using a small number of such "inaccurate approximate queries."
局限性
This work is theoretical, and there are no potential negative societal impact.
Thanks for the thoughtful evaluation of the paper. Please see the global response for the motivation of the model.
Subset queries may not be easy to answer accurately
This is a fantastic point. Clustering under noisy subset queries is a direction we're interested in exploring in follow-up work. Positive results under a meaningful/realistic noise model would create a more convincing justification for studying subset queries. We believe that our work is an important first step towards this.
Thanks for your response. I have no further questions. I will maintain my current score.
We sincerely thank all of the reviewers for their very thoughtful reviews and comments.
Motivation for Subset Queries.
As reviewer SHzL has pointed out, clustering with subset queries is a natural extension of the well-studied same-cluster queries. While with same-cluster queries we would require non-adaptive queries to recover the clusters even for small , here we show that just near-linear number of queries suffice with subset queries. To address the questions related to the practical motivation for the model, we provide a few justifications below. Rest of the questions have been addressed with a response to each reviewer separately.
- In the paper, we have considered non-adaptivity where queries are asked in one round as a major practical motivation. Several prior works have shown it is extremely time consuming to get query answers adaptively from a crowd. Subset queries is the only way to get good query complexity while considering non-adaptivity.
- As a reviewer pointed out, bounded size queries are more practical than unbounded size queries. We have studied the effect of query size for both of our algorithms. Our strategy 1 gives optimal dependency on query size.
- Our work serves as a stepping stone for future work to consider few rounds of adaptivity (partial answers have been provided in Appendix F), and where errors are allowed in query answers.
- As reviewer hP14 brings up, if there is a bandwidth constraint, then returning just a number is much easier. Suppose the algorithm sends its queries off to different entities. Suppose these entities have a bandwidth constraint in that they do not want to send back much information. Answering a subset query requires them to send only bits (just a number), instead of needed for describing the whole clustering on . The total number of bits sent is and the max number of bits sent by any entity is .
- If the model was instead that the query returned the entire clustering on , then the problem is trivially solvable with query when unbounded size is allowed. On the other hand, for bounded size queries, it does not change the information theoretic lower bound.
- Consider the problem of counting the number of clusters represented in vs computing the clustering in from the perspective of an entity answering a query. The counting problem is easier, because one does not have to resolve ambiguities in assignments. Therefore, even if assignments can be erroneous, counts are likely to be correct (hence more robust).
Again, thank you to all of the reviewers. We greatly appreciate the time and effort you've invested to carefully read and evaluate our work.
This paper studies the problem of reconstructing a clustering by asking queries to the oracle, and proposes a new model for such interaction. The main weakness of the paper, mentioned by all four reviewers, is an unrealistically strong assumption that the oracle can return the exact count of clusters in an unbounded size query. This is a quite serious concern especially given that what we get from this strong assumptions (compared to the more standard pairwise same-cluster queries) is the ability to ask queries nonadaptiviely and to still get a nontrivial upper bound. I'm not entirely convinced by the arguments brought in the rebuttal, but the reviewers seem to be very positive about the paper regardless, so I lean towards acceptance.