PaperHub
7.0
/10
Poster4 位审稿人
最低6最高8标准差0.7
7
7
8
6
3.3
置信度
正确性3.8
贡献度3.5
表达3.3
NeurIPS 2024

Clustering with Non-adaptive Subset Queries

OpenReviewPDF
提交: 2024-05-15更新: 2024-11-06
TL;DR

We provide efficient non-adaptive algorithms that query a near-linear (and near-optimal) number of subset-queries to infer clustering of a set of elements exactly.

摘要

Recovering the underlying clustering of a set $U$ of $n$ points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query $S \subset U$, $|S|=2$, the oracle returns "yes" if the points are in the same cluster and "no" otherwise. We study a natural generalization of this problem to subset queries for $|S|>2$, where the oracle returns the number of clusters intersecting $S$. Our aim is to determine the minimum number of queries needed for exactly recovering an arbitrary $k$-clustering. We focus on non-adaptive schemes, where all the queries are asked in one round, thus allowing for the querying process to be parallelized, which is a highly desirable property. For adaptive algorithms with pair-wise queries, the complexity is known to be $\Theta(nk)$, where $k$ is the number of clusters. In contrast, non-adaptive pair-wise query algorithms are extremely limited: even for $k=3$, such algorithms require $\Omega(n^2)$ queries, which matches the trivial $O(n^2)$ upper bound attained by querying every pair of points. Allowing for subset queries of unbounded size, $O(n)$ queries is possible with an adaptive scheme. However, the realm of non-adaptive algorithms remains completely unknown. Is it possible to attain algorithms that are non-adaptive while still making a near-linear number of queries? In this paper, we give the first non-adaptive algorithms for clustering with subset queries. We provide, (i) a non-adaptive algorithm making $O(n \log^2 n \log k)$ queries which improves to $O(n \log k)$ when the cluster sizes are within any constant factor of each other, (ii) for constant $k$, a non-adaptive algorithm making $O(n \log{\log{n}})$ queries. In addition to non-adaptivity, we take into account other practical considerations, such as enforcing a bound on query size. For constant $k$, we give an algorithm making $\smash{\widetilde{O}(n^2/s^2)}$ queries on subsets of size at most $s \leq \sqrt{n}$, which is optimal among all non-adaptive algorithms within a $\log n$-factor. For arbitrary $k$, the dependence varies as $\tilde{O}(n^2/s)$.
关键词
Clusteringquery algorithms

评审与讨论

审稿意见
7

[Setting] This paper studies the problem of clustering nn items into kk clusters using an oracle that can tell how many ground-truth clusters are represented in any given subset SS of items. The goal is to develop non-adaptive algorithms (where all queries are chosen before the oracle answers anything) and study their sample complexity.

[Contributions]

  1. Randomized non-adaptive algorithms for constant k that make:
    1. O(nkloglognlogk)O(n k \log \log n \log k) queries when subsets of any size can be queried
    2. O(nklognloglogn)O(n k \log n \log\log n) queries when S=O(n)|S| = O(\sqrt{n})
  2. Randomized non-adaptive algorithms for general k that make:
    1. O(nlognlogk(n/s+logs))O(n \log n \log k (n/s + \log s)) queries of size at most ss
  3. An Ω(min(n2/s2,n))\Omega(\min(n^2 / s^2, n)) lower bound for any non-adaptive algorithm that is allowed to query subsets of size at most ss.

The developed algorithms run in polynomial time.

Additionally, the paper also presents results for the case when :

  1. Clusters are roughly balanced (i.e., their sizes are within constant factors of each other). Here, the algorithm makes O(nlogk)O(n \log k) queries when k=O(n/log3n)k = O(n/\log^3 n) and O(nlog2k)O(n \log^2 k) queries for any kk.
  2. An adaptive round is allowed. This improves dependency on logarithmic factors.

For these results, the details are only in the appendices.

优点

  1. The paper is fairly comprehensive and presents query complexity results in a wide range of settings. The proposed algorithm are optimal up to logarithmic factors.
  2. The algorithms are well motivated. I especially appreciate the high-level intuition because of the clarity it adds to the paper.
  3. While I have some concerns about the problem setting (see Weaknesses), I believe the presented results are a very good first step towards filling an interesting research gap - non-adaptive algorithm for clustering with subset queries.

缺点

  1. My primary concern is with the feedback model. The oracle returns the number of clusters in the subset SS, which implies that it knows the correct grouping in SS. Why then would it only return the number of clusters in SS in practice and not the clusters themselves? Perhaps there are bandwidth constraints on response in some applications? But wouldn't one require way fewer queries if the oracle returns the clusters in SS, which alleviates the bandwidth concerns, especially in the non-adaptive setting?
  2. This is a relatively minor concern, but the paper defers a lot of details to the appendix without even including high-level ideas. I recommend including some details in the main paper, perhaps at the cost of a few technical details from Section 2. For example,
    1. What changes to the algorithm allow the query complexity to improve for the roughly balanced case?
    2. What is the "extra round of adaptivity"?

问题

  1. Please address point 1 under weaknesses.
  2. It seems to me that both Algorithm 1 and 2 work for any value of k. Can the authors elaborate on what they mean by constant kk and general kk? Do you have two algorithms because one of them has a better dependence on kk (O(logk)O(\log k) instead of O(klogk)O(k \log k))? On a similar note, how should one go about choosing between Algorithm 1 and 2 in practice?
  3. Your lower bound is based on the idea of converting subset queries to pairwise queries. Unfortunately, this hides the dependence on kk. Are you aware of any lower bounds in the pairwise case that depend on kk as well?

局限性

While it is true that all theorems have their assumptions listed, it would help the readers if the authors include a dedicated paragraph pointing out some of the research gaps that are left open by their work.

作者回复

Thanks for this constructive evaluation. Please see the global response for the question related to motivation. The rest of the responses are provided below. We will also add some high level ideas from the appendix to the main paper as suggested.

Can the authors elaborate on what they mean by constant kk and general kk

You are correct that both algorithms work for any value of kk. We have two algorithms because one of them has better query complexity when kk is small and the other has better query complexity when kk is large. Specifically, the algorithm of Theorem 1.2 makes O(nloglognklogk)O(n \log \log n \cdot k \log k) queries while Theorem 1.1 gives O(nlog2nlogk)O(n \log^2 n \log k). Thus, to be precise, Theorem 1.1 is better when kO(log2nloglogn)k \leq O(\frac{\log^2 n}{\log \log n}) and Theorem 1.2 is better otherwise. In practice, the algorithm should be chosen based on where kk falls with respect to this threshold. We wrote "constant kk" because we wanted to emphasize that Theorem 1.2 gives query complexity O(nloglogn)O(n \log \log n) when kk is a constant and "general kk" to emphasize that Theorem 1.1 gives query complexity O(nlog2nlogk)O(n \log^2 n \log k) for all kk. We will add a few lines in the write up to make this more clear.

Are you aware of any lower bounds in the pairwise case that depend on kk as well?

For non-adaptive pairwise query algorithms there is an Ω(n2)\Omega(n^2) lower bound for k=3k=3 which is as strong as possible (for any k3k\geq 3) since O(n2)O(n^2) is also a trivial upper bound for any kk. For k=2k = 2, O(n)O(n) is possible non-adaptively and this is optimal since the algorithm will need at least one pairwise query involving every point. Thus, this is the complete picture for non-adaptive pairwise query algorithms.

评论

Thank you taking time to respond to my questions. I appreciate the added motivation for subset queries and the clarification regarding constant-vs-general k. All the best for your submission !

审稿意见
7

The problem of cluster recovery via membership queries asks to assign each point in a dataset to its kk ground-truth cluster by using as few queries as possible to an oracle. A well-studied oracle is the same-cluster oracle that answers, given two points, whether the points are in the same or in different clusters. The complexity of adaptive algorithms for this problem is Θ(nk)\Theta(nk), but non-adaptive algorithms require Ω(n2)\Omega(n^2) for k3k \geq 3. The authors of this paper consider subset queries to surpass this barrier. Given a set of points SS, a subset query returns the number of clusters the points in SS belong to. It is known that the complexity of an adaptive algorithm is O(n)O(n). The authors consider non-adaptive algorithm and obtain the following results.

For general subset queries, they obtain an algorithm that makes O(nlog2nlogk)O(n \log^2 n \log k) queries for unbalanced clusters, and O(nlogk)O(n \log k) queries for balanced clusters. For constant kk, they obtain an algorithm that makes O(nloglogn)O(n \log \log n) queries. If the allowed subset size of the queries is bounded by sns \leq \sqrt{n}, their algorithms need an essentially optimal O~(n2/s2)\tilde{O}(n^2 / s^2) number of queries for constant kk, and O~(n2/s)\tilde{O}(n^2 / s) for arbitrary kk. Finally, they show that 2-round adaptivity can improve on the log factors in the complexity.

The algorithms follow two algorithmic approaches. For constant kk, the authors observe that one can use subset queries to identify an isolated point representing its cluster in a subset and explore the cluster from this point, or one can use subset queries that contain two points from the same cluster to sample intra-cluster edges. The two strategies have a reverse complexity with respect to the cluster size, and a combination of them give the final complexity trade-off. For general kk, the authors related subset queries to Boolean OR queries, and use these to recover clusters one by one.

优点

The paper provides a rather rich collection of results on foundational variants of the non-adaptive cluster recovery problem with near-linear complexity for unbounded subsets.

缺点

Unbounded subset size may be a rather strong assumption, and the results for bounded subset size of non-adaptive algorithms are not near-linear and almost trivial for constant subset size. One may be interested in few-round adaptive algorithms with bounded subset size.

问题

/

局限性

/

作者回复

Thank you for the very thoughtful review.

Unbounded subset size

We agree that unbounded subset size is a strong assumption and we would like to better understand the query complexity for bounded size in follow-up work. We have shown that O(n2s2klogn)O(\frac{n^2}{s^2} k\log n) is possible non-adaptively with sns \ll \sqrt{n} and this is essentially optimal in terms of nn and ss. It would be interesting to know if near-linear query complexity is possible with query sizes that are even smaller than n\sqrt{n}, say polylog(n){\rm poly}\log(n). We know this is not possible non-adaptively, but your idea of studying bounded subset size algorithms with few rounds of adaptivity is a great idea to look at in follow-up work.

评论

Thank you for your response! If I understand correctly, the details you provided agree with my understanding at large, and in particular, the bound is essentially optimal for "the largest possible" so(n)s \in o(\sqrt{n}) only (and not all 1sn1 \leq s \ll \sqrt{n}). If that's not the case, please follow up and feel free to let me know.

评论

Thank you for seeking clarification. We apologize if there were some ambiguities in the previous response.

Our result is optimal for all values of sns \leq \sqrt{n} within a lognloglogn\log n \log \log n factor.

In particular, our algorithm in Theorem A.1 (or the simplified version, 1.4) achieves O(n2s2klognloglogn)O(\frac{n^2}{s^2} k \log n \log \log n) non-adaptive queries for any sns \leq \sqrt{n}. Thus for constant k (which is the regime for Theorem A.1 and 1.4), our bounds are optimal for all values of sns \leq \sqrt{n} within the O(lognloglogn)O(\log n \log \log n) factor since n2s2\frac{n^2}{s^2} is a lower bound for non-adaptive queries.

So to summarize, it is not correct that the bound is optimal only for "the largest possible" so(n)s \in o(\sqrt{n}). Instead, the bound is optimal for all values of sns \leq \sqrt{n}.

评论

Sorry for the confusion, that was my bad, and thank you for the clarification!

Edit note for review: Taking the whole rebuttal into account, I think this paper should be accepted (change score 6 -> 7).

审稿意见
8

The paper studies the clustering problem with non-adaptive subset queries. The problem formulation is as follows. Suppose we are given an oracle q:VRq: \mathcal{V}\rightarrow \mathbb{R} such that for any query on a subset SVS\subseteq V of vertices, the oracle returns how many clusters are in SS in the optimal clustering. The goal is to recover the optimal clustering with subset queries that are as small as possible. The problem setting is similar to the clustering with same-cluster query problem, where the oracle returns whether two vertices (u,v)(u,v) are in the same cluster in the optimal clustering. Both problems are motivated by scenarios where we could use semi-supervised learning to obtain such types of oracles. However, in that setting, the number of necessary queries becomes Ω(n2)\Omega(n^2) when kk is at least 33. As such, to circumvent this strong barrier, it is natural to look into the new model with extra power for the queries.

In the new model, we can easily obtain an algorithm O(nlogk)O(n\log{k}) queries by binary search. However, such an algorithm requires at least Θ(logk)\Theta(\log{k}) rounds of adaptivity. The paper focused on the non-adaptive setting, where the queries have to be made independent of the results of previous queries. The main results include i).i). an algorithm that makes O(nloglogn)O(n \log\log{n}) queries when k=O(1)k=O(1) (linear dependency on kk), and; ii).ii). an algorithm that makes O(nlog2nlogk)O(n\log^{2}{n} \log{k}) queries for general kk. The paper also studied the case when the maximum size of SS is bounded and obtained tight bounds when Sn{|S|}\leq \sqrt{n}.

Techniques. The two main algorithms used different ideas. For the algorithm with constant kk, the key observation is that after recovering the biggest cluster(s) in the queried set, there are non-trivial probability that the remaining of the queried set has exactly one or two points. For the two cases, we can recover clusters via subset queries for some specific sizes. As such, we can repeat enough times, and use a ‘valid’ set to recursively recover the clusters in the reconstruction phase. For the algorithm with general kk, the paper reduces the problem to solving the OR queries of boolean vectors. In this way, the paper provides some new results for OR queries and obtains the desired bound.

优点

I have a positive view of the paper. In my opinion, the problem setting is well-motivated, and it is a natural extension of the well-studied same-cluster query model. The paper is well-written, and the authors did a nice job explaining the ideas in the high-level overview. Therefore, although the techniques are non-trivial, I managed to follow the ideas well. Given the page limits of the conference, the paper still managed to do fairly well in terms of organization.

缺点

I don’t see any major issue in the paper. Some minor issues to address:

  • It might be good to further clarify what ‘non-adaptive’ means in your paper. I think in your algorithm, you could still adaptively ‘add on’ to clusters after making the queries. For a moment I thought your algorithm could conduct logn\log {n} rounds in parallel to recover all clusters, which will be even stronger in the ‘non-adaptive’ notion.
  • The paper doesn’t contain experiments. I’m personally OK with it, but I think for NeurIPS you might want to justify why this is not an issue. I think AKB [NeurIPS’16] (‘clustering with same-cluster queries’) gives a nice motivation for the same-cluster oracle model, and maybe this paper should provide some similar justifications.
  • The dependence on kk should be mentioned after theorem 1.2. I think O(nk)O(nk) is really not that bad – I thought it was something like 2k2^k.

问题

Most of the questions are asked in the ‘weakness’ part. Two MISC questions:

  • Line 165: should it be ploglognp\leq \log\log{n}?
  • If we just want the clustering of an α\alpha fraction of the vertices, can we use an even smaller number of queries?

局限性

I do not see any potential negative societal impacts.

作者回复

Thanks for this very encouraging message. Please see the responses to your evaluation below.

It might be good to further clarify what ‘non-adaptive’ means in your paper

Thank you, that is a good point. Our notion of non-adaptivity is only that the queries are made in one round. I.e. queries are not made based on responses from previous queries. Once the query responses are obtained there is no adaptivity-esque restriction on how the algorithm recovers the clusters. We agree that this should be clarified, and will do so in the future version.

Experiments

Being a theoretical paper, we thought simulations will not add a lot of value to the paper; however we can certainly add simulation of our algorithm.

The dependence on kk

Thank you, that is a great point. We will add clarification regarding the dependence on kk directly after the theorem statement.

Line 165: should it be ploglognp \leq \log \log n?

No, here plognp \leq \log n is correct since δ\delta could be as small as 1/n1/n and we need 2p2^p to approximate 1/δ1/\delta within a constant factor for some pp. Note that in this paragraph we are only describing how this strategy leads to a O(nlogn)O(n \log n) query algorithm and the O(nloglogn)O(n \log \log n) comes by combining the two strategies as described in the subsequent paragraphs.

If we just want the clustering of an α\alpha fraction of the vertices, can we use an even smaller number of queries?

This is a great question. Identifying the cluster containing any one point requires Ω(logk)\Omega(\log k) bits and so determining the clustering of αn\alpha n points requires Ω(αnlogk)\Omega(\alpha n \log k) bits. Since a query gives O(logk)O(\log k) bits of information we require Ω(αn)\Omega(\alpha n) queries to cluster an α\alpha-fraction of the points. In particular, for any constant α\alpha, the Ω(n)\Omega(n) lower bound still holds.

To answer your question directly, we do think there is a simple way to modify our algorithm to give improved query complexity in this case. For instance in "Strategy 1" described in line 161, we only need to handle the case of δα/2\delta \geq \alpha/2 since we can ignore clusters that are smaller than α2n\frac{\alpha}{2}n (we are talking specifically about the k=3k=3 case, but this could be generalized). Thus we only need to iterate over plog2αp \leq \log \frac{2}{\alpha} and this naturally leads to a O(nlog1α)O(n \log \frac{1}{\alpha}) query algorithm, which is O(n)O(n) for constant α\alpha. We will add this to the paper.

It would be interesting to see if it is possible to get close to O(αn)O(\alpha n) for general α\alpha, and we don't immediately see how to achieve this using our current ideas. Thank you for the great question.

评论

Thanks for the clarifications and the response. I do not have any further questions.

审稿意见
6

The paper gives results on clustering a dataset using "subset queries." This is a generalization of "same-cluster queries." The same cluster query asks whether two given elements belong to the same optimal cluster. The subset query asks how many different clusters the elements of a given subset span. There has been recent interest in clustering using same-cluster queries. This paper is an extension of this line of work. Clearly, O(n^2) queries are sufficient to cluster all the points. We would want to explore the possibility of making a much smaller number of queries. There are two sub-models to consider when discussing subset queries:

  1. Adaptive queries
  2. Non-adaptive queries. Adaptive queries allow the query algorithm to adapt new queries based on the response for the earlier queries, whereas, in the non-adaptive model, all subsets should be predefined before making any queries. Adaptive queries are more powerful since O(nk) adaptive same-cluster queries are sufficient, whereas O(n^2) are required in the non-adaptive setting. This paper gives results in the non-adaptive setting:
  • An O(n log^2{n} log{k}) query algorithm for general values of k. This is using a reduction to a known problem in group testing. This is improved to O(n log{k}) for the balanced case (when cluster sizes are within a constant factor of each other).
  • An O(n log log{n}) algorithm for constant values of k. This is using a simple randomized method.

优点

  • From a purely theoretical point of view, the results are interesting. The discussion is elaborate, with various special cases discussed. Except for logarithmic factors, the bounds are tight.
  • The writing is good. The intuition is clearly stated within the main write-up, and the proofs are in the appendix.

缺点

  • The motivation for considering subset queries is not immediately clear. The motivation for same-cluster queries was easier to understand since one can set up a crowdsource platform and ask users to check whether two objects (say images) belong to the same class. Setting up subset queries for large subsets may be tricky since answering such queries is not simple.

问题

  • Subset queries may not be easy to answer accurately, specifically when the subsets are large. However, answering such queries approximately (within some margin of error) may be possible. It may be interesting to explore whether some interesting clustering can be obtained using a small number of such "inaccurate approximate queries."

局限性

This work is theoretical, and there are no potential negative societal impact.

作者回复

Thanks for the thoughtful evaluation of the paper. Please see the global response for the motivation of the model.

Subset queries may not be easy to answer accurately

This is a fantastic point. Clustering under noisy subset queries is a direction we're interested in exploring in follow-up work. Positive results under a meaningful/realistic noise model would create a more convincing justification for studying subset queries. We believe that our work is an important first step towards this.

评论

Thanks for your response. I have no further questions. I will maintain my current score.

作者回复

We sincerely thank all of the reviewers for their very thoughtful reviews and comments.

Motivation for Subset Queries.

As reviewer SHzL has pointed out, clustering with subset queries is a natural extension of the well-studied same-cluster queries. While with same-cluster queries we would require O(n2)O(n^2) non-adaptive queries to recover the clusters even for small kk, here we show that just near-linear number of queries suffice with subset queries. To address the questions related to the practical motivation for the model, we provide a few justifications below. Rest of the questions have been addressed with a response to each reviewer separately.

  • In the paper, we have considered non-adaptivity where queries are asked in one round as a major practical motivation. Several prior works have shown it is extremely time consuming to get query answers adaptively from a crowd. Subset queries is the only way to get good query complexity while considering non-adaptivity.
  • As a reviewer pointed out, bounded size queries are more practical than unbounded size queries. We have studied the effect of query size for both of our algorithms. Our strategy 1 gives optimal dependency on query size.
  • Our work serves as a stepping stone for future work to consider few rounds of adaptivity (partial answers have been provided in Appendix F), and where errors are allowed in query answers.
  • As reviewer hP14 brings up, if there is a bandwidth constraint, then returning just a number is much easier. Suppose the algorithm sends its queries S1,,SqS_1,\ldots,S_q off to qq different entities. Suppose these entities have a bandwidth constraint in that they do not want to send back much information. Answering a subset query requires them to send only O(logk)O(\log k) bits (just a number), instead of O(Silogk)O(|S_i| \log k) needed for describing the whole clustering on SiS_i. The total number of bits sent is O(qlogk)=O~(n)O(q \log k) = \widetilde{O}(n) and the max number of bits sent by any entity is O(logk)O(\log k).
  • If the model was instead that the query returned the entire clustering on SS, then the problem is trivially solvable with 11 query when unbounded size is allowed. On the other hand, for bounded size queries, it does not change the information theoretic lower bound.
  • Consider the problem of counting the number of clusters represented in SS vs computing the clustering in SS from the perspective of an entity answering a query. The counting problem is easier, because one does not have to resolve ambiguities in assignments. Therefore, even if assignments can be erroneous, counts are likely to be correct (hence more robust).

Again, thank you to all of the reviewers. We greatly appreciate the time and effort you've invested to carefully read and evaluate our work.

最终决定

This paper studies the problem of reconstructing a clustering by asking queries to the oracle, and proposes a new model for such interaction. The main weakness of the paper, mentioned by all four reviewers, is an unrealistically strong assumption that the oracle can return the exact count of clusters in an unbounded size query. This is a quite serious concern especially given that what we get from this strong assumptions (compared to the more standard pairwise same-cluster queries) is the ability to ask queries nonadaptiviely and to still get a nontrivial upper bound. I'm not entirely convinced by the arguments brought in the rebuttal, but the reviewers seem to be very positive about the paper regardless, so I lean towards acceptance.