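PaperHub
Average rating: 5.5/10
Decision: Rejected (6 reviewers)
Lowest rating: 5, highest rating: 8, standard deviation: 1.1
Individual ratings: 5, 5, 5, 5, 8, 5
Average confidence: 2.8
Correctness: 3.0, Contribution: 2.2, Presentation: 2.7
ICLR 2025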

Measuring Diversity: Axioms and Challenges

Submitted: 2024-09-27, updated: 2025-02-05

Keywords
diversity measure, desirable properties, axiomatic approach

Reviews and Discussion

Review
Rating: 5

This paper critiques existing diversity measures, illustrating their unreliability through examples. It introduces three axioms: monotonicity, uniqueness, and continuity which are essential for a reliable measure. None of the known measures satisfy all three properties. The authors propose two new measures that meet these criteria but are computationally complex. Future research is needed to develop a practical diversity measure that adheres to all three axioms.

Strengths

  • The research problem in the paper is meaningful.
  • The paper is well-organized and easy to follow.
  • The examples and analysis are detailed.

Weaknesses

  • Some writing needs to be more clear.
  • Some undefined notations make part of the paper hard to understand.
  • Typos and grammar issues.

Questions

  1. The authors claim that, "we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity." And it is mentioned several times in the text that a reliable diversity metric should satisfy these three properties. To confirm, do the authors believe that it is a reliable metric as long as these three properties are satisfied? My question is, how is a reliable diversity metric defined? Do related studies support it? Does a reliable diversity metric only need to satisfy these three properties, and are there no other properties that need to be satisfied?

  2. Some undefined notations make the paper hard to understand.

  • On line 78 of page 2, what is the range of $i$ and $j$, and is there a size relationship?
  • On line 90 of page 2, what is the meaning of $t$? There is no explanation and no reference for it.
  • On line 364 of page 7, what is the Equation (7)?
  • Etc.
  3. Some writing needs to be more clear. For example, on lines 78-79 of page 2, the authors say that, "For generality purposes, we do not require the triangle inequality to be satisfied by $d_{ij}$." No more explanations. I do not know why.

  4. Diversity is a scientific topic that has been studied for a long time, but the paper has only just a few references. It is recommended that the authors add references when introducing each comparative metric. Moreover, compare more diversity metrics, such as the Gini index, Coverage, etc.

  5. Typos and grammar issues:

  • On line 87 of page 3, "(often referred to as Bottleneck and Diameter, respectfully)."
  • On lines 108-110 of page 3, "Some previous works on measuring diversity analyze and compare measures based on properties they do or do not satisfy. We review these works in Section 4.2." While on lines 251-252 of page 5, "Several papers analyzed and compared diversity measures in terms of properties they do or do not satisfy."
  • On line 252 of page 5, "For instance, Xie et al. (2023) formulates three axioms."
  • Etc.
Comment

Thank you for your feedback!

To confirm, do the authors believe that it is a reliable metric as long as these three properties are satisfied? My question is, how is a reliable diversity metric defined? Do related studies support it? Does a reliable diversity metric only need to satisfy these three properties, and are there no other properties that need to be satisfied?

In the paper, we only claim that a reliable measure has to satisfy these properties and explain why each property is needed. We do not say that if a measure satisfies all three of them, then it is necessarily good. It can be the case that some additional properties are desirable for a good measure of diversity (for instance, for specific applications there can be specific requirements). However, we could not come up with more properties that are undoubtedly desirable for a general measure. Also, we do observe that none of the existing measures has all the properties and thus even this set of axioms is hard to satisfy.

On line 78 of page 2, what is the range of i and j, and is there a size relationship?

In lines 76-78 we write the following: “We assume that we are given a collection of $n$ (possibly duplicated) objects $X=(x_1, \dots, x_n)$ and pairwise distances (dissimilarities) between them such that $d_{ij} \ge 0$ and $d_{ij} = 0$ iff $x_i$ and $x_j$ coincide.” It follows from the text above that the indices $i$ and $j$ are in the range $[1,n]$. Could you please clarify what you mean by size relationship?

On line 90 of page 2, what is the meaning of t? There is no explanation and no reference for it.

It is a parameter of the Circle measure (can be any positive value) and is used as a radius of circles, as explained in lines 91-93. We also refer to the corresponding paper in line 88.

On line 364 of page 7, what is the Equation (7)?

That is a typo, thanks for noticing. It should be Equation (2) instead (Equation (7) is the same expression, but in Appendix, page 12).

Some writing needs to be more clear. For example, on lines 78-79 of page 2, the authors say that, "For generality purposes, we do not require the triangle inequality to be satisfied by dij." No more explanations. I do not know why.

We could assume that the dissimilarities $d_{ij}$ between the elements are formal distances, i.e., satisfy all the axioms of a distance measure. However, all our theoretical results work in a more general case, when we do not require the triangle inequality. Thus, we do not limit our analysis and consider a more general setup. Also, note that our formal setup with all assumptions is described in lines 224-243.
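To make the setup concrete, here is a minimal sketch (not taken from the paper) that checks the two assumptions quoted above on a dissimilarity matrix and separately reports whether the triangle inequality happens to hold. The function name `check_dissimilarities` and the example matrix are hypothetical, chosen only for illustration.

```python
import itertools
import numpy as np

def check_dissimilarities(D, atol=1e-12):
    """Check d_ij >= 0 and d_ii = 0; report (but do not require) the triangle inequality."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    nonnegative = bool((D >= 0).all())
    zero_diagonal = bool(np.allclose(np.diag(D), 0.0, atol=atol))
    # Triangle inequality: d_ik <= d_ij + d_jk for every triple (i, j, k).
    triangle = all(
        D[i, k] <= D[i, j] + D[j, k] + atol
        for i, j, k in itertools.product(range(n), repeat=3)
    )
    return {
        "nonnegative": nonnegative,
        "zero_diagonal": zero_diagonal,
        "triangle_inequality": triangle,
    }

# Squared Euclidean distances between the points 0, 1, 2 on a line:
# a valid dissimilarity in the sense above, but 4 > 1 + 1 breaks the triangle inequality.
squared = np.array([[0.0, 1.0, 4.0],
                    [1.0, 0.0, 1.0],
                    [4.0, 1.0, 0.0]])
report = check_dissimilarities(squared)
```

Squared Euclidean distance is one common dissimilarity that falls under the more general setup while failing to be a metric.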

Diversity is a scientific topic that has been studied for a long time, but the paper has only just a few references. It is recommended that the authors add references when introducing each comparative metric. Moreover, compare more diversity metrics, such as the Gini index, Coverage, etc.

If we adapt the Gini index to our setup, we get a diversity measure that is equivalent to Average. All variants of the Coverage metric that we are aware of require all elements to be embedded in some space over which we can iterate (or integrate). For instance, Coverage of a set in a space can be defined as a Maximal/Mean/Median distance to the closest point from the set, measured over all points of the space. Another definition of Coverage is that given a set of points in a space and a desired coverage radius $r$, we calculate the proportion of points of this space that are within a distance $r$ of at least one point from the set. Given only $n$ objects with pairwise distances, Coverage cannot be defined. For instance, even if texts (or graphs or other objects) are embedded in $\mathbb{R}^k$, we do not know which area of $\mathbb{R}^k$ corresponds to the space of valid embeddings, and thus cannot iterate (or integrate) over the space of valid embeddings. If you could refer us to a variant of the Coverage metric that applies to our setup or if there are any other suitable measures we will be happy to add them to our analysis.
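The radius-based definition of Coverage described above can be sketched as follows. This is only an illustration under a strong assumption: the ambient space is taken to be the unit square $[0,1]^2$ and is discretized by a regular grid. The function name `grid_coverage` and the parameter `cells_per_side` are hypothetical.

```python
import numpy as np

def grid_coverage(points, r, cells_per_side=50):
    """Fraction of grid-cell centres of [0,1]^2 lying within distance r of some point."""
    points = np.asarray(points, dtype=float)
    ticks = (np.arange(cells_per_side) + 0.5) / cells_per_side
    gx, gy = np.meshgrid(ticks, ticks)
    space = np.column_stack([gx.ravel(), gy.ravel()])  # discretized ambient space
    # Distance from every cell centre to every point of the set.
    dists = np.linalg.norm(space[:, None, :] - points[None, :, :], axis=2)
    return float((dists.min(axis=1) <= r).mean())

full = grid_coverage([[0.5, 0.5]], r=1.0)      # the whole square is within distance 1
partial = grid_coverage([[0.5, 0.5]], r=0.25)  # roughly the area of a disc of radius 0.25
```

The sketch needs the array `space`, that is, an explicit description of the region to be covered. With only an $n \times n$ matrix of pairwise distances there is nothing to iterate over, which is the obstacle the reply points out.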

Thank you for pointing out the typos; we will update the paper accordingly.

Comment

I have read the authors' response, and some of my concerns have been alleviated. I will consider adjusting my score based on the authors' revisions.

Comment

Thank you for the response! Note that we’ve updated the paper taking into account the comments of the reviewers. We describe all the changes that we’ve made in our general comment.

Comment

Dear Reviewer,

Thank you again for the response. Do you have any comments or questions regarding the updated paper?

Comment

Thank you for your revisions. After careful consideration, I will keep my current score, as I believe it reflects my honest and thorough evaluation of the work in its current form. While I appreciate the effort and thought you put into the revisions, I feel that they do not substantially change the overall impact or quality of the work. I hope you understand, and I truly value the opportunity to engage with your work. Thank you again for your hard work and dedication!

Review
Rating: 5

The paper studies how to quantify the diversity of a set of objects, such as samples in a dataset. The authors first give a thorough literature review of prior diversity measurement methods. Then, the authors propose three axioms for a reliable diversity measurement according to the literature review and find that existing methods struggle to satisfy these three axioms. Finally, the paper shows two examples that satisfy all three axioms to prove that the axioms do not contradict each other.

Strengths

  • The paper suggests the requirement for better diversity measurement, which is a valuable problem and indicates strong motivation.
  • The paper is well-structured and easy to follow.

Weaknesses

  • The paper would benefit from more extensive experiments that highlight the drawbacks of existing methods and demonstrate the necessity of the three proposed axioms. Such experiments would make the work more convincing.
  • Similar axioms have already been proposed in Leinster's work [1]. It would be helpful to clarify the differences between the axioms in Leinster's work and those presented in this paper.
  • Section 3 includes some overly intuitive descriptions, and the prerequisites for the drawbacks of existing methods are too strict, e.g.:
    1. In Lines 126-128, the situation where a dataset appears more diverse from a human perspective but is less diverse in actual measurements may be strongly related to the embedding used for computing distances or similarities. This could indicate an issue with the embedding rather than the diversity measurement method. Additionally, such a case seems relatively rare, and it might be reasonable to conduct experiments to verify its occurrence.
    2. The description of the limitations of the existing method includes very stringent prerequisites (e.g., requiring 16 points to be located at the four corners of a unit square, as mentioned on line 142), which are difficult to justify as generally applicable. Additionally, the positions of these points in the feature space are closely related to the embedding computation.

Reference: [1] Leinster T, Cobbold C A. Measuring diversity: the importance of species similarity. Ecology, 2012, 93(3): 477-489.

Questions

It appears that the Vendi Score satisfies the second axiom (Uniqueness), as it meets the property of identical elements [2]. The underlying logic of identical elements seems to be similar to that of the second axiom in this paper.

Reference: [2] Friedman D, Dieng A B. The Vendi Score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022.

Comment

Thank you for the review and constructive feedback!

The paper would benefit from more extensive experiments that highlight the drawbacks of existing methods and demonstrate the necessity of the three proposed axioms. Such experiments would make the work more convincing.

Note that the proposed properties come with explanations of why they are required and with examples of measures that have undesirable behavior when not satisfying some of them. Also, for each previous diversity measure we show its drawbacks with empirical examples (Section 3). Could you please specify what kind of additional experiments you would find convincing?

Similar axioms have already been proposed in Leinster's work [1]. It would be helpful to clarify the differences between the axioms in Leinster's work and those presented in this paper.

Thanks for this reference; we will add a discussion of it to our paper.

Leinster et al. propose a measure of diversity and discuss its properties.

Let us consider the diversity of order $q$ in Leinster et al. (Equation (1)). To apply this measure to our setting, we set all $p_i$ equal to $1/n$, so the diversity of order $q$ becomes $\left(\sum_{i=1}^n \left(\sum_{j=1}^n s_{ij}\right)^{q-1}\right)^{\frac{1}{1-q}}$ (up to a constant multiplier), where $s_{ij}$ is a similarity score and $q$ is some constant not equal to $1$ or $\infty$. It is easy to prove that the diversity of order $q$ is continuous and monotonic but does not have the uniqueness property. We can provide the complete proof if required, and we can also add this measure to our list of diversity measures in the paper.
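The expression above can be evaluated directly. The sketch below assumes the similarity-sensitive diversity of order $q$ from Leinster & Cobbold (2012) in the form $\left(\sum_i p_i (Zp)_i^{q-1}\right)^{1/(1-q)}$ with uniform weights $p_i = 1/n$, so it keeps the constant multiplier that the formula above drops. The function name `similarity_diversity` is hypothetical.

```python
import numpy as np

def similarity_diversity(S, q):
    """Diversity of order q for a similarity matrix S with uniform weights p_i = 1/n."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    if q == 1 or np.isinf(q):
        raise ValueError("this sketch covers only q not equal to 1 or infinity")
    p = np.full(n, 1.0 / n)
    ordinariness = S @ p  # (Zp)_i = (1/n) * sum_j s_ij
    return float(np.sum(p * ordinariness ** (q - 1)) ** (1.0 / (1.0 - q)))

# Four objects with no similarity to each other: the value equals the number of objects.
distinct = similarity_diversity(np.eye(4), q=2)
# Four objects that are all fully similar: the value collapses to 1.
identical = similarity_diversity(np.ones((4, 4)), q=2)
```

These two extreme cases are only sanity checks of the formula. They do not by themselves show which of the three axioms the measure satisfies.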

Let us now discuss the properties listed in Leinster et al. Partitioning properties do not apply to our case since we consider the diversity only for a fixed number of objects. For Elementary properties: Symmetry corresponds to our requirement that the diversity function be permutation invariant, and the Absent species and Identical species properties are not applicable in our case (since we consider $n$ objects with equal weight and not $n$ probabilities summing to $1$). From the list of three properties in the subsection Effect of species similarity on diversity, the only property applicable in our case is Monotonicity, which is equivalent to our Monotonicity axiom. We will add this discussion to Section 4.2.

In Lines 126-128, the situation where a dataset appears more diverse from a human perspective but is less diverse in actual measurements may be strongly related to the embedding used for computing distances or similarities. This could indicate an issue with the embedding rather than the diversity measurement method. Additionally, such a case seems relatively rare, and it might be reasonable to conduct experiments to verify its occurrence.

Let us provide an intuitive example showing that, even with good embeddings, problems can be caused by the diversity measure itself. For instance, in the context of LLM generation, one can measure diversity as follows. Given several texts from an LLM as answers to the same prompt, we embed each text as a vector in some space (for instance, using a BERT-like model) and define the distance or similarity in this space (for instance, cosine similarity). After that, we can treat this collection of texts as a collection of points with pairwise distances, and measure its diversity using any measure discussed in the paper. Now, our examples in Section 3 can be interpreted in terms of LLM-generated texts. Consider, for example, the illustration in lines 162-170 of the paper. The left figure may correspond to a situation where all 16 texts are on the same topic but are slightly different. In the second case, there are 15 texts that cover different topics but there is one text that coincides (or almost coincides) with one of the 15. Then, some of the measures (Energy, Bottleneck) rank the first case higher while it is clearly less diverse. Note that such an example is valid even if we have a good embedder.
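The effect described above can be reproduced on toy data. The sketch below assumes that Bottleneck is the minimum pairwise distance and Average is the mean pairwise distance. The 2D coordinates are hypothetical stand-ins for text embeddings, not the configurations from the paper, and Energy is omitted because its formula is not given in this thread.

```python
import numpy as np

def pairwise_values(X):
    """Euclidean distances over all unordered pairs of rows of X."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return D[np.triu_indices(len(X), k=1)]

def bottleneck(X):
    return float(pairwise_values(X).min())  # assumed definition: minimum pairwise distance

def average(X):
    return float(pairwise_values(X).mean())  # mean pairwise distance

grid = np.array([(a, b) for a in range(4) for b in range(4)], dtype=float)
same_topic = 0.01 * grid   # 16 slightly different points packed into a tiny region
varied = grid / 3.0        # 16 points spread over the unit square ...
varied[-1] = varied[0]     # ... with the last one replaced by a duplicate of the first

scores = {
    "same_topic": (bottleneck(same_topic), average(same_topic)),
    "varied": (bottleneck(varied), average(varied)),
}
```

Under these assumptions the single duplicate drives Bottleneck to zero for the spread-out collection, so Bottleneck prefers the tight cluster, while Average prefers the spread-out collection.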

Comment

The description of the limitations of the existing method includes very stringent prerequisites (e.g., requiring 16 points to be located at the four corners of a unit square, as mentioned on line 142), which are difficult to justify as generally applicable. Additionally, the positions of these points in the feature space are closely related to the embedding computation.

As we write in line 144, the left configuration in this figure (16 points located at the four corners of a unit square) is the configuration that we get when optimizing Average for 16 points in the square. We expect this problem to be general and apply to more complex spaces: optimizing the average pairwise distance pushes the elements to the boundary of the region while keeping central areas empty.
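A quick numerical check of this claim, as a sketch: the corner configuration places 4 coinciding points at each corner of the unit square, as described above, while the evenly spaced 4 x 4 grid used for comparison is an assumption made here and is not necessarily the right-hand configuration in the paper's figure.

```python
import numpy as np

def average_pairwise_distance(X):
    """Mean Euclidean distance over all unordered pairs of rows of X."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return float(D[np.triu_indices(len(X), k=1)].mean())

# 16 points: 4 copies of each corner of the unit square.
corners = np.repeat(
    np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]), 4, axis=0
)
# 16 points: evenly spaced 4 x 4 grid covering the same square.
ticks = np.linspace(0.0, 1.0, 4)
grid = np.array([(a, b) for a in ticks for b in ticks])

avg_corners = average_pairwise_distance(corners)
avg_grid = average_pairwise_distance(grid)
```

For the corner configuration the value can be computed by hand: of the 120 pairs, 24 coincide, 64 are at distance 1 and 32 are at distance $\sqrt{2}$, giving $(64 + 32\sqrt{2})/120 \approx 0.91$. The grid scores lower on Average even though it covers the square far more evenly.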

It appears that the Vendi Score satisfies the second axiom (Uniqueness), as it meets the property of identical elements [2]. The underlying logic of identical elements seems to be similar to that of the second axiom in this paper.

Vendi Score does not have the uniqueness property as we prove in our paper (lines 575-585 in Appendix). The authors of the Vendi Score paper prove that this score does satisfy the identical elements property but this property differs from our uniqueness property. Identical elements property concerns combining probabilities of two identical objects, while our paper considers $n$ objects with equal weight (not $n$ probabilities summing to $1$).

Comment

Dear Reviewer,

Thank you again for your constructive feedback! We would like to know if our response addresses your concerns. Please note that we’ve also updated the paper taking into account the feedback from all the reviewers. In particular, we added the discussion of the properties from Leinster & Cobbold (2012) and added their measure to our analysis. Please see our general response for the list of modifications made. We hope that the updated paper and our clarifications address your concerns and will be happy to answer any further questions.

Review
Rating: 5

Summary:

This paper focuses on the concept of diversity and how to quantify diversity for a set of objects. The authors first systematically review prior studies in measuring diversity, showcasing that these diversity measures behave undesirably in some cases. Then, the paper proposes three desirable axioms of a reliable diversity measure: monotonicity, uniqueness, and continuity, followed by two examples of diversity measure having the desirable properties. However, the proposed measures are computationally expensive, preventing a practical use case.

Strengths

  1. The paper studies an interesting problem, measuring diversity of a set of objects, which can be broadly applied in various problems, e.g., image or molecule generation, recommender systems.

  2. A thorough analysis of existing diversity measures is provided, offering insights into the behavior of each measure.

  3. The three proposed diversity measuring axioms are easy to understand.

Weaknesses

While the paper has some merits, it suffers from a key limitation in the method itself. The proposed diversity measures are NP-hard, making them computationally expensive as the authors acknowledge. Additionally, the lack of experimental results comparing the proposed method with prior studies raises questions about the effectiveness of the proposed method. Making the proposed method computationally tractable and including comparative experiments, therefore, would significantly strengthen the proposed approach.

Questions

Please see the weaknesses.

Comment

Thank you for your feedback!

While the paper has some merits, it suffers from a key limitation in the method itself. The proposed diversity measures are NP-hard, making them computationally expensive as the authors acknowledge.

The main goal of discussing MultiDimVolume and IntegralMaxClique is to show that the proposed properties (axioms) do not contradict each other. Thus, they rather serve as a theoretical contribution and we do not suggest using them in practice due to their complexity.

Additionally, the lack of experimental results comparing the proposed method with prior studies raises questions about the effectiveness of the proposed method. Making the proposed method computationally tractable and including comparative experiments, therefore, would significantly strengthen the proposed approach.

First, as we discuss above, our paper does not propose any method.

However, during the discussion period, we conducted additional experiments comparing MultiDimVolume and IntegralMaxClique to existing measures using the simple examples from Section 3. The results are the following:

  • The example in lines 138-148: for the left configuration MDV = 3.66, IMC = 7.41; for the right configuration MDV = 12.97, IMC = 35.67.
  • The example in lines 149-156: for the left configuration MDV = 1.41, IMC = 2.00; for the right configuration MDV = 12.97, IMC = 35.67.
  • The example in lines 162-170: for the left configuration MDV = 4.32, IMC = 3.96; for the right configuration MDV = 11.90, IMC = 30.35.
  • The example in lines 172-178: for the left configuration MDV = 0.1, IMC = 0.01; for the right configuration MDV = 4.46, IMC = 5.25.

(Here we slightly modified the formula of MultiDimVolume by raising each summand to the power of $\frac{2}{k(k-1)}$. This does not affect our theoretical analysis or the empirical comparisons above but makes the expression more natural to use in practice; see our discussion with Reviewer 1pux.)

In all these examples, the right configuration is more diverse according to both measures, as desired. Thus, these measures do not have the drawbacks that we found for other measures.

Comment

I thank the authors for their response, which provides additional context for the paper. I acknowledge the theoretical aspect of this work.

As the paper title suggests, the goal is to measure diversity. However, the proposed method is NP-hard, making it difficult to apply to measuring the diversity of a set of objects. While additional experiments were provided during the discussion phase, they are based on only 6 examples, which might not reflect the intricate features of each diversity measure. Without rigorous experiments, it is difficult to understand the inner workings of the proposed method as well as its effectiveness compared to prior studies.

Thus, I decide to keep my original score.

Comment

Thank you for the response! Note that we’ve updated the paper taking into account the comments of the reviewers. We describe all the changes that we’ve made in our general comment. We would also like to mention that our title “Measuring Diversity: Axioms and Challenges” specifies that our work is about axioms for diversity measures and challenges of measuring diversity, thus we believe that it agrees well with the content of the paper.

Review
Rating: 5

The paper focuses on diversity evaluation, i.e., how to quantify diversity in a collection of objects. The authors first criticize existing diversity measures by claiming that they may lead either to unexpected evaluation results or to degenerate solutions when optimized. To bridge the gap, they define three properties associated with a good diversity measure: monotonicity, uniqueness, and continuity.

Lastly, by analyzing two diversity measures that satisfy all three properties but are NP-hard, the authors pose the challenge of developing diversity measures that satisfy all three properties while being computationally manageable.

Strengths

  1. The proposed ideas are well justified. For example, the authors use a concrete counter-example in Appendix A to argue the necessity for the diversity function to be continuous.

  2. The paper is presented in a systematic manner and it is easy to follow the claims and analysis conducted by the authors.

Weaknesses

  1. Analysis of existing diversity measures is limited to simple cases. It would be better to examine whether such analysis holds for more practical scenarios, such as diversity metrics in the recommendation task or NLG generations.

  2. No explanation provided to "why a measure satisfying all three axioms can effectively evaluate diversity or lead to good performance while being optimized."

  3. The contribution of the paper is somewhat limited, as it primarily critiques existing measures and introduces a new problem without offering concrete solutions to address it.

Questions

  1. How do you concretely define "our intuitive perception of diversity"? I think you should discuss in the context of several specific tasks. For example, how do you define diversity in the context of generations from LLMs? What are the intuitions regarding diversity in such a context?
Comment

Thank you for the feedback!

I think you should discuss in the context of several specific tasks. For example, how do you define diversity in the context of generations from LLMs? What are the intuitions regarding diversity in such a context?

In the context of LLM generation, one can define diversity in the following way. Given several texts from an LLM as answers to the same prompt, we want to measure how diverse the output is. We embed each text as a vector in some space (for instance, using a BERT-like model) and define the distance or similarity in this space (for instance, cosine similarity). After that, we can treat this collection of texts as a collection of points with pairwise distances, and measure its diversity using any measure mentioned in Section 2 of our paper, or using MultiDimVolume, or IntegralMaxClique.
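The pipeline described above can be sketched as follows, assuming the embeddings are already available as rows of an array (no actual embedding model is called here) and using cosine distance, defined as one minus cosine similarity. The function names and the toy vectors are hypothetical, and Average (mean pairwise distance) is used only as one example of a measure that could be plugged in at the last step.

```python
import numpy as np

def cosine_distance_matrix(E):
    """Pairwise cosine distances (1 - cosine similarity) between the rows of E."""
    E = np.asarray(E, dtype=float)
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    D = 1.0 - unit @ unit.T
    np.fill_diagonal(D, 0.0)       # remove rounding noise on the diagonal
    return np.clip(D, 0.0, 2.0)    # cosine distance lies in [0, 2]

def average_pairwise(D):
    """Average measure: mean of the dissimilarities over unordered pairs."""
    return float(D[np.triu_indices(D.shape[0], k=1)].mean())

# Hypothetical embeddings of three generated texts; the third points in the
# same direction as the first, so their cosine distance is zero.
embeddings = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [2.0, 0.0, 0.0]])
D = cosine_distance_matrix(embeddings)
diversity = average_pairwise(D)
```

Only the matrix `D` is passed to the diversity measure, which matches the setting of $n$ objects with pairwise dissimilarities.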

Analysis of existing diversity measures is limited to simple cases. It would be better to examine whether such analysis holds for more practical scenarios, such as diversity metrics in the recommendation task or NLG generations.

Let us show how our examples in Section 3 relate to more practical scenarios. As described above, texts generated by an LLM can be represented as points in some space. Consider, for example, an illustration in lines 162-170 of the paper. Then, the left figure may correspond to a situation when all 16 texts are on the same topic but are slightly different. In the second case, there are 15 texts that cover different topics but there is one more text that coincides (or almost coincides) with one of the 15. Then, some of the measures (Energy, Bottleneck) rank the first case higher while it is clearly less diverse.

No explanation provided to "why a measure satisfying all three axioms can effectively evaluate diversity or lead to good performance while being optimized."

Note that we do not claim that any measure satisfying all three axioms is necessarily good. We claim that it is necessary for a reliable measure to have these properties but not necessarily sufficient. However, we do show that having a measure that satisfies these three axioms is already a challenging task.

How do you concretely define "our intuitive perception of diversity"?

This is the main research question of the paper - how to formalize our intuitive perception of diversity. In Section 3, we provide some examples of how we expect a good diversity measure to behave, and in Section 4.2 we formalize our intuition via three desirable properties. These are the properties that we expect a general diversity measure to satisfy (we show that measures that do not have some of the properties may have clearly unwanted behavior).

Comment

Dear Reviewer,

Thank you again for your feedback and comments. We would like to know if our response addresses your concerns. Please note that we’ve also updated the paper taking into account the feedback from all the reviewers. Please see our general response for the list of modifications made. We hope that the updated paper and our clarifications address your concerns and will be happy to answer any further questions.

Review
Rating: 8

This paper presents a systematic examination of diversity quantification across different use cases and fields. The authors identify that existing diversity metrics fail to simultaneously satisfy three fundamental axioms: monotonicity, uniqueness and continuity. The authors then propose two theoretical measures, MultiDimVolume and IntegralMaxClique, which satisfy these axioms. This work makes a contribution by establishing formal axioms for diversity measurement and framing a crucial open problem: developing computationally efficient diversity measures that satisfy all three fundamental axioms.

Strengths

The theoretical rigor of this paper is commendable.

Weaknesses

The main weakness is that both measures which are proposed in the paper prove computationally intractable due to their NP-hard nature.

Questions

The paper might benefit from more rigor in addressing computational concerns earlier, particularly in a section discussing the limitations of applying these measures at scale. A discussion around the theoretical complexity of diversity measurement or potential computational techniques to approximate them would have provided a more balanced contribution.

Comment

Thank you for your review and positive evaluation of our work. Regarding computational complexity, we agree that it is the main issue with MultiDimVolume and IntegralMaxClique. The main goal of discussing these two measures was to show that the proposed properties (axioms) do not contradict each other and thus we originally did not evaluate how MultiDimVolume and IntegralMaxClique perform in practice. However, as Reviewer 1pux mentions, such measures can still be used when the size of the output is small. Motivated by that, we also conducted a simple experiment and analyzed how MultiDimVolume and IntegralMaxClique rank the examples in Section 3. We see that in all the cases the ranking agrees with our expectations.

Review
Rating: 5

The paper discusses some theoretical aspects of various diversity measures. It suggests that common diversity measures, such as Vendi Score and Determinantal Point Process scores, are optimized for computational simplicity rather than axiomatic optimality. They have potential drawbacks, breaking some intuitive properties such as monotonicity and uniqueness. The paper then proposes two measures, MultiDimVolume and IntegralMaxClique, which preserve all properties despite being NP-hard to compute.

Strengths

Originality: The paper provides a systematic overview of commonly used diversity measures and gives concrete examples where these measures fail. The overview lays a good foundation for the proposed methods.

Clarity: The proposed ideas are clearly presented and their connections to prior limitations are straightforward.

Weaknesses

While I appreciate the unique angles that the paper takes to introduce the new NP-hard diversity measures, the measures themselves do not appear totally novel to me. They appear closely related to hypervolume-based multiobjective optimization, dating back to 2012. Additional discussions are needed to clarify the connections to existing work and to refresh the claims of contributions.

Another weakness is a lack of experiments. Contrary to the author's conclusion, an NP-hard objective can be practical if the number of candidates are few. For example, it is possible to compute the MultiDimVolume of Top-K recommended items from a recommender system, provided that K is in the range of 100-1000. For an ICLR contribution, I would expect to see some empirical validation of the proposed methods.

Lastly, the discussion of DPP objective can be strengthened. It is unclear to me whether the violation of the monotonicity property can be a result of improper normalization or is it a fundamental flaw of the DPP objective. More discussions about the construction of the K-matrices (Line 210) would be helpful.

Questions

Regarding the comparison with hypervolume-based multiobjective optimization. Can the authors:

  1. Clarify connections to existing work
  2. Revise claim of novelty with respect to prior work

Regarding empirical evaluation. Can the authors validate an empirical comparison between the proposed methods and DPP or Vendi Scores. Here are some examples of empirical papers on the topic of diversity:

Regarding DPP discussion, please provide the sample points that are used to create the K matrices on Line 210.

  • Are the sample points themselves violating monotonicity?
  • What normalization steps are commonly used for DPP calculation?
  • Would diagonal regularization (adding λ times the identity matrix) alleviate some of the drawbacks?
Comment

Thank you for the detailed review! We address the concerns below.

Regarding the comparison with hypervolume-based multiobjective optimization. Can the authors:

  1. Clarify connections to existing work
  2. Revise claim of novelty with respect to prior work

We think that this question refers to (Auger et al., 2012). This paper concerns getting an optimal distribution of several points maximizing the (weighted) hypervolume indicator. The resulting distribution is intuitively diverse (for instance, if all points coincide, the indicator is definitely not optimized), but there was no specific goal to maximize the diversity of these points and no definition of diversity was given. The authors do not claim that maximizing the hypervolume indicator also maximizes the diversity of the set of points. If you could give us (or refer to) a specific diversity function, we will investigate it.

Another weakness is a lack of experiments. … For an ICLR contribution, I would expect to see some empirical validation of the proposed methods

We consider our main contribution to be theoretical and we believe that such theoretical research fits the scope of ICLR since diversity is a concept that is widely used in machine learning. The main goal of discussing MultiDimVolume and IntegralMaxClique is to show that the proposed properties (axioms) do not contradict each other. However, following your suggestion, we do conduct a comparison of these measures with the existing ones (see below).

Contrary to the author's conclusion, an NP-hard objective can be practical if the number of candidates are few. For example, it is possible to compute the MultiDimVolume of Top-K recommended items from a recommender system, provided that K is in the range of 100-1000.

Thank you for this comment. We agree that NP-hard measures can indeed be applied to measure diversity when the output is small; we will discuss this in the paper. Because of that, we also plan to slightly modify the formula of MultiDimVolume by raising each summand to the power of $\frac{2}{k(k-1)}$. This does not change our theoretical analysis or the properties of this measure, but it makes the formula more natural since each summand is a product of $k(k-1)/2$ distances.
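To make the effect of this exponent concrete, here is a short worked form. It assumes only what the response states, namely that the summand for a $k$-element subset $S$ is a product of the $k(k-1)/2$ pairwise distances within $S$. The symbol $P_S$ is introduced here purely for illustration and does not reproduce the full MultiDimVolume formula from the paper.

```latex
P_S = \prod_{\substack{i, j \in S \\ i < j}} d_{ij},
\qquad
P_S^{\frac{2}{k(k-1)}}
  = \Bigl( \prod_{\substack{i, j \in S \\ i < j}} d_{ij} \Bigr)^{1 / \binom{k}{2}}
```

Under this assumption, the rescaled summand is the geometric mean of the pairwise distances in $S$, so it stays on the scale of a single distance regardless of $k$.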

Regarding empirical evaluation. Can the authors validate an empirical comparison between the proposed methods and DPP or Vendi Scores. Here are some examples of empirical papers on the topic of diversity …

It is hard to empirically compare two diversity measures since usually such measures are used to validate the results of some algorithms, and validating validation measures can be challenging. Thank you for providing relevant references. After considering their experimental setup, we see that an approach that can be applied in our case is a human evaluation that is used to decide which result is more diverse (as done in Carbonell et al.). Since our measures and analysis are general (domain-agnostic) it is natural to analyze how new measures perform for elements distributed in some space since in most applications objects are described by their vector representations and diversity is then computed for such representations. As a first step in this direction, we consider the examples from Section 3 and verify whether MultiDimVolume and IntegralMaxClique behave as desired in these examples. The results are the following (we use MultiDimVolume with a modification described above since this version seems to be more natural, but the conclusions for the original version are the same in all the cases below):

  • The example in lines 138-148: for the left configuration MDV = 3.66, IMC = 7.41; for the right configuration MDV = 12.97, IMC = 35.67.
  • The example in lines 149-156: for the left configuration MDV = 1.41, IMC = 2.00; for the right configuration MDV = 12.97, IMC = 35.67.
  • The example in lines 162-170: for the left configuration MDV = 4.32, IMC = 3.96; for the right configuration MDV = 11.90, IMC = 30.35.
  • The example in lines 172-178: for the left configuration MDV = 0.1, IMC = 0.01; for the right configuration MDV = 4.46, IMC = 5.25.

In all these examples, the right configuration is more diverse according to both measures, as desired. Thus, these measures do not have the drawbacks that we found for other measures. If you have any other examples for which we should test our measures, we will be happy to conduct such experiments.

Comment

I thank the authors for their responses. I agree that the proposed MultiDimVolume has nothing to do with HyperVolume Indicator. However, does it really make sense to have the product of all pairwise distances as a diversity measure? Say, if I have three points that align on the same line, I would have pairwise distances of 2, 3, and 5. However, they form a linear relationship and having the positions of the two end points would greatly shorten the description length of the middle point. This violates my intuition that diversity is commonly a reflection of the amount of information being encoded.

On the other hand, the proposed IntegralMaxClique method could be a good metric. However, it also appears very different from the multiplicative nature of the first proposed metric (MultiDimVolume). This begs my question whether the proposed axioms are complete. If they were complete, the corresponding metrics should be more similar than their current forms.

Lastly, the authors could benefit from reorganizing the paper so that the proposed algorithms are defined before related work. This would give readers a clear overview of what is being discussed, rather than let them guess until the last few pages.

Comment

Thank you for your involvement in the discussion!

However, does it really make sense to have the product of all pairwise distances as a diversity measure? Say, if I have three points that align on the same line, I would have pairwise distances of 2, 3, and 5. However, they form a linear relationship and having the positions of the two end points would greatly shorten the description length of the middle point. This violates my intuition that diversity is commonly a reflection of the amount of information being encoded.

Our notion of diversity is defined in terms of pairwise distances and is not directly aligned with an information-based interpretation. For instance, we can consider three points that do not lie on the same line but are located very close to each other. They may require a large amount of information for encoding, while we do not consider such a set to be diverse since all the elements are similar (close).

In your example with points lying on a line, we note that if we move one of the points away from this line, then diversity will be increased. In other words, if one wants a larger diversity, it is profitable to use all the available dimensions.
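The point about leaving the line can be checked on the reviewer's own numbers. The sketch below uses the plain product of pairwise distances, which is the reviewer's simplified description and not the full MultiDimVolume formula. The coordinates are hypothetical and chosen only to realize the distances 2, 3, and 5.

```python
import math
from itertools import combinations


def pairwise_distance_product(points):
    """Product of Euclidean distances over all unordered pairs of points."""
    product = 1.0
    for p, q in combinations(points, 2):
        product *= math.dist(p, q)
    return product


# Hypothetical coordinates realizing the distances 2, 3, and 5 on a line.
collinear = [(0.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
# Same end points, with the middle point lifted off the line by one unit.
lifted = [(0.0, 0.0), (2.0, 1.0), (5.0, 0.0)]

collinear_product = pairwise_distance_product(collinear)
lifted_product = pairwise_distance_product(lifted)
print(collinear_product, lifted_product)
```

Lifting the middle point increases both of its distances to the end points while leaving the third distance unchanged, so the product grows from 30 to about 35.36.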

On the other hand, the proposed IntegralMaxClique method could be a good metric. However, it also appears very different from the multiplicative nature of the first proposed metric (MultiDimVolume). This begs my question whether the proposed axioms are complete. If they were complete, the corresponding metrics should be more similar than their current forms.

Our work does not guarantee that the set of axioms is complete and your comment that the two constructed measures seem to be very different is indeed valid. We do not guarantee that any measure satisfying all three axioms is necessarily good. It can be the case that some additional properties are desirable for a good measure of diversity (for instance, for specific applications there can be specific requirements). However, we could not come up with more properties that are undoubtedly desirable for a general measure. Also, we do observe that none of the existing measures has all the properties and thus even this set of axioms is hard to satisfy.

Note that we’ve updated the paper taking into account the comments of the reviewers. We describe all the changes that we’ve made in our general comment. In particular, we restructured Section 4 by moving the subsection “Desirable properties in previous works” to the end of the section, so that we define our properties before discussing how they relate to the literature.

Comment

Dear Reviewer,

Thank you for your feedback and involvement in the discussion. Please note that we’ve updated the paper taking into account the feedback from all the reviewers. Please see our general response for the list of modifications made. We hope that the updated paper and our clarifications address your concerns and will be happy to answer any further questions.

Comment

Regarding DPP discussion, please provide the sample points that are used to create the K matrices on Line 210.

For the matrix $K_1$, consider three points A, B, C on a unit 2D sphere with pairwise spherical distances between A and B equal to arccos(0.6) = 0.927, between B and C equal to arccos(0.7) = 0.795, and between A and C equal to arccos(0.2) = 1.369. The similarity is given by the cosine function. For the matrix $K_2$, we decrease the distance between A and C from arccos(0.2) = 1.369 to arccos(0.3) = 1.266, while keeping the distance between A and B and the distance between B and C unchanged.
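The two determinants can be recomputed directly from these numbers. This is a minimal sketch, assuming the kernel is the Gram matrix of cosine similarities with a unit diagonal, as the description above suggests. The helper names `det3` and `cosine_kernel` are illustrative and not taken from the paper.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists (cofactor expansion)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def cosine_kernel(s_ab, s_bc, s_ac):
    """Gram matrix of cosine similarities for points A, B, C (unit diagonal)."""
    return [
        [1.0, s_ab, s_ac],
        [s_ab, 1.0, s_bc],
        [s_ac, s_bc, 1.0],
    ]


# K1: cos(A,B) = 0.6, cos(B,C) = 0.7, cos(A,C) = 0.2.
K1 = cosine_kernel(0.6, 0.7, 0.2)
# K2: A and C moved closer together, so cos(A,C) rises to 0.3.
K2 = cosine_kernel(0.6, 0.7, 0.3)

det_K1 = det3(K1)
det_K2 = det3(K2)
print(det_K1, det_K2)
```

Under this assumption, the determinant rises from about 0.278 to about 0.312 even though A and C moved closer together while the other two distances stayed fixed.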

Are the sample points themselves violating monotonicity?

As discussed above, the points themselves violate monotonicity.

What normalization steps are commonly used for DPP calculation?

Would diagonal regularization (adding an lambda x Identity matrix) alleviate some of the drawbacks?

The informal intuition regarding the non-monotonicity of DPP is that the determinant is not a monotonically decreasing function of the non-diagonal matrix coefficients.

To the best of our knowledge, in the general DPP calculation setup, the only transformation of a matrix that is usually used before calculating the determinant is indeed a diagonal regularization. For small lambdas, the diagonal regularization won’t solve the non-monotonicity problem. For large lambdas the determinant will be approximately equal to $(1+\lambda)^n - (1+\lambda)^{n-2} \sum_{i>j} s_{ij}^2$, which is indeed monotone. But now it degenerates to some version of the Average measure.
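Both regimes can be illustrated on the three-point example with cosine similarities 0.6, 0.7 and 0.2 (raised to 0.3). This is a sketch under the assumption that regularization means adding $\lambda$ to each diagonal entry of the cosine-similarity matrix. The values $\lambda = 0.1$ and $\lambda = 10$ are arbitrary illustrative choices, and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists (cofactor expansion)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def regularized_det(s_ab, s_bc, s_ac, lam):
    """det(K + lam * I) for the 3x3 cosine-similarity matrix K."""
    diag = 1.0 + lam
    return det3([
        [diag, s_ab, s_ac],
        [s_ab, diag, s_bc],
        [s_ac, s_bc, diag],
    ])


def large_lambda_approx(s_ab, s_bc, s_ac, lam):
    """(1 + lam)^n - (1 + lam)^(n - 2) * sum of squared similarities, n = 3."""
    diag = 1.0 + lam
    return diag ** 3 - diag * (s_ab ** 2 + s_bc ** 2 + s_ac ** 2)


# Change in the determinant when cos(A,C) rises from 0.2 to 0.3.
gap_small = regularized_det(0.6, 0.7, 0.3, 0.1) - regularized_det(0.6, 0.7, 0.2, 0.1)
gap_large = regularized_det(0.6, 0.7, 0.3, 10.0) - regularized_det(0.6, 0.7, 0.2, 10.0)
approx_error = abs(
    regularized_det(0.6, 0.7, 0.2, 10.0) - large_lambda_approx(0.6, 0.7, 0.2, 10.0)
)
print(gap_small, gap_large, approx_error)
```

With the small value the determinant still increases when A and C move closer, while with the large value it decreases, and the approximation differs from the exact determinant only by the lower-order term $2 s_{AB} s_{BC} s_{AC}$.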

Comment

Dear Reviewers,

In our responses, we did our best to address the raised questions and concerns. Could you please tell us if you have any additional questions or comments? This will help us to properly address the raised concerns and update the paper according to your feedback. Thank you!

Sincerely, Authors

Comment

We would like to thank the reviewers for their valuable feedback and suggestions. Following our discussion, we’ve updated the paper. The following changes have been made:

  • We added the measure Species(q) from Leinster & Cobbold (2012) to Table 1 and Table 2. This measure is introduced in lines 109-113. The example for its undesirable behavior w.r.t. comparison is given in lines 233-235. The properties are analyzed in lines 670-676.
  • We added the discussion of the properties from Leinster & Cobbold (2012) in lines 359-366.
  • We moved the subsection “Desirable properties in previous works” to the end of Section 4, so that we define our properties before that.
  • We added the discussion of possible variations of MultiDimVolume in lines 396-404.
  • We added the analysis on how the measures constructed in Section 5 behave for the synthetic examples in Section 3: lines 427-431 in the main text and Appendix G.
  • We mention that NP-hard measures can still be used when the number of elements is small in lines 472-479.
  • We extended the proof for monotonicity counterexample for DPP in lines 646-652.
  • We fixed some typos in the text.

AC Meta-Review

The paper discusses some theoretical aspects of various diversity measures. Common diversity measures, e.g., the Vendi Score and Determinantal Point Process scores, are optimised for computational simplicity rather than axiomatic optimality. They have potential drawbacks that break some intuitive properties. In this paper, the authors propose two new measures, i.e., MultiDimVolume and IntegralMaxClique, which preserve all properties, despite being NP-hard to compute.

Overall, this paper introduces some new ideas about diversity measures. However, there is no empirical analysis comparing the proposed measures with previous studies. Moreover, the proposed diversity measures are NP-hard, making them computationally expensive to apply.

Additional Comments on Reviewer Discussion

In the rebuttal, the authors have discussed the connections between this work and existing works, the novelty of the proposed method, and other questions raised in the review comments. Some of the reviewers' concerns appear to have been addressed. However, their concerns regarding the empirical analysis remain.

Final Decision

Reject