Computational Explorations of Total Variation Distance
摘要
评审与讨论
This paper studies two fundamental questions about high-dimensional probability distributions.
- Is there a polynomial-time algorithm which can decide whether two mixtures of product distributions on the hypercube are equal to each other?
- Is there a polynomial-time algorithm which can accurately approximate the total variation distance between two Ising model distributions?
The paper answers (1) affirmatively, giving a very nice linear algebraic algorithm which can test equivalence between mixtures of product distributions on the cube, given access to the parameters of the distributions. The paper answers (2) negatively, showing that unless NP is contained in RP (that is, unless NP-complete problems have randomized polynomial time algorithms), the total variation distance between Ising models is inapproximable.
Strengths: The paper studies truly fundamental problems which will be of broad appeal to the ICLR audience. The results are convincing, and the equivalence-checking algorithm is very nice.
Weaknesses: These problems are so fundamental that it is a little hard to believe they are not already well studied. It would be nice if the related work section were expanded. It would be nice if the algorithm also meant you can test equivalence given samples.
Overall, I recommend accepting the paper to ICLR, on the basis that the results are about a very fundamental set of problems, and the equivalence testing algorithm is very clean.
Questions:
- Does result (1) mean you can also do equivalence testing from samples?
- Is the linear algebraic method you describe similar to any existing equivalence tester in the literature?
优点
see above
缺点
see above
问题
see above
Dear Reviewer,
Thank you for taking the time to review our paper and your constructive feedback.
We address your questions below.
- This is an interesting question. One approach will be to first learn the distributions from samples and then apply our algorithm. However, such an approach will not give an exact procedure, as learning from samples will necessarily incur some error.
- We are not aware of other equivalence checkers in the literature that use ideas similar to that of our first result.
The paper studies some properties about the complexity of computing the Total Variation Distance between distributions. The authors consider the case of the mixure of product distributions and the case of Ising models. In the first case they show a polytime algo that has access to the marginals and checks the equivalnece of two mixtures. For the second case they show hardness of approximating TVD by an FPRAS under the hypothesis NP not included in RP. The two results are sound and well written. However, the contribution is limited. I think neither of the two results are very significant alone and together the overall contribution is not very much more.
优点
The paper give two new results about computing TVD. The polynomial time algorithm for TVD of mixtures of product distribution has its main strength in the simplicity and clarity of the approach. The second result extends the studies on the complexity of dealing with Ising models.
缺点
The paper appears to be a gluing of two minor results with little connection between them. The second result builds upoon the Jerrum and Sincler previous analogous study. The first result bears more novelty, although it would not be, in my opinion, sufficient for a paper at ICLR. I think the main issue is with the specificity (very particular cases) of the two problems solved.
问题
The paper mentions some open questions but is there any chance that the techniques used, e.g., in the first theorem can be extended to other types of more general distributions? (I personally doubt it)
Dear Reviewer,
Thank you for taking the time to review our paper and your constructive feedback.
We are not sure whether the techniques used in the first theorem can be extended to other types of more general distributions. In fact, we pose it as an open problem for the mixtures of Bayes nets (with bounded treewidth). Please note that mixtures of product distributions are very general and natural. For instance, it is true that every distribution can be described by some (possibly large) mixture of product distributions. We have added this observation to our appendix (see Appendix A).
I have read the authors replies to all comments and reviews and I have slightly increased my rating. I still believe the contributi9n is limited.
Dear Reviewer,
thank you very much for increasing your rating!
The paper has two main contributions: (1) An algorithm for checking equivalence for mixtures of product distributions (2) A hardness reduction for approximating the total variation distance between two Ising models.
The problem of checking equivalence of product distributions is takes the following as input: we are given and where:
- Each of and is a product distribution over .
- are the weights for and are the weights for . So, the first distribution is the mixture of each weighted by and the second distribution is the mixture of each weighted by . The task is to determine wether the two mixture distributions are the same.
Here is an example (from page 2) an equal mixture of and equals to the distribution . This illustrates how a mixture of two product distributions can equal a different product distribution.
The paper gives a deterministic algorithm whose run-time is which is polynomial in the input size which equals . One idea the algorithm uses is that if the two mixtures and are equal, then (after a normalization) the mixtures and are also equal for every . The algorithm utilizes the idea by iteratively establishing equivalence for at every step establishing equivalence for assuming equivalence for (or to find some for which the probabilities of two distributions differ). As the paper shows, such steps can be achieved by keeping track of bases for solutions of certain types of equations.
Ising models are a fundamental family of probability distributions over . Probability of each in is proportional to where is a degree-2 multilinear polynomial. Ising models are a very well-studied family of distributions modeling systems for which random features have only pairwise interactions (this is because each term in has at most two variables).
A Fully Polynomial-time Randomized Approximation scheme FPRAS is a randomized algorithm that gives a multiplicative -approximation to a desired quantity. Here, this means that one is given a pair of degree-2 polynomials and describing a pair of Ising models, and the goal of the algorithm is to output a multiplicative -approximation to the TV distance between the pair of Ising models. The paper shows that no poly-time algorithm can achieve this task (under a basic complexity-theoretic assumption).
The hardness proof proceeds by developing an approximation-preserving reduction to the problem of approximating the partition function of an Ising model, which is known from previous work to be hard to approximate. As an intermediate step, the reduction goes through the problem of approximating the marginal of an Ising-model distribution on one of the coordinates (this is referred to as the problem of approximating the atomic marginal).
优点
- Mixtures of products are a fundamental family of probability distributions and checking their equivalence is one of the most basic questions about them.
- The algorithm uses an interesting novel idea of keeping track of bases for the solution spaces of certain equations. This idea might find applications for testing equivalence of other classes of distributions.
- The problem of estimating the total variation distance between two Ising model distributions is quite natural, as Ising models are a very well-studied class of probabilistic models.
- The hardness result only relies on the assumption that NP is not in RP, which is a very mild complexity assumption.
缺点
- The algorithm for mixtures of product distributions can only check whether P=Q exactly. The paper would be stronger if it gave an algorithm for approximating the distance between P and Q.
- The paper rules out FPRAS for TV distance between a pair of Ising models, but it seems that there could still be a constant-factor approximation algorithm, and the paper would be stronger if this question was also addressed (i.e. it was shown that this is also hard, or an algorithm was given).
问题
- Was a poly-time randomized algorithm for checking equivalence of mixtures of product distributions known previous to this work?
- To the best of your knowledge, what was known from previous work about approximating TV distance between Ising models?
- Do your hardness result tell us anything about hardness of computing an additive approximation for the TV distance between Ising models?
- Does the problem of estimating the TV distance between Ising models remain to be hard if one is given the values of the partition function and for each of the Ising models?
Dear Reviewer,
Thank you for taking the time to review our paper and your constructive feedback.
We will address your questions below.
-
We do not know of a randomized algorithm for checking the equivalence of mixtures of product distributions.
-
The body of work on TV distance computation is very recent. As we noted in our response to Reviewer Zs6J, the paper
Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, and N. V. Vinodchandran. Efficient distance approximation for structured high-dimensional distributions via learning. In Proc. of NeurIPS, 2020
considers additive approximation of TV distance between ferromagnetic Ising models. However, to the best of our knowledge, the problem of multiplicatively estimating TV distance between Ising models has not been studied before.
-
Our hardness result does not imply anything about the hardness of computing an additive approximation for the TV distance between Ising models. In general, the hardness of multiplicative approximation does not say anything about the hardness of additive approximation (for example, the number of satisfying assignments of a CNF formula is hard to multiplicatively estimate, but easy to additively estimate).
-
This is a very interesting question which we have not thought about. At this point, even if one is given the values of and it is still not clear whether TV distance in general is easy to estimate or not. Note that our proof is based on the hardness of approximation for the partition function (Jerrum and Sinclair, 1993).
Thank you for your response! My rating remains the same. Please make sure to include your answers above in the final version if accepted.
Dear Reviewer,
thank you very much!
We will make sure to include our answers above in the final version.
The paper discusses computational aspects of the total variation (TV) distance between probability distributions and has two contributions:
It provides a deterministic polynomial-time algorithm for testing whether two mixtures of product distributions are equivalent, which means the algorithm can decide whether the TV distance between them is zero.
It demonstrates that, unless NP is contained in RP, there cannot exist an efficient algorithm to estimate the TV distance between arbitrary Ising models. This result points out the computational hardness of this problem.
优点
The paper addresses important computational questions about the total variation distance, which is fundamental in probability and statistics.
The algorithm for equivalence checking of mixtures of product distributions is new and provides a practical solution to a non-trivial problem.This hardness result bridges complexity theory and statistical measures and provides insight into why certain computational tasks are hard.
The proofs are well written, the results are accessible.
缺点
It would be nicer if the paper could elaborate more on the practical applications of the equivalence checking algorithm with regard to performance on real-world data.
The hardness result could also be pushed further by thinking about the possibility of approximate algorithms with different complexity assumptions.
It would be even more applicable and helpful with more examples or case studies.
问题
Can the equivalence checking algorithm be extended to other type of mixtures, say Bayesian networks with bounded treewidth?
Is there any hope to soften the hardness assumption in estimating TV distance between Ising models by considering restricted classes of models?
Do the authors have some recommendations for practical algorithms that could approximate the TV distance between Ising models despite the hardness result?
Dear Reviewer,
Thank you for taking the time to review our paper and your constructive feedback.
We will address your questions below.
- This is a very interesting question. It is not clear to us whether the equivalence-checking algorithm can be extended to other types of mixtures, such as Bayesian networks with bounded treewidth. In fact, we have outlined this question as an open problem (in our Conclusion).
- Regarding our hardness assumption, please note that (that is, SAT does not admit any randomized one-sided-error poly-time algorithm) is a very standard assumption in hardness of approximation results. In fact, the well known result of hardness of approximation of the partition function of the Ising model is under the same assumption (Jerrum and Sinclair, 1993).
- Unfortunately, we do not have in mind any practical approximation algorithm to multiplicatively estimate the TV distance between Ising models. Regarding additive approximations for ferromagnetic Ising models, though, one may use the algorithm of the following paper (that is cited in our work):
Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, and N. V. Vinodchandran. Efficient distance approximation for structured high-dimensional distributions via learning. In Proc. of NeurIPS, 2020.
Thanks for your response! My rating remains the same.
Dear Reviewer,
thank you very much!
This paper addresses the problem of computing the total variation distance (TV-distance) between high-dimensional distributions from a computational standpoint. For distributions with compact descriptions, calculating TV-distance poses significant challenges, as the direct approach incurs exponential time complexity relative to input size. The authors present two main contributions:
(1) They provide a polynomial-time algorithm for determining whether two mixtures of product distributions are identical.
(2) They establish the computational hardness of approximating the TV-distance between two arbitrary Ising models.
The first result is based on a simple and clever algorithm, the second result comes from the standard hardness result of approximating partition functions of Ising models.
Due to the limited time for reviewing, I went through all the proofs and understood the main ideas; however, I did not verify every detailed calculation.
优点
The main contribution of this paper is the first result: deciding whether two mixtures of product distributions are the same. Suppose we have product distributions , where each is an -dimensional product distribution, i.e., is a vector . The paper takes the prefix of , namely for . This distribution is denoted by . Then, they consider the mixture of , denoted by . The algorithm decides whether and are the same for all . The algorithm is based on induction from to . The base case is trivial. The difficult part is that for and , the support of the distribution can be as large as . To reduce the computational cost, the algorithm finds a "sketch" of the two distributions. One needs to check whether for exponentially many . For each , the algorithm views as a linear equation. Instead of checking an exponential number of linear equations, the algorithm finds a basis of the linear system, and the size of the basis is . Then the algorithm only need to check the equations in the basis.
Overall, the algorithm and the definition of are simple and clever, and I think deciding whether two mixtures of product distributions are the same is a basic problem in statistics.
缺点
The hardness result follows from standard results. Proposition 6 provides a self-reduction for the Ising model, which is used in the standard counting-to-sampling reduction. Therefore, the proof of Proposition 6 could be omitted. Proposition 8 essentially states that one can fix the value of a vertex by adjusting the function , allowing the TV distance to encode the marginal distribution.
The relationship between Theorem 1 and Theorem 2 is not very strong, as they pertain to different models.
问题
I think the logic in lines 181–184 is incorrect, though the result is ultimately correct. The logic appears to use the implication to conclude , which is not valid. The result holds, however, because and , so if , then for at least .
For line 451, it should be . Additionally, some of the calculations in lines 441–458 are straightforward, so a few lines could be removed to streamline the presentation.
In the proof of Proposition 7, when setting parameters, it would be helpful to specify how small the parameter ( and ) needs to be to ensure that the relative error of the marginal probability remains smaller than . Could you clarify why machine precision is considered here? Would bits suffice to represent , where denotes the input size?
Dear Reviewer,
Thank you for taking the time to review our paper and your constructive feedback.
We will address your comments below.
- We took care of the logical error and typo pointed out. Thank you!
- Regarding the remark about Proposition 6, we have moved the proof of Proposition 6 to the appendix (see Appendix B).
- Regarding your question about Proposition 7, we have updated the proof to include more details about the choice of the associated parameters and . See the Lines 449 -- 485 of the revision.
Thank you for your reply. For the logic error, in your revision, you added a "if and only if" relation, which is necessary for your algorithm. I think it is true, but it would be good to have a few sentences to explain two directions, because it is important to the correctness of your algorithm.
Dear Reviewer,
thank you very much for your follow-up comment!
Please feel free to have a look at our new revision, whereby we implement your proposal.
Thank you for your reply. My rating remains the same.
Dear Reviewer,
thank you very much!
Dear Reviewers,
The authors have provided their rebuttal to your questions/comments. It will be very helpful if you can take a look at their responses and provide any further comments/updated review, if you have not already done so.
Thanks!
The total variation distance between multivariate distributions is a basic function that appear often in statistics and information theory. Computing TV distance between even simple product distributions can be difficult to understand. This paper first gives an algorithm for testing, and then provides a conditional lower bound for Ising models.
Weakness: appeal to ICLR audience may be less
Reviewers are positive about the paper and gave high scores. I recommend acceptance.
审稿人讨论附加意见
The discussions with authors seem to be precise and straightforward.
Accept (Spotlight)