Performance Bounds for Active Binary Testing with Information Maximization
Reviews and Discussion
This paper aims to provide a tighter bound for the well-known active binary testing strategy based on information maximization (InfoMax). The approach is similar to Garey and Graham (1974) or Loveland (1985); it uses a key assumption that the sequence of tests is δ-unpredictable to bound the minimum expected number of binary tests needed to predict a target variable Y.
Strengths
- For oracle binary tests, the proposed bound improves on existing bounds for the same setting, such as those of Garey and Graham (1974) or Loveland (1985). The most interesting contribution is to reduce the dependence on Y from the log-cardinality log₂|𝒴| to the entropy H(Y) (cf. Theorem 1).
- Experiments on the CUB-2011 (20Q with birds) and AwA2 (20Q with animals) datasets are provided, demonstrating that the proposed bounds (for oracle tests) can be better than the counterparts of Garey and Graham (1974) and Loveland (1985).
- The authors give a new bound for noisy binary tests (cf. Theorem 4); no bounds existed previously for this model.
Weaknesses
- It is hard to see how to design a binary test sequence that is δ-unpredictable, even though this is the key assumption behind the results in this paper. Hence, the proposed bounds (for both oracle and noisy tests) do not guide us on how to design an active binary test sequence based on the InfoMax principle that achieves them.
- The tightness of the given bound also depends on δ. However, in general, it looks hard to find the optimal value of δ for an existing sequence of binary tests.
- In the two provided experiments, the authors assume that Y is uniformly distributed on some finite set 𝒴, so H(Y) = log₂|𝒴|. Therefore, the improvement of the authors' bound in Theorem 1 over Loveland's bound (cf. (6)), shown in Fig. 2 (or Table 1 in the Appendix), mainly originates from better control of the constant factor (which depends on δ). The main interesting contribution, reducing log₂|𝒴| to H(Y), is not demonstrated in these experiments.
- In Section 5, the authors present bounds for noisy tests (via a BSC), which are obtained via a decomposition argument. The design of binary tests achieving these bounds relies on knowledge of the quantities in Lemma 2 and in (1), which looks very hard to obtain in practice. In addition, the tightness of these bounds is not verified in the paper.
Questions
I don't have any questions. Please see the weaknesses above and let me know if I have misunderstood anything.
Some typos and improvements:
- Repetition in (27) and (28). Please remove the redundancy.
- You should mention in Section 3 that the outcome of any test takes values in {0, 1}. This means that your results are limited to binary tests (yes/no questions).
Thank you for your time and feedback. We will address each of your comments (which are highlighted in bold) below:
The main interesting contribution, reducing log₂|𝒴| to H(Y), is not shown in these experiments.
Thank you for this suggestion to help improve the presentation of our results. We have added Figure 3 in the updated paper, which supports our claim that as the distribution of Y becomes more non-uniform (its entropy decreases), our bound is much tighter than Loveland's previous bound.
The δ-unpredictability property does not guide how to design a set of tests that satisfies this property, for both the oracle and the noisy case.
We agree, and this has been discussed as a limitation of our current work. As discussed in the Introduction, without any assumptions on the set of tests, it has been shown in prior work (Loveland, 1985) that the performance of the greedy strategy can be arbitrarily bad compared to the optimal strategy. However, this is often not what is observed in practice. In this work we identify δ-unpredictability as a key property that sets of tests used in practice by the information maximization algorithm often exhibit (Geman et al., 2015). We then prove that, assuming this property holds, the performance of the greedy strategy (in terms of the average number of tests needed for prediction) comes within a constant factor of the entropy in the oracle case, and within a constant factor of the entropy plus a term that depends on the noise in the noisy case. We believe these results are nearly optimal, since the entropy is a lower bound on the performance of any strategy for this problem, and we hope they shed some light on why the greedy strategy works so well in practice. Future work would try to develop testable conditions that could verify whether a given set of tests is δ-unpredictable or not.
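To make the greedy selection rule concrete, here is a minimal Python sketch of adaptive information maximization with oracle tests (this is illustrative only, not the paper's implementation; the prior, the test table, and the confidence threshold are made up):

```python
import numpy as np

# Toy oracle setting: Y takes one of 8 values and each candidate test t is a
# fixed binary function of Y, stored as a 0/1 row of `tests`. All values here
# are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
p_y = rng.dirichlet(np.ones(8))            # prior over Y (need not be uniform)
tests = rng.integers(0, 2, size=(30, 8))   # candidate oracle tests t(y) in {0, 1}

def greedy_infomax(y_true, confidence=0.99, max_tests=30):
    """Adaptively select tests by information maximization until Y is identified."""
    posterior = p_y.copy()
    used = []
    while posterior.max() < confidence and len(used) < max_tests:
        # For oracle (deterministic) tests, I(Y; T | outcomes so far) equals the
        # entropy of P(T = 1 | outcomes so far), so maximizing mutual information
        # means picking the test whose conditional probability is closest to 1/2.
        probs = tests @ posterior                  # P(T_t = 1 | outcomes so far)
        probs[used] = np.inf                       # exclude already-asked tests
        t = int(np.argmin(np.abs(probs - 0.5)))
        used.append(t)
        outcome = tests[t, y_true]                 # oracle answer for the true y
        likelihood = tests[t] if outcome == 1 else 1 - tests[t]
        posterior = posterior * likelihood
        posterior /= posterior.sum()               # Bayes update of the posterior over Y
    return len(used), int(posterior.argmax())

n_used, y_hat = greedy_infomax(y_true=3)
print(f"predicted class {y_hat} after {n_used} tests")
```

Whether such a loop terminates quickly is exactly what δ-unpredictability controls: it guarantees that the vector of conditional probabilities above always contains an entry within δ of 1/2.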
Repetition in (27) and (28). Please remove the redundancy.
Thank you for noticing this. This redundancy has been removed.
You should mention in Section 3 that the outcome of any test takes values in {0, 1}. This means that your results are limited to binary tests (yes/no questions).
This has already been mentioned two lines after Equation 1 in Section 3.
This paper studies the problem of predicting a random variable Y using binary tests. Specifically, the authors analyse the commonly used greedy heuristic of information maximization under the assumption that the set of tests is δ-unpredictable. The main contribution of the paper is new upper bounds on the number of tests needed by information maximization under both oracle tests and noisy tests. The obtained bound for oracle tests is tighter than previous bounds in a certain regime of parameters, while the bound for noisy tests is the first such result.
Strengths
- The paper is very well written. I really appreciate the authors including proof sketches and discussion of high-level ideas, which makes the paper easy to follow even for readers who are not familiar with the problem of active testing.
- I think understanding the performance of a greedy heuristic that has practical applications is an important question. The obtained bound for oracle tests gives a tighter guarantee than previous results in certain parameter regimes, and the paper presents a detailed comparison with previous bounds. This paper is also the first to obtain a bound for information maximization with noisy tests.
Weaknesses
- As the authors comment in the limitations section, the assumption that the tests are δ-unpredictable is not very useful in practice since it is not known how to compute the corresponding δ.
- The assumption of i.i.d. noise for noisy tests also limits the practical application of the results, since the noise often depends on the value of Y and the test outcomes are not independent.
Questions
Do the authors have any insights into resolving the weaknesses mentioned above?
Thank you for your time and feedback. We will address each of your comments (which are highlighted in bold) below:
As the authors comment in the limitations section, the assumption that the tests are δ-unpredictable is not very useful in practice since it is not known how to compute the corresponding δ.
We believe that there needs to be some assumption on the structure of the tests to be able to compute the corresponding δ. One such structure that we believe might be useful is a hierarchical structure among the tests. Specifically, say the tests are arranged in a hierarchy such that a test is true only if all its ancestor tests are true. Similarly, if a test is false, then all tests in the sub-hierarchy rooted at this test are also false. With such a structure, one can obtain several interesting relationships between the tests. For instance, one can immediately conclude that, conditioned on a set of test outcomes, the probability that a test is true is always at least the probability that any of its child tests is true. While more work needs to be done to obtain a precise relationship between the hierarchy and δ, we believe this to be a potential direction to investigate for resolving this weakness.
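To illustrate the kind of structure we have in mind, here is a small hypothetical sketch (the tests, classes, and prior below are invented for illustration, not from the paper) showing the containment and monotonicity relationships such a hierarchy induces:

```python
# Hypothetical hierarchy over oracle tests: "animal" -> "bird" -> "can fly".
# A test is true only if all of its ancestor tests are true.
parent = {"bird": "animal", "can fly": "bird"}            # child -> parent
classes = ["dog", "sparrow", "penguin", "rock"]
truth = {                                                  # t(y) for each oracle test
    "animal":  {"dog": 1, "sparrow": 1, "penguin": 1, "rock": 0},
    "bird":    {"dog": 0, "sparrow": 1, "penguin": 1, "rock": 0},
    "can fly": {"dog": 0, "sparrow": 1, "penguin": 0, "rock": 0},
}
prior = dict(zip(classes, [0.3, 0.3, 0.2, 0.2]))

def prob_true(test, posterior):
    """P(test is true) under a posterior over the classes."""
    return sum(p for y, p in posterior.items() if truth[test][y] == 1)

# Monotonicity: under any posterior, P(child true) <= P(parent true), because
# {y : child(y) = 1} is contained in {y : parent(y) = 1}.
for child, par in parent.items():
    assert prob_true(child, prior) <= prob_true(par, prior)

# Implication propagation: observing "bird" = 0 forces "can fly" = 0 as well.
posterior = {y: p for y, p in prior.items() if truth["bird"][y] == 0}
total = sum(posterior.values())
posterior = {y: p / total for y, p in posterior.items()}
print(prob_true("can fly", posterior))   # 0.0
```

Relationships of this kind are the raw material one would hope to turn into a bound on δ for hierarchically organized test sets.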
The assumption of i.i.d. noise for noisy tests also limits the practical application of the results, since the noise often depends on the value of Y and the test outcomes are not independent.
We agree with the reviewer; our noise model is meant to be a first step toward analyzing the performance of the greedy strategy under noise. The major challenge with more complicated noise models that depend on Y is that the mutual information term becomes analytically intractable. We believe variational approximations developed to estimate mutual information can offer tractable alternatives for analyzing the greedy strategy's performance in such settings.
The paper considers the problem of identifying the value of a random variable Y through a sequence of binary tests. The paper focuses on studying the Information Maximization procedure, where at each step we greedily choose the test that maximizes the conditional mutual information.
The paper first considers the case where the allowed tests are deterministic functions of Y (called oracle tests), and then studies the case where the outputs of the oracles are corrupted by noise. In both cases, the paper assumes that the set of tests satisfies a "δ-unpredictability property": at each stage of the InfoMax procedure, it is always possible to find a test whose conditional probability of being positive is at most δ away from 1/2 (unless, of course, we have already identified Y at the desired accuracy). In other words, we can always find a test that approximately bisects the set of possible values of Y.
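In symbols (my paraphrase of the property as described here, which may differ in details from the paper's exact definition): for every sequence of previously selected tests and observed outcomes reached by the procedure, as long as Y has not yet been identified at the desired accuracy, there exists a test T in the allowed family with

$$\left|\, \mathbb{P}\bigl(T = 1 \mid T_1 = o_1, \dots, T_{k-1} = o_{k-1}\bigr) - \tfrac{1}{2} \,\right| \;\le\; \delta .$$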
Assuming that the set of tests satisfies the δ-unpredictability property, the paper proves an upper bound on the expected number of tests needed to identify Y. The bound depends on δ and is proportional to the entropy H(Y). In the noisy case, the bound also depends on the noise level.
Strengths
The problem considered in the paper is interesting, and the bound that is given is optimal up to constant factors since it is proportional to the entropy H(Y).
Weaknesses
I have to admit that I am not very familiar with the literature on this particular topic, but from an information-theoretic perspective, the novelty/contribution is a bit limited: the techniques used in the paper are very simple and the results are not too surprising.
Questions
Did the authors consider extending the work to more general tests, where a test consists of passing Y through a noisy channel whose input alphabet is the set of possible values of Y and whose output alphabet is {0, 1}?
Thank you for your time and feedback. We will address each of your comments (which are highlighted in bold) below:
I have to admit that I am not very familiar with the literature on this particular topic, but from an information-theoretic perspective, the novelty/contribution is a bit limited.
We are not sure why the reviewer feels the novelty of our work is limited from an information-theoretic perspective. We provide a bound on the performance of the greedy strategy that is within a constant factor of the entropy of Y, H(Y), which had not been established previously in the literature. Moreover, we show that our bound improves upon previously known bounds in the literature. We provide more context for our work in the next paragraph.
It is well known that when one has access to all possible binary functions of Y as tests, the greedy strategy needs on average a number of tests within a small additive constant of H(Y) to identify Y. However, in practical scenarios one never has access to all possible binary functions of Y. Take the example of disease diagnosis: Y refers to the disease of the patient, and the tests ask about different symptoms the patient might be experiencing. The set of all binary functions of Y (the power set of the set of all diseases) is much larger than the set of possible symptoms a patient might experience. In such restricted settings, there is no guarantee on the performance of the greedy strategy. This work is an effort to gain some theoretical understanding in such settings by imposing assumptions on the joint distribution of the set of tests and Y. The assumptions used in this paper are informed by the observation that, in practical scenarios, the available set of tests is often δ-unpredictable for modest values of δ, say between 0.15 and 0.3.
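As a back-of-the-envelope illustration with made-up numbers (this example is ours, not taken from the paper): suppose Y is uniform over 8 diseases, so H(Y) = 3 bits, and suppose every available symptom test, at every step, has a conditional probability of being positive that is at least 0.3 away from 1/2. Writing h(p) = -p log₂ p - (1-p) log₂(1-p) for the binary entropy, each test outcome then carries at most

$$h(0.8) \approx 0.72 \ \text{bits of information about } Y, \qquad \text{so roughly} \ \frac{H(Y)}{h(0.8)} = \frac{3}{0.72} \approx 4.2$$

tests are needed on average, compared with exactly 3 when all binary functions of Y are available and every test can be made to split the posterior evenly.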
Did the authors consider extending the work to more general tests, where a test consists of passing Y through a noisy channel whose input alphabet is the set of possible values of Y and whose output alphabet is the binary alphabet {0, 1}?
We hope that the results presented in our paper motivate future research in this direction, where more general noise models are considered. This is discussed in the conclusion & limitations section of the paper. The main challenge with more complicated noise models beyond the i.i.d. assumption in this paper is that the mutual information terms are then no longer analytically tractable, which complicates the analysis.
Regarding the novelty of the work: I am not questioning that the result itself is new. My comment was mainly about the information-theoretic techniques that were used/developed and how sophisticated/advanced they are. In that regard, I find the proofs to be a bit simple, and in that sense one would say that the information-theoretic contributions are a bit limited. This does not mean that the paper does not have other important contributions.
Having simple proofs is not necessarily a bad thing and in some contexts it can be a very good thing, for example:
- If the problem has been extensively considered in the past and the research community had overlooked the simple proof.
- If the work introduces a new problem that is very interesting to pursue in its own right and raises many open research questions that would be of interest to the community.

However, I am not sure these cases apply to this paper, and I cannot properly assess the other contributions of the paper (beyond the sophistication of the proofs) because I am not very familiar with the literature on this particular topic, which is why I added the disclaimer "I have to admit that I am not very familiar with ...". I invite the chair to take this into account. I only provided my review as an information theorist.
This paper deals with the problem of determining the value of a random variable Y by adaptively performing a series of tests with binary output. The goal is to minimize the expected number of tests needed to determine the value of Y with the desired confidence. In the case where any possible binary test on the values of Y is available, this problem has been shown to be almost optimally solvable, with an expected number of tests close to the entropy H(Y), via the information maximization strategy (i.e., each time selecting the test whose probability of a positive outcome is closest to 1/2). However, the authors consider the more practical setting where only a specific set of tests is available. The results of the paper identify sufficient conditions that this family of tests has to satisfy, together with corresponding bounds on the expected number of tests needed. In particular, the notion of δ-unpredictability is considered for the test families, where δ is a measure of uncertainty for the test outcomes. The main result is that, using the greedy information maximization strategy with a δ-unpredictable family, one can identify the value of Y using, in expectation, a number of tests proportional to H(Y) (with a constant depending on δ). This setting is also extended to the case where the test outcomes are noisy as a result of independently passing through a binary symmetric channel, and a similar result is shown involving an additional parameter representing the target confidence for the value of Y.
Strengths
The paper deals with a fundamental problem from the perspective of more realistic and practical settings than those previously considered, including a noise model.
Weaknesses
There is no discussion about lower bounds on the number of tests needed for either the noisy or the oracle (noiseless) case. I believe one should be able to derive something using information theory, but it's not clear to me if those bounds would match the upper bounds in the paper.
The presentation could be improved, since the results and contributions are not entirely clear from the introduction.
Minor comments:
- In Theorem 1 (and similarly for Theorem 4): the use of the absolute value in the denominator is confusing since the expression inside is always negative. I suggest writing the negation explicitly instead.
- Page 3, "noisy tests" paragraph, line 7: by "pre-noise" did you mean to say "de-noise"?
- Page 4, line 17: the word "after" is probably missing after the word "or".
Questions
- Are there any results for the case where all tests are chosen (non-adaptively) at the beginning?
- Can the expression in Theorem 1 (and similarly Theorem 4) be written in terms of the entropy of a Bernoulli distribution? One would expect this because this seems to be the amount of information revealed with each test.
Are there any results for the case where all tests are chosen (non-adaptively) at the beginning?
Yes, it is known that if the test outcomes are conditionally independent given Y, then greedy selection based on mutual information (done in expectation, without observing any outcomes, unlike the adaptive algorithm studied in this paper) produces a (1 - 1/e)-approximation to the optimal solution (Krause and Guestrin, 2005). This result is obtained using the theory of submodular functions, since the mutual information (in expectation over test outcomes) is a submodular function when test outcomes are conditionally independent given Y. Interestingly, mutual information conditioned on actual test outcomes (not in expectation), which is the relevant quantity in adaptive strategies, is no longer submodular (Chen et al., COLT 2015).
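For concreteness, here is a minimal sketch of such non-adaptive greedy selection on a toy joint distribution (the distribution, the number of tests, and the budget are all invented for illustration; this is not code from Krause and Guestrin):

```python
import itertools
import numpy as np

# Toy setup: Y is uniform over 4 classes; test t reports 1 with probability
# p_t1_given_y[t, y], and test outcomes are conditionally independent given Y.
p_y = np.full(4, 0.25)
p_t1_given_y = np.array([
    [0.9, 0.9, 0.1, 0.1],   # test 0 roughly separates {0, 1} from {2, 3}
    [0.9, 0.1, 0.9, 0.1],   # test 1 roughly separates {0, 2} from {1, 3}
    [0.8, 0.8, 0.8, 0.1],   # test 2 mostly isolates class 3
])

def joint_mi(selected):
    """I(Y; T_S) in bits for a set S of conditionally independent tests (by enumeration)."""
    mi = 0.0
    for outcomes in itertools.product([0, 1], repeat=len(selected)):
        # P(T_S = outcomes | Y = y) factorizes thanks to conditional independence.
        p_o_given_y = np.ones_like(p_y)
        for t, o in zip(selected, outcomes):
            p_o_given_y *= p_t1_given_y[t] if o == 1 else 1 - p_t1_given_y[t]
        p_o = float(np.sum(p_y * p_o_given_y))
        mask = p_o_given_y > 0
        mi += np.sum(p_y[mask] * p_o_given_y[mask]
                     * np.log2(p_o_given_y[mask] / p_o))
    return mi

# Non-adaptive greedy: repeatedly add the test with the largest marginal gain in
# expected mutual information, without ever observing an outcome.
selected, budget = [], 2
for _ in range(budget):
    gains = {t: joint_mi(selected + [t]) - joint_mi(selected)
             for t in range(len(p_t1_given_y)) if t not in selected}
    selected.append(max(gains, key=gains.get))
print("greedily selected tests:", selected)
```

Here the mutual information is computed by brute-force enumeration, which is only feasible for tiny examples.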
However, compared to fixed strategies, adaptive strategies like the information maximization algorithm are empirically found to be more efficient, since subsequent choices of tests are conditioned on the outcomes observed so far. Nevertheless, this empirical advantage is poorly understood in theory, and we hope the results presented in our paper help shed some light on this matter.
Thank you for your time and feedback. We will address each of your comments (which are highlighted in bold) below:
There is no discussion about lower bounds on the number of tests needed for either the noisy or the oracle (noiseless) case. I believe one should be able to derive something using information theory, but it's not clear to me if those bounds would match the upper bounds in the paper.
Given a random variable Y, a lower bound on the average number of tests needed (regardless of the set of tests available) is given by the entropy of Y, H(Y). This means that no strategy (greedy or otherwise) can do better than this. This is mentioned in the second paragraph of the Introduction.
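For completeness, here is a sketch of the standard argument behind this lower bound (our informal paraphrase, not quoted from the paper): in the oracle case, the sequence of observed binary outcomes determines both when to stop and the prediction, so it forms a prefix-free binary description of Y, and

$$\mathbb{E}[N] \;\ge\; H(T_1, \dots, T_N) \;\ge\; I(Y; T_1, \dots, T_N) \;=\; H(Y) - H(Y \mid T_1, \dots, T_N) \;=\; H(Y),$$

where the first inequality is the source coding theorem applied to the prefix-free transcript and the last equality uses that the outcomes determine Y exactly.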
Our contribution is to show that when one has access to a set of oracle tests satisfying a property called δ-unpredictability, the average number of tests needed by the information maximization strategy is within a constant factor of H(Y), where the constant depends logarithmically on δ. We believe this upper bound to be nearly optimal given the entropic lower bound mentioned in the previous paragraph.
Finally, lower bounds specific to a given set of tests that is δ-unpredictable (for both the noisy and the oracle case) are an interesting direction for future research. The main difficulty in showing such lower bounds lies in constructing a set of tests that is δ-unpredictable for a given value of δ. This has been discussed as a limitation of our current work.
The presentation could be improved, since the results and contributions are not entirely clear from the introduction.
We have rewritten our contributions in the Introduction section to make them clearer (annotated in blue). Our main contributions are as follows:
- We first study the oracle case where tests are functions of Y. Assuming the given set of tests is δ-unpredictable for some δ, we prove that the greedy strategy needs on average at most a constant factor (depending on δ) times H(Y) tests to identify (predict) Y. To the best of our knowledge, this is the first bound on the performance of the greedy strategy that explicitly depends on the entropy of Y. This is desirable since a lower bound on the average number of tests needed for any given Y is given by the entropy of Y (Shannon, 1948). Moreover, we show that our bound is tighter than previously known bounds for oracle tests in practically relevant settings.
- We then extend our analysis to the noisy case, where we assume that test outcomes are corrupted via a binary symmetric channel. We obtain an upper bound on the performance of the greedy strategy that explicitly depends on H(Y) and the noise level. Specifically, our bound in this case is again within a constant factor of the entropy of Y, modulo an additional term, where the constant factor and the additional term depend on δ and the noise level. To the best of our knowledge, this is the first such result for the greedy strategy with noisy tests.
Minor comments: In Theorem 1 (and similarly for Theorem 4): the use of the absolute value in the denominator is confusing since the expression inside is always negative. I suggest writing the negation explicitly instead. Page 3, "noisy tests" paragraph, line 7: by "pre-noise" did you mean to say "de-noise"? Page 4, line 17: the word "after" is probably missing after the word "or".
Thank you for these suggestions and we have implemented these changes in our updated paper.
Can the expression in Theorem 1 (and similarly Theorem 4) be written in terms of the entropy of a Bernoulli distribution? One would expect this because this seems to be the amount of information revealed with each test.
For Theorem 1, yes! In fact, a reviewer of an earlier version of this paper provided us with such a result, which is reported in Theorem 6 (along with a proof) in the appendix. More specifically, the result says that the number of tests needed in the oracle case is upper bounded by an expression involving the entropy of a Bernoulli random variable. Since this was not our result, we felt it would not be ethical to include it in the main paper.
Finally, it is not straightforward to extend that result to Theorem 4. This is because, in the presence of noise, Eq. 14 (in the proof) is no longer valid, and extending it requires further assumptions on the joint distribution of the test outcomes and Y beyond the stopping criterion assumed in this paper.
We thank all reviewers for their time, effort, and valuable feedback that helped improve the presentation of this work. We are glad the reviewers found our work very well written (Reviewer NgKX) and the results novel and interesting (Reviewers fURR & NgKX). In light of the reviews, we have made the following changes to the paper (annotated in blue).
- Rewrote our contributions in the introduction to make them clearer, as requested by Reviewer xWDX.
- Added an experiment to verify the improvement of our bounds over prior work in the scenario where the distribution over Y (the labels) is not uniform, as requested by Reviewer fURR.
This paper addresses the problem of predicting a random variable via binary tests from an information-theoretic perspective. Despite the interesting setting and the information-theoretic approach, the paper can be further improved with better organisation and a clearer explanation of the theoretical bounds. The reviewers' concerns are not fully resolved, and the paper may not be ready to appear at the conference in its current form.
Why not a higher score
The organisation of the paper can be improved and the contribution and significance can be explained more clearly.
Why not a lower score
N/A
Reject