4- In the proof of Lemma 2, why should the global loss be minimized at zero and not some other non-negative value?

The assumption of Lemma 2 is that the global loss is minimized at zero. Obviously, being a sum of square, 0 is the lower bound of this term's values overall. The reviewer asks how we can guarantee that this lower bound is indeed reached. Our argument in this regard is the same as that used by FastSHAP [1]. That is, since ViaSHAP relies on an MLP to learn its values, the universal approximation theorem (Cybenko [2]; Hornik [3]) states that we can learn the Shapley values up to an arbitrary accuracy.

Nonetheless, to push this answer further, we also added a proof for a relaxed version of Lemma . We prove that, as the loss converges to , so do the attributed importances of non-influential criteria. In particular, if the loss has value , the importance attributed to a non-influential criterion is at most . The proof was added in the paper in the appendix, after the proof of Lemma 2.

In practice, it is unlikely for a loss to exactly reach its global optimum. Instead, it approximates it. We assume here that the loss has reached a value . We propose an upper bound on conditioned on .

Since the loss is composed only of non-negative terms, this means that:

Thus, in particular, we have the two following cases:

and

This give us:

Thus, as the loss function converges to , so does the importance attributed to features with no influence on the outcome.

Of course, this is a theoretical argument, which does not guarantee in practice that these results generalize well, or that the fit is perfect. This is why we provide extensive experiments to confirm empirically the validity and performance of our approach.

[1] FastSHAP: Real-Time Shapley Value Estimation, Jethani et al., ICLR2022.

[2] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989.

[3] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991