PaperHub score: 7.3/10
Decision: Rejected (4 reviewers)
Ratings: 4, 5, 5, 4 (min 4, max 5, std 0.5)
Confidence: 4.0
Novelty 3.5, Quality 3.3, Clarity 3.3, Significance 3.0
NeurIPS 2025

Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

Contrary to common belief, we demonstrate that Sliced Mutual Information is not a generally reliable measure of statistical dependence

Abstract

Keywords
information theory, mutual information, sliced mutual information, curse of dimensionality

Reviews and Discussion

Review (Rating: 4)

The paper demonstrates that sliced mutual information (SMI), recently considered a good alternative to mutual information (MI) in high dimensions, has multiple flaws. For instance, the paper demonstrates that SMI saturates as the correlation between random vectors increases, and that it prioritizes information redundancy instead of real information. The paper provides both theoretical and empirical results to enforce the statements.

Strengths and Weaknesses

Strengths

  1. The paper is well-written. It flows nicely and it is easy to read.
  2. The paper provides an important contribution.
  3. The contribution is both theoretical and numerical.

Weaknesses

  1. The relevance of the contribution is relatively limited, as SMI is not as widely used as MI, copulas, or other measures of statistical dependence.
  2. The experimental results should be more extensive

Questions

  1. Why do you use the KSG estimator for the experiments in Section 5? I would strongly suggest the inclusion of neural estimators like MINE, as KSG works poorly for high dimensions.
  2. In Figure 3, I cannot explain to myself why the estimate of SI is so flat. Shouldn't it work well for moderate dimensions, as SI was specifically designed for high dimensions? (I do not consider d=8 a high dimension.)
  3. Since you are now showing that SI does not work well in high dimensions, what should researchers use when dealing with high dimensions? Is it preferable to estimate MI between representations of reduced dimensionality or to use SI? I think the paper would benefit from such an analysis.
  4. Can you please report all the steps for the proof of Lemma 4.1? Especially the one between line 553 and 554.
  5. I understand that for the Gaussian case the saturation can be proved in closed form, and I guess that a similar situation would happen for non-Gaussian scenarios. However, I think it would definitely increase the quality of the paper if it analyzed the following question: does the phenomenon of saturation in MI estimation relate to the saturation of SI? I encourage the authors to write a small paragraph comparing saturation in MI and in SMI. I would strongly suggest including in that paragraph the papers [a, b], which analyze this saturation problem from two different perspectives: [a] demonstrates the presence of an upper bound for lower-bound MI estimators, while [b] studies the impact of wrong data shuffling (permutations) on MI estimators. This would help the reader better differentiate these saturation phenomena.
  6. I think that the relevance of Section 4.1 is not very clear. Can you elaborate on why it is important?
  7. What is Bebeziana? I see you used it in the code but I do not understand where to find its description
  8. Does the << symbol mean absolutely continuous? The authors should explicitly state that.
  9. How do you practically implement the energy constraint in (4)?
  10. typo in "Proof of Lemma A.3". Should be A.4

[a] McAllester, David, and Karl Stratos. "Formal limitations on the measurement of mutual information." International Conference on Artificial Intelligence and Statistics. PMLR, 2020.

[b] Letizia, Nunzio Alexandro, Nicola Novello, and Andrea M. Tonello. "Mutual Information Estimation via f-Divergence and Data Derangements." The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.

Limitations

yes

Final Justification

I will keep my score at 4, as I believe the paper presents important results, but I would have appreciated the inclusion in the paper of a deeper analysis of my Question 3.

Formatting Concerns

No concerns

Author Response

Dear Reviewer 4ZU6,

We sincerely thank you for your thorough review. Below, we provide answers to the questions and concerns you raised.

Weaknesses:

  1. Although introduced only in 2021, SMI has rapidly gained traction due to its ease of implementation and parallels with other slicing methods (e.g., sliced Wasserstein distances). By revealing fundamental limitations of SMI, and slicing‑based measures more generally, our work clarifies when these approaches can mislead practitioners and suggests reconsidering slicing strategies. We thus expect it to inspire further research into efficient, projection‑based dependency measures across various fields.
  2. It is true that we focus on simple datasets, as they suffice to demonstrate the true scale of SMI's shortcomings. However, motivated by your feedback, we conducted four additional experiments using complex high-dimensional data from [r.a]. Due to space limitations, we are unable to include the results here. Please refer to 'Weakness 1' in our response to Reviewer GJD2.

Questions:

  1. We chose the KSG estimator as our baseline precisely because it requires no auxiliary optimization, a key advantage of SMI over neural estimators such as MINE, which rely on iterative network training. Once one adopts a more computationally involved backbone for MI estimation, the primary appeal of SMI diminishes. Moreover, in our experiments the dimensions remain within the practical regime for KSG to yield stable estimates, as also shown by the low variance. Additionally, we use SMI-MINE in Section 6.2 to show the redundancy bias for this method. Overall, the obtained empirical results support our theoretical findings.

  2. We understand your surprise: the observation you mention highlights the core "curse of slicing" uncovered by our work. Although one might expect sliced estimators to perform well, our theoretical and empirical results demonstrate that they can fail even at moderate $d$.

  3. In light of our findings on SMI, one can recommend one of two strategies for high‑dimensional dependence estimation. The first one is indeed dimension‑reduction followed by MI estimation, i.e., we project the data onto a low-dimensional subspace that preserves most of the dependence (e.g., via PCA, CCA, or learned neural embeddings) and then apply a standard MI estimator (e.g., KSG) in that reduced space.

    However, if we still want to use the idea of slicing, then one may slice both the joint distribution $\mathbb{P}_{X, Y}$ and the product of marginals $\mathbb{P}_X \otimes \mathbb{P}_Y$ onto the same random subspace. This results in estimating $\mathrm{D}_{\mathrm{KL}}(\mathbb{P}' \,\|\, \mathbb{Q}')$, and in this way we overcome some drawbacks of SMI/$k$-SMI (e.g., each slice now retains nontrivial dependence even when redundancy is low). The trade-off is that one must employ KL-divergence estimators for arbitrary distributions, rather than direct MI estimators.

    On the whole, we regard this important point as a direction for future work.

  4. We begin our proof by recalling the MI formula for two $d$-dimensional Gaussian vectors: $\mathrm{I}(X; Y) = -\frac{d}{2} \log(1 - \rho^2)$. By rotational invariance, for any fixed unit vectors $\theta, \phi \in \mathbb{S}^{d-1}$ the one-dimensional projections satisfy
    $$\left(\theta^T X,\, \phi^T Y\right) \sim \mathcal{N}\left(0, \begin{pmatrix} 1 & \rho\, \theta^T \phi \\ \rho\, \theta^T \phi & 1 \end{pmatrix}\right).$$

    By the definition of SMI,
    $$\mathrm{SI}(X; Y) = \mathrm{I}(\theta^T X; \phi^T Y \mid \theta, \phi) = -\frac{1}{2} \mathbb{E}\left[\log\left(1 - \rho^2 |\theta^T \varphi|^2\right)\right].$$

    Since $|\theta^T \varphi|^2 \sim \mathrm{Beta}(1/2, (d-1)/2)$ by Corollary A.5, we substitute its density and write in Eq. (6):
    $$\mathrm{SI}(X; Y) = -\frac{1}{2\, \mathrm{B}(\frac{1}{2}, \frac{d-1}{2})} \int_0^1 \log(1 - \rho^2 x)\, (1 - x)^{\frac{d-3}{2}} x^{-\frac{1}{2}}\, \mathrm{d}x.$$

    Next, we use the identity $\ln(1 - z) = -z\, {}_2F_1(1, 1; 2; z)$, yielding
    $$\mathrm{SI}(X; Y) = \frac{\rho^2}{2\, \mathrm{B}(\frac{1}{2}, \frac{d-1}{2})} \int_0^1 x^{\frac{1}{2}} (1 - x)^{\frac{d-3}{2}}\, {}_2F_1(1, 1; 2; \rho^2 x)\, \mathrm{d}x.$$

    We then simplify the Beta prefactor via the relationship between the beta and gamma functions and the property $\Gamma(z + 1) = z \Gamma(z)$:
    $$\frac{1}{\mathrm{B}(\frac{1}{2}, \frac{d-1}{2})} = \frac{\Gamma(\frac{d}{2})}{\Gamma(\frac{1}{2}) \Gamma(\frac{d-1}{2})} = \frac{\frac{2}{d} \Gamma(\frac{d}{2} + 1)}{2 \Gamma(\frac{3}{2}) \Gamma(\frac{d-1}{2})} = \frac{1}{d} \cdot \frac{\Gamma(\frac{d}{2} + 1)}{\Gamma(\frac{3}{2}) \Gamma\left((\frac{d}{2} + 1) - \frac{3}{2}\right)}.$$

    At the same time, we group the powers in the integrand:
    $$\mathrm{SI}(X; Y) = \frac{\rho^2}{2d} \cdot \frac{\Gamma(\frac{d}{2} + 1)}{\Gamma(\frac{3}{2}) \Gamma\left((\frac{d}{2} + 1) - \frac{3}{2}\right)} \int_0^1 x^{\frac{3}{2} - 1} (1 - x)^{(\frac{d}{2} + 1) - \frac{3}{2} - 1}\, {}_2F_1(1, 1; 2; \rho^2 x)\, \mathrm{d}x,$$

    and recognize the remaining integral as an Euler-type transform (recall that the generalized hypergeometric function is symmetric in its upper parameters $a_i$ and separately symmetric in its lower parameters $b_j$):
    $${}_{p+1}F_{q+1}\left(c, a_1, \dots, a_p;\, s, b_1, \dots, b_q;\, z\right) = \frac{\Gamma(s)}{\Gamma(c) \Gamma(s - c)} \int_0^1 t^{c-1} (1 - t)^{s - c - 1}\, {}_pF_q\left(a_1, \dots, a_p;\, b_1, \dots, b_q;\, z t\right) \mathrm{d}t,$$
    valid when $\mathrm{Re}(s) > \mathrm{Re}(c) > 0$. Thus, we obtain
    $$\mathrm{SI}(X; Y) = \frac{\rho^2}{2d}\, {}_3F_2\left(1, 1, \frac{3}{2};\, \frac{d}{2} + 1, 2;\, \rho^2\right).$$

    To prove an upper bound, we introduce the variable $\eta = (1 + \theta^T \varphi) / 2$, so that in the limit $\rho \rightarrow 1$ SMI becomes
    $$\mathrm{SI}(X; Y) = -\frac{1}{2} \mathbb{E}\left[\log\left(1 - |\theta^T \varphi|^2\right)\right] = -\frac{1}{2} \mathbb{E} \log\left(1 - (2\eta - 1)^2\right) = -\frac{1}{2} \mathbb{E} \log 4\eta(1 - \eta) = -\log 2 - \mathbb{E}[\log \eta],$$
    where we used that $\eta \overset{d}{=} 1 - \eta \sim \mathrm{Beta}(\frac{d-1}{2}, \frac{d-1}{2})$ by Corollary A.5. Finally, the Beta-moment identity $\mathbb{E}[\log \eta] = \psi(\alpha) - \psi(2\alpha)$, where $\psi$ denotes the digamma function, yields
    $$\mathrm{SI}(X; Y) = \psi(d - 1) - \psi\left(\tfrac{d-1}{2}\right) - \log 2.$$

    To obtain an upper bound, we apply the digamma inequalities $\log\left(x + \frac{1}{2}\right) - \frac{1}{x} \leq \psi(x) \leq \log\left(x + e^{\psi(1)}\right) - \frac{1}{x}$ for $x > 0$. Thus, we have
    $$\mathrm{SI}(X; Y) \leq \log\left(d - 1 + e^{\psi(1)}\right) - \log(d/2) - \log 2 + \frac{1}{d-1} = \log\left(1 + \frac{e^{\psi(1)} - 1}{d}\right) + \frac{1}{d-1}.$$

    To simplify the bound, one can note that $e^{\psi(1)} - 1 < 1$, $\log(1 + x) < x$, and $1/d < 1/(d-1)$ (a short numerical cross-check of the closed form above is sketched after this list).

  5. In our paper, we consider limitations arising in the infinite-sampling regime (so the saturation arises solely from the inherent biases of SMI itself). However, both MI and sliced-MI estimators built as distribution-free lower bounds from $N$ samples are subject to the same (additional) finite-sample limit, i.e., $\hat{\mathrm{I}}, \widehat{\mathrm{SI}} < \log N$ with high probability ([a], [b]). This "saturation" of the estimates is a sampling-limit phenomenon and applies equally to MI and its sliced average. Besides, we observe empirically that SMI saturates noticeably faster in practice: for instance, in Figure 3 the estimate plateaus at roughly 2 nats even though $\ln N \approx 9.2$ for $N = 10^4$ samples (indicating that the infinite-sampling limit is reached much sooner than the finite-sample limit). By contrast, the measures themselves behave very differently in high dimensions: the full MI grows linearly in $d$, $\mathrm{I}(X; Y) = \Theta(d)$, whereas SMI decays as $O(1/d)$ (Lemma 4.1).

    Additionally, we note that some estimators (e.g., MINDE [r.b] and MIENF [r.a]) are neither lower nor upper bounds and thus avoid this particular theoretical limitation in finite-sample regimes. These estimators, however, employ generative models and are several levels above SMI in terms of computational complexity.

  6. Section 4.1 extends our analysis to non-uniform slicing methods (i.e., max-SMI and optimal SMI) to show that the same redundancy bias persists even when projections are learned rather than chosen at random. We abbreviated this discussion because the majority of applications still rely on (averaged) SMI, but the key takeaway is that no variant of SMI escapes the "curse of slicing".

  7. This is a third-party plugin we used to seed random generators and gather experimental results. To avoid using it, please, replace bebeziana.seed_everything(seed, to_be_seeded) in source/evaluate/run.py with numpy.random.seed(seed)

    We are deeply sorry for the inconvenience. The supplementary material will be fixed to include this piece of code.

  8. Yes, we use "$\ll$" to denote absolute continuity. Writing $\mathbb{P} \ll \mathbb{Q}$ means that whenever $\mathbb{Q}$ assigns zero probability to an event, so does $\mathbb{P}$. We will add this definition where the symbol first appears.

  9. We enforce the energy constraint via standardization and whitening. Standardization rescales each coordinate by $D^{-1/2}$ with $D = \mathrm{diag}(\mathrm{Var}[X_i])$, i.e., $D^{-1/2}_{ii} = \mathrm{Var}[X_i]^{-1/2}$, while whitening replaces $X$ with $\Sigma^{-1/2} X$, where $\Sigma = \mathrm{Cov}(X)$ and $\Sigma^{-1/2}$ is computed via scipy.linalg.sqrtm (see the short sketch after this list).

  10. Thank you for catching this! It will be corrected.
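
For Question 4, the closed form derived above can be sanity-checked numerically. Below is a minimal sketch (ours, not part of the paper's code) that compares a Monte Carlo estimate of $\mathrm{SI}(X;Y) = -\frac{1}{2}\mathbb{E}[\log(1-\rho^2(\theta^T\varphi)^2)]$ against $\frac{\rho^2}{2d}\,{}_3F_2(1,1,\frac{3}{2};\frac{d}{2}+1,2;\rho^2)$; it assumes numpy and mpmath are available.

```python
import numpy as np
from mpmath import hyp3f2

def smi_gaussian_mc(d, rho, n_slices=200_000, seed=0):
    """Monte Carlo estimate of SI(X;Y) = -1/2 E[log(1 - rho^2 (theta^T phi)^2)]
    with theta, phi independent and uniform on the unit sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n_slices, d))
    phi = rng.standard_normal((n_slices, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    inner_sq = np.sum(theta * phi, axis=1) ** 2
    return -0.5 * np.mean(np.log(1.0 - rho ** 2 * inner_sq))

def smi_gaussian_closed_form(d, rho):
    """Closed form from Lemma 4.1: rho^2 / (2d) * 3F2(1, 1, 3/2; d/2 + 1, 2; rho^2)."""
    return rho ** 2 / (2 * d) * float(hyp3f2(1, 1, 1.5, d / 2 + 1, 2, rho ** 2))

for d in (2, 8, 64):
    print(d, smi_gaussian_mc(d, 0.9), smi_gaussian_closed_form(d, 0.9))
```

For Question 9, a minimal sketch of the two normalization schemes (per-coordinate standardization and full whitening); this illustrates the described procedure and is not the authors' exact implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def standardize(x):
    """Center and rescale each coordinate by Var[X_i]^{-1/2}."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def whiten(x):
    """Replace X with Sigma^{-1/2} X so that the whitened covariance is the identity."""
    x_centered = x - x.mean(axis=0)
    sigma = np.cov(x_centered, rowvar=False)
    sigma_inv_sqrt = np.linalg.inv(np.real(sqrtm(sigma)))
    return x_centered @ sigma_inv_sqrt

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 5)) @ rng.standard_normal((5, 5))
print(np.cov(whiten(x), rowvar=False).round(2))  # approximately the identity matrix
```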

We would like to once again sincerely thank Reviewer 4ZU6 for their work. If any concerns remain, we will be happy to address them as well.

[r.a] Butakov I. et al. "Information Bottleneck Analysis of Deep Neural Networks via Lossy Compression." Proc. of ICLR 2024

[r.b] Franzese G. et al. "MINDE: Mutual Information Neural Diffusion Estimation." Proc. of ICLR 2024

Comment

Thank you for your clarifications; they resolved many of my doubts. Where do you specify in the paper that you only consider an infinite sampling regime? (Question 5) I think your answer to Question 5 is satisfactory, but I would suggest you include that discussion in the paper, as it would improve its clarity. I believe the paper's contribution is important, but I would assign a rating of 5 only if the paper included a deep analysis of Question 3: highlighting the SMI problem is important, but providing a good/better alternative would be fundamental, as is often done in ML papers with an intent similar to yours. Since you consider it only for future work, I will keep my rating at 4.

Comment

Dear Reviewer 4ZU6,

We are glad that almost all the raised doubts have been addressed.

We assumed an infinite sampling regime implicitly (as we are using mathematical expectations instead of averages in our analysis), but we agree that an explicit discussion of the sampling regime will improve clarity. In the revised manuscript, we will state clearly that all of our theoretical analysis assumes an infinite-sample regime, and we will add a short paragraph contrasting this with the finite-sample setting.

Regarding Question 3, our work does not introduce a new algorithm but instead reveals a fundamental trade-off: all existing slicing-based measures with computational simplicity comparable to SMI inevitably encounter the same limitations. Conversely, avoiding these issues requires adopting estimators of strictly higher complexity (e.g., neural or full KL-divergence estimators) along with their associated optimization and tuning overhead. We therefore caution practitioners that while slicing-based methods remain viable for efficiency, one must remain aware of these fundamental limitations and trade-offs.

That said, we recognize the value of complementing our findings with positive results. Introducing an alternative method could indeed strengthen our contribution. However, we remain overall skeptical about the feasibility of a new measure that retains SMI's advantages while significantly mitigating its limitations.

We would like to thank Reviewer 4ZU6 once again for further engaging in a discussion with us.

Review (Rating: 5)

The paper critiques Sliced Mutual Information, highlighting its limitations as a scalable measure of statistical dependence. Key contributions include demonstrating that SMI saturates too quickly, prioritizes redundancy over information content, and suffers from the curse of dimensionality, which diminishes its effectiveness in high-dimensional data.

Strengths and Weaknesses

Strengths: The paper provides a comprehensive theoretical analysis of Sliced Mutual Information, addressing its limitations in a structured manner.

Weaknesses: While the paper offers a solid analysis using Gaussian data, its findings may not fully generalize to more complex, highly non-linear datasets.

Questions

  1. The paper primarily examines a specific covariance structure in Lemma 4.1. Are there other covariance structures or data distributions where SMI might perform better?
  2. The paper primarily addresses the population-level properties of SMI and max-sliced MI, but practical applications typically focus on sample-level performance, which may differ. For instance, in Proposition 4.7, when $\rho_i = 0$ for all $i > k$, MI and k-mSMI are equal. However, at the sample level, MI could be more sensitive to dimensionality. In such cases, it is possible that k-mSMI may perform better. Could the authors provide further discussion on this aspect?

Limitations

Yes

Final Justification

I keep my score since the paper adequately addresses the limitations of Sliced Mutual Information.

Formatting Concerns

None

Author Response

Dear Reviewer GJD2,

We sincerely thank you for the thoughtful questions you posed, which prompted us to clarify key points throughout.

Weaknesses:

  1. It is true that we focus on simple datasets, as they suffice to demonstrate the true scale of SMI's shortcomings.

    However, motivated by your feedback, we conducted four additional experiments using complex high-dimensional data from [a]. This work proposes two distributions of synthetic images: one with high redundancy (images of Gaussian blobs, where each pair of pixels contains just enough information to reconstruct the latent variables used to generate the image) and one with low redundancy (images of rectangles, where only border pixels contain the information required to reconstruct the latent variables). For further details, please refer to the original article. We are deeply sorry that we are not able to provide pictures illustrating samples from these datasets due to the restrictions imposed by AC.

    Using this data, we conducted experiments by varying the ground truth MI (divided by the latent dimensionality) from 0 to 5 in steps of 1 and report mean SMI values. The rest of the experimental setup matches Section 5. The results are presented in the table below.

    | MI / $d_\text{latent}$ | 0 | 1 | 2 | 3 | 4 | 5 |
    |---|---|---|---|---|---|---|
    | Gaussian images, $16 \times 16$ | 0.01 | 1.39 | 2.86 | 4.30 | 5.42 | 5.97 |
    | Gaussian images, $32 \times 32$ | 0.01 | 1.39 | 2.86 | 4.30 | 5.42 | 5.98 |
    | Rectangles, $16 \times 16$ | 0.08 | 0.81 | 1.40 | 1.68 | 1.74 | 1.75 |
    | Rectangles, $32 \times 32$ | 0.16 | 0.46 | 0.75 | 0.92 | 0.97 | 0.97 |

    Note that in the high-redundancy case, SMI saturation only becomes noticeable around 5 nats. In contrast, for the rectangle images, saturation occurs much earlier (starting at 2 nats and reaching a plateau at 3-4 nats) and becomes even more pronounced as dimensionality increases.

    We will extend these results to higher MI values and higher projection dimensionalities and include them as an additional plot in our paper.

Questions:

  1. When the components of the random vector exhibit high redundancy, i.e., when the "effective" dimension is substantially smaller than the ambient dimension, $k$-SMI narrows the gap to the true MI. This is captured in Corollary 4.5, where we analyze a covariance matrix distinct from that of Lemma 4.1. In that setting, $k$-SMI coincides with MI; however, in such cases the merits of SMI are eliminated completely, as it is essentially a low-dimensional regime.

  2. We thank the reviewer for highlighting this important point. In fact, when all dependence lies within a $k$-dimensional subspace, there is an implicit assumption that the data are intrinsically low-dimensional from an information-theoretic standpoint. This prior knowledge induces a bias in favor of methods that explicitly exploit low-dimensional structure. Consequently, $k$-mSMI, by projecting onto the most informative $k$ directions, achieves substantially lower variance and improved sample efficiency than a generic MI estimator operating in the ambient space. In finite-sample settings, this implicit bias translates into more accurate and stable estimates of dependence, making $k$-mSMI preferable whenever the true information content concentrates within a lower-dimensional manifold. We emphasize, however, that this advantage is not unique to $k$-mSMI but arises for any estimator that leverages the same low-dimensional inductive bias (e.g., [a, b]).

Thank you again for these perceptive inquiries. We hope our revisions address them fully, and we remain ready to address any additional questions you might have.

[a] Butakov I. et al. "Information Bottleneck Analysis of Deep Neural Networks via Lossy Compression." Proc. of ICLR 2024

[b] Gowri G. et al. "Approximating mutual information of high-dimensional variables using learned representations." Proc. of NeurIPS 2024

Comment

Thank you for the detailed and thoughtful response. I found the clarifications and additional experiments helpful and satisfactory. I will keep my positive score.

Review (Rating: 5)

This paper analyses the disadvantages of SMI, both theoretically and experimentally. SMI has premature saturation, redundancy bias, and is cursed by dimensionality. These insights might help clarify the collapse behavior of SMI in applications.

Strengths and Weaknesses

Strengths: This paper investigates the disadvantages of SMI and yields insightful conclusions. The counterexamples illustrating the collapsing behavior of SMI are commendable.

Weaknesses:

  1. Line 84: the expectation notation $\mathbb{E}$ should have a subscript indicating $\mathbb{P}_{X,Y}$ for the sake of clarity.
  2. Line 92: it is unclear what $W$, $\theta$, and $A$ are, and where they are used. It is better to give definitions and some explanations here. It seems that $W$ and $A$ are only elaborated in the Appendix.
  3. The letter $d$ is used a lot in your equations, with different meanings (dimensionality, Radon-Nikodym derivatives). Maybe it is better to represent the dimensionality with a different letter to avoid confusion.

Questions

  1. In the abstract, the authors mentioned that SMI can perform worse than correlation coefficients. However, it is not obvious which example illustrates this point. I haven't found examples comparing SMI and correlation coefficients directly in your paper. It seems that the curse-of-dimensionality issue of SMI can make it worse than correlation coefficients. But is this your example? If so, where is your figure/plot? Or is it just a theoretical example?
  2. Line 68: the authors claim that one of the contributions of the paper is to reestablish the trade-off. The problems of SMI can be solved by non-uniform/non-random slicing. However, I didn't find such content in your paper at all. Did I miss anything?
  3. Line 148: the authors claim that SMI can be calculated analytically for a pair of jointly Gaussian random vectors. However, in the original paper of SMI, in its Example 1, it shows that SMI can only have a maximum value in such a scenario. The authors of this paper derived the analytical solution in Appendix B. However, I think there is a gap in the proof. How does the Jacobi ensemble in Lemma A.4 affect the distribution of $|\theta^T \phi|^2$ in Corollary A.5? Without a more rigorous explanation, I cannot fully trust the proof.
  4. In line 194, the authors claim that SMI has a strong bias towards information redundancy. However, this claim is not obvious without referring to the Appendix. How do Proposition 4.4 and Corollary 4.5 highlight this bias? Is it because SMI increases while MI decreases? Please explain more.

Limitations

yes

Final Justification

All my questions have been resolved by the rebuttal. Since the conclusion of this paper can have a high impact on the field, I have increased my rating to 5.

Formatting Concerns

N/A

Author Response

Dear Reviewer uLwu,

Thank you sincerely for reviewing our article! In the following text, we provide responses to your concerns. We hope that all of them are addressed properly.

Weaknesses:

  1. We will add the subscript $\mathbb{E}_{\mathbb{P}_{X, Y}}$ to the expectation to make explicit the underlying joint distribution.

  2. We denote by $\mathrm{W}$ an orthogonal matrix uniformly distributed according to the Haar measure on the orthogonal group $\mathrm{O}(d)$, by $\theta$ a unit vector uniformly distributed on the unit sphere $\mathbb{S}^{d-1}$, and by $\mathrm{A}$ a matrix with orthonormal columns sampled uniformly from the Stiefel manifold $\mathrm{St}(k, d)$. We will state them explicitly in both the main text and the appendix for clarity.

    In practice, we generate $\mathrm{W}$ by sampling a $d \times d$ matrix with i.i.d. standard Gaussian entries, performing a QR decomposition, and taking the orthonormal $\mathrm{Q}$ factor (this is also shown in the proof of Lemma A.4). To obtain $\theta$, we sample a $d$-dimensional vector with i.i.d. standard Gaussian entries and normalize it to have unit length. For $\mathrm{A}$, we draw a $d \times k$ matrix with i.i.d. Gaussian entries, perform a QR decomposition, and use the first $k$ columns of the resulting orthonormal $\mathrm{Q}$ matrix (see the short sketch after this list).

  3. To distinguish the two uses of "d", we have typeset the dimensionality $d$ in italic and the differential operator $\mathrm{d}$ in roman; nonetheless, we will consider replacing the dimensionality symbol with an alternative letter in the revised manuscript.
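
Below is an illustrative sketch (ours, not the paper's code) of the sampling procedures described in point 2: Gaussian sampling followed by normalization or a QR decomposition. The sign correction by $\mathrm{sign}(r_{ii})$ is an extra detail, beyond the description above, that makes the QR-based construction exactly Haar-uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 3

# theta ~ Uniform(S^{d-1}): normalize a standard Gaussian vector.
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)

# W ~ Haar measure on O(d): QR decomposition of a d x d Gaussian matrix.
g = rng.standard_normal((d, d))
q, r = np.linalg.qr(g)
w = q * np.sign(np.diag(r))  # fix column signs for exact Haar uniformity

# A ~ uniform on the Stiefel manifold St(k, d): reduced QR of a d x k Gaussian matrix.
g = rng.standard_normal((d, k))
q, r = np.linalg.qr(g)  # q has shape (d, k)
a = q * np.sign(np.diag(r))

print(np.allclose(w.T @ w, np.eye(d)), np.allclose(a.T @ a, np.eye(k)))
```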

Questions:

  1. Thank you for pointing this out. This claim is in fact a direct consequence of Lemma 4.1. There we consider two jointly Gaussian $d$-dimensional vectors $(X, Y)$ with (component-wise) correlation coefficient $\rho \in (-1, 1)$. We prove that, for any fixed $\rho$, $\lim_{d \rightarrow \infty} \mathrm{SMI}(X; Y) = 0$, whereas the standard linear correlation $|\rho|$ remains constant. Hence, in high dimensions SMI reports vanishing dependence even though a nonzero correlation persists. We will make this remark explicit in the manuscript.

  2. In the “Suboptimality of Random Slicing” paragraph (Section 3) and again in Section 4.1, we discuss max‑SMI and optimal SMI, both of which employ non‑uniform slicing via an inner maximization over projections. While these methods mitigate some saturation effects of random SMI, they continue to suffer from redundancy bias and incur a substantial increase in computational cost by requiring an auxiliary neural network. This contrast highlights our core conclusion: there exists an inherent trade‑off between estimator complexity and the capacity to capture rich, high‑dimensional dependencies.

  3. In Example 1 of the original SMI paper, the authors consider two Gaussian vectors with an arbitrary covariance matrix, which prevents derivation of a closed‑form SMI. In that setting one can only upper‑bound the integral by maximizing over all one‑dimensional projections, yielding a simpler expression in terms of the canonical correlation coefficient (CCA). However, this bound does not coincide with the exact SMI and is not tightly related to CCA; the asserted tight connection to CCA in the original work is unfounded, as we demonstrate. To obtain a tractable analytic result while still illustrating peculiar properties of the SMI, we therefore restrict ourselves to a simpler block‑diagonal covariance structure.

    To clarify the connection between the Jacobi ensemble (Lemma A.4) and the distribution of $|\theta^T \phi|^2$, where $\theta, \varphi$ are independent and uniform on the sphere $\mathbb{S}^{d-1}$, first observe that this random variable (as well as $(1 + \theta^T \phi) / 2$) arises in the proof of Lemma 4.1. Corollary A.5 incorporates the result for the distribution of this inner product under two transformations. We outline two possible ways to justify the claimed distribution and thus establish its relation to the Jacobi ensemble.

    Lemma A.4 shows that, for a uniformly distributed orthogonal matrix $\mathrm{W}$, the product $\mathrm{W}_{11} \mathrm{W}_{11}^T$, where $\mathrm{W}_{11}$ is the upper-left $k \times k$ block of $\mathrm{W}$, follows a Jacobi ensemble with parameters $a = 1/2$ and $b = (d-1)/2$. Concretely, setting $k = 1$ in Lemma A.4 implies that the first column $\mathrm{w}_1$ of $\mathrm{W}$ satisfies $\mathrm{w}_1 \sim \mu_{\mathbb{S}^{d-1}}$, and establishes that $\mathrm{w}_{11}^2$ has density proportional to $\lambda^{-1/2} (1 - \lambda)^{(d-1)/2 - 1}$, i.e., $\mathrm{w}_{11}^2 \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)$.

    Since $\mathrm{w}_1$ and $\varphi$ are i.i.d., it follows that $\mathrm{w}_1^T \varphi \overset{d}{=} \mathrm{w}_1^T e_1 = \mathrm{w}_{11}$ in distribution, so we recover that $|\theta^T \varphi|^2 \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)$.

    An equivalent derivation starts with the distribution of the inner product $u = \theta^T \varphi$, whose density is $p(u) = (1 - u^2)^{(d-3)/2} / \mathrm{B}\left(\frac{d-1}{2}, \frac{1}{2}\right)$ for $u \in [-1, 1]$, and then applies the change of variables $v = u^2$, recovering the same result (a short numerical check of this distributional claim is sketched after this list).

    We will incorporate a more detailed explanation in the proof.

  4. By "redundancy" we mean the replication of information across multiple axes. More formally, redundancy corresponds to low differential entropy. As the covariance matrix becomes increasingly ill-conditioned, the differential entropy diverges to $-\infty$, indicating a growth in redundant information. Proposition 4.4 shows that embedding any two variables into a higher-dimensional space via a full-column-rank linear map leaves $k$-SMI exactly equal to the original MI. Corollary 4.5 then considers the extreme case of a rank-one Gaussian covariance, i.e., all coordinates carry identical information. It demonstrates that $k$-SMI remains at its maximal value $-\frac{1}{2} \log(1 - \rho^2)$, whereas classical MI strictly decreases when information is duplicated. Consequently, unlike MI, which penalizes repeated information, SMI "rewards" redundancy by preserving or boosting its value.

    Our experiments in Section 6.2 also confirm this bias, showing that learning representations by maximizing SMI produces collapsed, highly redundant embeddings rather than decorrelated features.
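
As mentioned in point 3, the distributional claim $|\theta^T \varphi|^2 \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)$ is easy to check empirically. A quick sketch (ours, not from the paper), assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

d, n = 10, 100_000
rng = np.random.default_rng(0)
theta = rng.standard_normal((n, d))
phi = rng.standard_normal((n, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)
phi /= np.linalg.norm(phi, axis=1, keepdims=True)
u_sq = np.sum(theta * phi, axis=1) ** 2  # samples of |theta^T phi|^2

# Kolmogorov-Smirnov test against Beta(1/2, (d-1)/2); a large p-value is consistent with the claim.
print(stats.kstest(u_sq, stats.beta(0.5, (d - 1) / 2).cdf))
```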

The clarifications will be added to the upcoming revision of our manuscript. We hope that you will find our answer satisfactory. If there are some parts of our response which require further clarification, please, let us know.

We again sincerely thank Reviewer uLwu for carefully reading our article and pointing out the parts of our theoretical framework which require additional explanation.

Comment

I would like to thank the authors for the response. Most of my questions have been answered. I have an additional question regarding Question 3. Since you mentioned Lemma 4.1, I notice that there is no derivation for the analytical SI calculated on line 148. It is just given. It will be helpful for the readers to know how this analytical solution is obtained. This derivation can possibly be included in the appendix. If you can explain in the response now, that will also be great.

Comment

Dear Reviewer uLwu,

We are glad that we were able to address most of your concerns. Regarding the remaining question, we suspect there may have been a typo in your reply.

As you noted in your review, we derive the analytic expression for SI in Appendix B (specifically, lines 547-559). However, if you were referring to the expression for the plain mutual information $\mathrm{I}(X; Y)$, we derive this in Appendix A, Lemma A.1 and Corollary A.2, where we cite [1] for the Gaussian vector entropy formula. These results are later referenced in the proof of Lemma 4.1 (lines 547-548). We will make this connection clearer in the main text to avoid any confusion.

Below, we also provide an extended proof of this result, which also includes an explicit derivation of $\mathsf{h}(\mathcal{N}(\mu, \Sigma))$. If there are parts of this extended derivation which you think should be included in the text, please let us know.

Step 1 (general $n$-dimensional Gaussian case):

$$\mathsf{h}(\mathcal{N}(\mu, \Sigma)) = \mathsf{h}(\mathcal{N}(0, \Sigma)) = \frac{1}{2} \ln \det \Sigma + \mathsf{h}(\mathcal{N}(0, I)) = \frac{1}{2} \ln\left((2 \pi e)^n \det \Sigma\right),$$

where the first transition is due to the entropy's invariance to shifts, and the second is due to the change of variables under a linear transform (recall that $\Sigma^{-1/2} \cdot \mathcal{N}(0, \Sigma) = \mathcal{N}(0, I)$). We also used

$$\mathsf{h}(\mathcal{N}(0, I)) = \sum_{i=1}^n \mathsf{h}(\mathcal{N}(0, 1)) = -n \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \ln\left(\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\right) \mathrm{d}x = \frac{n}{2} \ln(2 \pi e).$$

Step 2 (mutual information in the general Gaussian case): for $(X, Y)$ with non-singular $\Sigma$,

$$\mathsf{I}(X; Y) = \mathsf{h}(X) + \mathsf{h}(Y) - \mathsf{h}(X, Y) = \frac{1}{2} \ln \det \Sigma_X + \frac{1}{2} \ln \det \Sigma_Y - \frac{1}{2} \ln \det \Sigma,$$

where $\Sigma_X$, $\Sigma_Y$ are the marginal covariances.

Step 3 (mutual information in the general Gaussian case in terms of singular values): under the assumptions from the previous step, let $\Sigma_{XY}$ be the cross-covariance matrix. Thanks to the invariance property of MI, we can write:

$$\mathsf{I}(X; Y) = \mathsf{I}(\Sigma_X^{-1/2} X, \Sigma_Y^{-1/2} Y) = \mathsf{I}(\mathrm{U}^T \Sigma_X^{-1/2} X, \mathrm{V}^T \Sigma_Y^{-1/2} Y),$$

where $\mathrm{U}\, \mathrm{diag}(\rho_i)\, \mathrm{V}^T$ is the SVD of $\Sigma_X^{-1/2} \Sigma_{XY} \Sigma_Y^{-1/2}$. However,

$$\left(\mathrm{U}^T \Sigma_X^{-1/2} X,\ \mathrm{V}^T \Sigma_Y^{-1/2} Y\right) \sim \mathcal{N}\left(\mu', \begin{bmatrix} I & \mathrm{diag}(\rho_i) \\ \mathrm{diag}(\rho_i) & I \end{bmatrix}\right).$$

In the case of Lemma 4.1, the Gaussian vector is already of this form. Setting $\rho_i = \rho$ and computing the determinants yields the desired expression.
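
As a small numerical sanity check (ours, not part of the paper), one can verify that the log-determinant expression from Step 2 agrees with the singular-value form $-\frac{1}{2}\sum_i \ln(1 - \rho_i^2)$ implied by Step 3, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
dx, dy = 3, 4
c = rng.standard_normal((dx + dy, dx + dy))
sigma = c @ c.T + np.eye(dx + dy)  # a non-singular joint covariance of (X, Y)
sigma_x, sigma_y = sigma[:dx, :dx], sigma[dx:, dx:]
sigma_xy = sigma[:dx, dx:]

# Step 2: I(X;Y) = 1/2 (ln det Sigma_X + ln det Sigma_Y - ln det Sigma)
mi_logdet = 0.5 * (np.linalg.slogdet(sigma_x)[1] + np.linalg.slogdet(sigma_y)[1]
                   - np.linalg.slogdet(sigma)[1])

# Step 3: I(X;Y) = -1/2 sum_i ln(1 - rho_i^2), where rho_i are the singular
# values of Sigma_X^{-1/2} Sigma_XY Sigma_Y^{-1/2}.
m = np.linalg.inv(np.real(sqrtm(sigma_x))) @ sigma_xy @ np.linalg.inv(np.real(sqrtm(sigma_y)))
rho = np.linalg.svd(m, compute_uv=False)
mi_svd = -0.5 * np.sum(np.log(1.0 - rho ** 2))

print(mi_logdet, mi_svd)  # the two values agree up to numerical error
```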

Please note that we only omitted Step 1 in our text, instead referring to [1]. If we have misinterpreted your question, please let us know.

We would like to thank Reviewer uLwu once again for their time engaging in a discussion with us.

[1] Y. Polyanskiy and Y. Wu, Information Theory: From Coding to Learning. Cambridge University Press, 2024

Comment

I would like to thank the authors for the additional explanation. Due to the importance of the conclusion in this paper to the field, I will increase my rating to 5.

Review (Rating: 4)

Summary

This paper investigates the shortcomings of Sliced Mutual Information (SMI) as a tool for measuring statistical dependence in high-dimensional settings. The authors first derive theoretical results showing that vanilla 1-D SMI can saturate too early, bias toward redundant projections, and even decrease asymptotically as dimension grows. They then validate these findings with extensive synthetic experiments and representation learning case studies, demonstrating that optimizing SMI can lead to collapsed features and misleading dependence estimates. The work highlights that, despite its scalability, SMI may understate or misorder variable dependencies compared to true Mutual Information (MI).

Strengths and Weaknesses


Strengths

  1. Novel and Practically Significant
    Sliced Mutual Information (SMI) is increasingly used as a scalable dependence measure in high-dimensional settings. This paper is the first to systematically identify and analyze key failure modes of SMI—such as premature saturation and redundancy bias—thus filling an important gap in the literature.

  2. Comprehensive Theory and Experiments
    The authors present rigorous mathematical proofs of SMI’s limitations and validate them through extensive synthetic experiments across various distributions and dimensions. The writing is clear and the figures/tables illustrate the arguments effectively.


Weaknesses

  1. Major:
    Although variants like max-SMI, optimal SMI, and k-SMI are mentioned, both the theoretical analysis and experimental work focus almost exclusively on 1D SMI. For instance, k-SMI can preserve much more information from the original distribution and often outperforms 1D SMI in practice. A full comparison with these variants is needed.

  2. Minor:
    Most experiments are conducted on synthetic data. While Figure 3 shows that SMI values can deviate substantially from true MI on real datasets, SMI still often captures the correct correlation ordering between variables. In real applications, preserving the correct ranking of dependencies can be more important than recovering the exact MI value.

Questions


Questions

  • Invariance to Monotonic Transformations: SMI lacks the monotonic‐transformation invariance of true MI. Did you experiment with any preprocessing “tricks” (e.g., normalizing each dimension before slicing) to mitigate this issue?

Limitations

None

Formatting Concerns

None

Author Response

Dear Reviewer Ly5f,

Thank you for your valuable feedback and insightful comments. We have carefully addressed your concerns.

Weaknesses:

  1. In fact, we have already extended our empirical analysis to k-SMI: Figure 6 (2‑SMI) and Figure 7 (3‑SMI) in the main manuscript (line 236) demonstrate that SMI continues to exhibit rapid saturation as the slice dimension increases. Although the closed‑form expression for k‑SMI is intractable, Proposition 4.2 provides its integral representation, from which one can infer the same saturation behavior we observe empirically. Moreover, max-SMI remains subject to the same redundancy bias we identify for SMI, while optimal SMI interpolates between max-SMI and vanilla SMI, and thus inevitably inherits both their advantages and limitations.

    We deliberately center our attention on SMI and k‑SMI, as these variants are most pervasive in practice due to their simple implementation (no auxiliary optimization is required). By contrast, max‑SMI and optimal SMI demand using auxiliary NNs, placing them in a higher complexity class.

  2. It is indeed true that when comparing MI and SMI, large differences between these two quantities are not of significant interest, and one should focus on SMI's ability to capture the correct ranking of dependencies. We explicitly state this on lines 220–223:

    SMI is a distinct measure of statistical dependence, and should not be viewed as an approximation of MI. Instead, our analysis focuses on the relationship between the two measures: since MI captures the true degree of statistical dependence, opposing trends in MI and SMI reveal problems with the latter quantity.

    That being said, our analysis shows that SMI is unable to provide a correct ranking of dependencies. Consider the two examples from Lemma 4.1 and Corollary 4.5. In the first case, MI is $d$ times higher than in the second case, while SMI is, in contrast, strictly smaller. Therefore, SMI incorrectly attributes stronger dependency to the case where, in fact, the dependency is weaker.

Questions:

  1. While we did not apply explicit normalization or other preprocessing "tricks" in our current experiments, we do discuss the impact of input scaling on SMI in Section 6.1 (see Table 1), where we contrast the sensitivity of SMI versus MI under two different normalization schemes. Our experiments reveal that SMI is excessively sensitive to the choice of normalization, suggesting that no single preprocessing method can be relied upon, since it can substantially change the measured dependence.

We hope our response addresses your comments clearly, and we are happy to provide further clarification if needed.

Thank you again for your thoughtful review and helpful insights.

Comment

We appreciate your time reviewing our work. As the rebuttal period is concluding, we note that we responded to each point and hope these revisions address your concerns.

Final Decision

This paper analyzes Sliced Mutual Information (SMI). Although SMI is widely used as a scalable alternative to MI, the paper demonstrates that it suffers from several shortcomings, often exhibiting counterintuitive behavior and failing to accurately capture statistical dependence.

The work provides a novel and valuable finding for the community, which is the main factor in my decision. The properties of SMI are carefully analyzed from a theoretical perspective, supported by simple experiments.

One limitation of the paper is the strong assumption in the theoretical analysis, which relies on Gaussian distributions. However, as highlighted in the reviewer–AC discussion, reviewers agreed that this is acceptable, given that the central contribution is to reveal flaws in the original SMI. Another weakness is the lack of a more thorough empirical evaluation, as reviewers also noted. Nonetheless, the authors provided convincing rebuttals, and reviewers felt their concerns were adequately addressed.

Overall, I believe this paper presents an important and fundamental contribution that merits publication in this community. I therefore recommend acceptance.

===

As recently advised by legal counsel, the NeurIPS Foundation is unable to provide services, including the publication of academic articles, involving the technology sector of the Russian Federation’s economy under a sanction order laid out in Executive Order (E.O.) 14024.

Based upon a manual review of institutions, one or more of the authors listed on this paper submission has ties to organizations listed in E.O. 14024. As a result this paper has been identified as falling under this requirement and therefore must not be accepted under E.O. 14024.

This decision may be revisited if all authors on this paper can provide proof that their institutions are not listed under E.O. 14024 to the NeurIPS PC and legal teams before October 2, 2025. Final decisions will be communicated soon after October 2nd. Appeals may be directed to pc2025@neurips.cc.