PaperHub
Overall score: 6.4 / 10 · Spotlight · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 2.5 · Novelty: 2.8 · Quality: 3.0 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

Graph–Smoothed Bayesian Black-Box Shift Estimator and Its Information Geometry

OpenReview · PDF
Submitted: 2025-05-03 · Updated: 2025-10-29

Abstract

Label shift adaptation aims to recover target class priors when the labelled source distribution $P$ and the unlabelled target distribution $Q$ share $P(X \mid Y) = Q(X \mid Y)$ but $P(Y) \neq Q(Y)$. Classical black‑box shift estimators invert an empirical confusion matrix of a frozen classifier, producing a brittle point estimate that ignores sampling noise and similarity among classes. We present Graph‑Smoothed Bayesian BBSE (GS‑B$^3$SE), a fully probabilistic alternative that places Laplacian–Gaussian priors on both target log‑priors and confusion‑matrix columns, tying them together on a label‑similarity graph. The resulting posterior is tractable with HMC or a fast block Newton–CG scheme. We prove identifiability, $N^{-1/2}$ contraction, variance bounds that shrink with the graph’s algebraic connectivity, and robustness to Laplacian misspecification. We also reinterpret GS‑B$^3$SE through information geometry, showing that it generalizes existing shift estimators.
Keywords
label shift, target shift, domain adaptation

Reviews and Discussion

Review (Rating: 4)

This paper proposes the Graph-Smoothed Bayesian Black-Box Shift Estimator (GS-B$^3$SE) for label shift adaptation. Traditional Black-Box Shift Estimation (BBSE) estimates target class priors by inverting a classifier’s confusion matrix but suffers from two limitations: 1) sensitivity to sampling noise: error amplification with limited validation data; 2) ignored semantic relationships: independent per-class estimation without statistical sharing.

The proposed GS-B$^3$SE is a fully probabilistic alternative that places Laplacian–Gaussian priors on both target log-priors and confusion-matrix columns, tying them together on a label-similarity graph. The key idea is to couple both the target log-prior vector and the columns of the confusion matrix through a Gaussian Markov random field defined on a label-similarity graph.

Strengths and Weaknesses

Strengths:

  • Novel formulation: First unified Bayesian framework integrating graph regularization with confusion matrix uncertainty quantification.

  • Theoretical rigor: Complete proofs for identifiability, optimal $N^{-1/2}$ convergence, and variance reduction scaling with algebraic connectivity $\lambda_2(L)$.

  • Multidisciplinary insight: Information geometry analysis (Section 6) reveals how graph priors act as $e$-convex barriers, linking optimization to Riemannian dynamics.

Weaknesses:

  • Computational overhead: HMC requires 4 chains × 1.5k samples (NVIDIA T4) for K=100; wall-clock time for Newton-CG not benchmarked.

  • Underutilized geometric insights: Natural gradient flow (Section 6.1) lacks empirical optimization comparisons versus EM/variational methods.

Questions

I'm curious about the practical scenarios where this "Label Shift" methodology would be most applicable. At its core, the paper addresses robust estimation of target class priors under distribution shift, formalized as a shift in $P(Y)$ while assuming $P(X \mid Y)$ remains constant. This raises three key questions:

  • Applicability Conditions: When is the label shift framework the appropriate modeling choice compared to other distribution shift paradigms? What diagnostic tests exist to validate its core assumption in real-world applications?

  • Causal Implications: Does the formulation $P(X \mid Y) = \text{constant}$ implicitly assume $Y \to X$ causation? If so, how does this align with domains where features $X$ may causally influence labels $Y$?

  • Practicality Concerns: The proposed solution appears computationally intensive relative to the problem's conceptual framework. I wonder whether the complexity is justifiable for deployment scenarios where predictions derive from fixed embeddings. (As a non-expert, I acknowledge this perspective may benefit from domain-specific context.)

Limitations

Yes

Formatting Issues

No formatting issues.

Author Response

We would like to thank you for your very constructive comments.

When is the label-shift model appropriate?

We first summarize the major distribution shift settings as follows.

| shift family    | factorization                              | invariance assumption             |
|-----------------|--------------------------------------------|-----------------------------------|
| covariate shift | $P_{te}(X) \neq P_{tr}(X)$                 | $P(Y \mid X)$ unchanged           |
| concept shift   | $P_{te}(Y \mid X) \neq P_{tr}(Y \mid X)$   | none                              |
| label shift     | $P_{te}(X \mid Y) = P_{tr}(X \mid Y)$      | class-conditional features stable |

The last row (our focus) includes

  • medical testing (pathogen prevalence rises or falls, but radiograph physics stays the same),
  • recommender or ad-click systems (item popularity drifts while the encoding of text or images stays fixed),
  • traffic scenes across seasons or geographies.

These settings share the property that collecting a handful of new labels is expensive, whereas unlabeled target images or clicks are plentiful.

A principled diagnostic test for label shift

Define

  • $\hat{r} = \frac{1}{n'} \times$ (histogram of target predictions),
  • $\hat{C}$ (empirical confusion matrix on the validation split),
  • $\hat{p}$ (empirical source prior).

Under the null $H_0 : q = p$ (no label shift),

$$\sqrt{n'}(\hat{r} - \hat{C}\hat{p}) \overset{d}{\to} \mathcal{N}\bigl(0,\; \hat{C}\,\mathrm{diag}(\hat{p})\,\hat{C}^\top\bigr).$$

Hence the likelihood-ratio ($\chi^2$) statistic:

$$\chi^2_{\text{BBSE}} = n' (\hat{r} - \hat{C}\hat{p})^\top \bigl[\hat{C}\,\mathrm{diag}(\hat{p})\,\hat{C}^\top\bigr]^{-1}(\hat{r} - \hat{C}\hat{p}) \overset{H_0}{\to} \chi^2_{K-1}.$$

Because $\hat{p}$ enters only through the plug-in diagonal and the null does not constrain it, the usual Bartlett correction is unnecessary; asymptotically we retain an ordinary $\chi^2_{K-1}$.

If the $p$-value is large, there is no statistical evidence for label shift, and no correction is needed.

If the null is rejected, we proceed to the label-shift correction approach; the test is free because all three inputs $\hat{r}$, $\hat{C}$, $\hat{p}$ are already computed.
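
For concreteness, here is a minimal NumPy sketch of this diagnostic; the function name and inputs are illustrative, not taken from our implementation.

```python
# Minimal sketch of the BBSE chi-square diagnostic, assuming r_hat, C_hat and
# p_hat are NumPy arrays computed from the validation split and the unlabeled
# target predictions. Not the paper's code.
import numpy as np
from scipy.stats import chi2

def bbse_label_shift_test(r_hat, C_hat, p_hat, n_target):
    """Return (chi2 statistic, p-value) for H0: q = p (no label shift)."""
    K = len(p_hat)
    resid = r_hat - C_hat @ p_hat                 # \hat{r} - \hat{C}\hat{p}
    Sigma = C_hat @ np.diag(p_hat) @ C_hat.T      # asymptotic covariance
    # Sigma has rank K-1 on the simplex tangent space, so use a pseudo-inverse.
    stat = n_target * resid @ np.linalg.pinv(Sigma) @ resid
    return stat, chi2.sf(stat, df=K - 1)
```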

Power of the diagnostic

Let $\delta = q - p$ be the true prior change.

Under the alternative,

$$\chi^2_{\text{BBSE}} \overset{d}{\to} \chi^2_{K-1}(\lambda), \quad \lambda = n'\,\delta^\top C^\top \bigl[C\,\mathrm{diag}(p)\,C^\top\bigr]^{-1} C\,\delta > 0,$$

i.e., a $\chi^2$ distribution with non-centrality parameter $\lambda$.

The test therefore has asymptotic power

$$\beta(\alpha) = \Pr\bigl[\chi^2_{K-1}(\lambda) > \chi^2_{K-1,\,1-\alpha}\bigr] \overset{n' \to \infty}{\to} 1,$$

whenever $q \neq p$, so even moderate sample sizes suffice to detect practically relevant prior changes.
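
A hedged sketch of this power calculation, using SciPy's non-central chi-square distribution; delta, C, p and n_target are illustrative inputs.

```python
# Asymptotic power of the BBSE test under the alternative q != p, following
# the non-centrality parameter lambda defined above. Illustrative only.
import numpy as np
from scipy.stats import chi2, ncx2

def bbse_test_power(delta, C, p, n_target, alpha=0.05):
    K = len(p)
    Sigma_pinv = np.linalg.pinv(C @ np.diag(p) @ C.T)
    lam = n_target * delta @ C.T @ Sigma_pinv @ C @ delta   # non-centrality
    crit = chi2.ppf(1 - alpha, df=K - 1)                    # rejection threshold
    return ncx2.sf(crit, df=K - 1, nc=lam)                  # asymptotic power
```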

Does label shift require the causal arrow $Y \to X$?

The assumption $P_{te}(X \mid Y) = P_{tr}(X \mid Y)$ is a statement about invariance of a conditional distribution, not about causal direction. It can arise in two distinct ways.

  • Causal: $Y \to X$ and the data-generation mechanism $f\colon Y, \epsilon \mapsto X$ is unchanged; only $\Pr(Y)$ is perturbed (an intervention on the root node $Y$ in Pearl’s SCM sense).
  • Anticausal, selection: $X \to Y$ and the labeling policy $\Pr(Y \mid X)$ is unchanged, but we sub-sample on $Y$.

Hence, label shift is valid whenever i) the mapping $Y \mapsto X$ (causal case) or the annotation policy $X \mapsto Y$ (anticausal case) is stable across domains, and ii) only the base rate $\Pr(Y)$ drifts.

Either arrow therefore fits; the key is mechanism invariance, not direction.

Let an SCM be

$$Y \coloneqq g(U), \quad X \coloneqq f(Y, V),$$

with exogenous $U, V$. A shift that changes $\Pr(U)$ (but leaves $f$ fixed) yields exactly label shift.

Conversely, in an anticausal model

$$X \coloneqq h(W), \quad Y \coloneqq \mathrm{disc}(X) + \xi,$$

we can obtain label shift by re-weighting samples according to $Y$ after data collection; again $P(X \mid Y)$ is preserved.

How this perspective guides practice

  • If you can argue mechanism stability, e.g., “the radiography device and organ physiology are unchanged while disease prevalence drifts” or “the feature extractor is frozen”, then label shift is defensible regardless of causal direction.
  • If mechanism stability is doubtful, run the $\chi^2$ diagnostic and inspect the mis-fit. Large residuals against both label shift and covariate shift tests indicate a more complex general shift.

The robustness bound in our manuscript is therefore agnostic to causal direction: only the invariance of $P(X \mid Y)$ enters, through $F_0$.

Computational cost

We measure work in floating-point operations (flops) and write $K$ = number of classes and $|E|$ = number of graph edges.

For the $k$-nearest neighbor graphs we use, $|E| = kK$ with a small, $K$-independent constant $k$ (4 for CIFAR-10, 8 for CIFAR-100).

  • Vanilla BBSE: The point estimator solves a single linear system $\hat{q} = \tilde{C}^{-1}\tilde{r}$, where $\tilde{C} \in \mathbb{R}^{K \times K}$ is dense once any amount of Laplace or Tikhonov regularisation is applied. Gaussian elimination therefore costs $\mathcal{O}(K^3)$ flops and $\mathcal{O}(K^2)$ memory.
  • GS-B$^3$SE (one Newton–CG sweep): Write $\Phi = (\phi_1,\dots,\phi_K)^\top \in \mathbb{R}^{K\times (K-1)}$ and $\theta \in \mathbb{R}^{K-1}$, both constrained to the centered subspace $\{v : v^\top \mathbf{1} = 0\}$. At each sweep we perform three algebraic operations:
    • update each column $\Phi \leftarrow \Phi - \mathcal{H}_{\Phi}^{-1}\nabla\mathcal{L}$,
    • update $\theta$ analogously, and
    • Gamma shape/scale updates for $\tau_q$, $\tau_C$.

Adding them and using $|E| = kK$, one sweep costs $\mathcal{O}(|E|K + K^2) = \mathcal{O}(kK^2)$, and memory usage is $\mathcal{O}(|E| + K^2)$.

Overall complexity.

Thm. 2 establishes that $F(q)$ is $\alpha$-strongly geodesically convex with $\alpha = n'\lambda_{\min}(F_0) + \tau_q \lambda_2(L) > 0$. Consequently a back-tracking Newton method requires $T = \mathcal{O}(\ln 1/\epsilon)$ sweeps to reach tolerance $\epsilon$.

Hence the total complexity is

$$\mathcal{O}(kK^2 \ln 1/\epsilon) \quad \text{vs.} \quad \mathcal{O}(K^3)\ (\text{vanilla BBSE}).$$

Because $k$ is at most 8 and $\ln 1/\epsilon \le 10$ in all our runs, the crossover happens already at $K \ge 25$.

| operation             | cost on $k$-NN graph                | remarks |
|-----------------------|-------------------------------------|---------|
| vanilla BBSE          | $O(K^3)$ once                       | dense LU factorisation of the $K \times K$ confusion matrix |
| Newton–CG step (ours) | $O(|E|) + O(K^2)$                   | gradient is linear in $|E|$; a Hessian-vector product needs one sparse Laplacian matvec $O(|E|)$ plus one dense Jacobian matvec $O(K^2)$; conjugate gradient converges in $O(\sqrt{\kappa})$ iterations, empirically $\leq 5$ |
| HMC leap-frog         | same $O(|E|) + O(K^2)$ per gradient | trajectories of length 10–20 give effective sample size $> 200$ for every $q_i$ |

Therefore the dominant term grows only quadratically in the class count $K$ (the sparse Laplacian never densifies), whereas BBSE’s one-off inverse is cubic.
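
To illustrate the per-step cost structure in the table (not our exact implementation), here is a sketch of a single Newton direction in which the Hessian-vector product combines one sparse Laplacian matvec with one dense Fisher-block matvec and is handed to conjugate gradients; F0, L_sparse, tau_q and n_target are placeholder names.

```python
# Sketch of the O(|E|) + O(K^2) Hessian-vector product fed to CG.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_cg_direction(grad, F0, L_sparse, tau_q, n_target, cg_iters=20):
    K = grad.shape[0]

    def hess_vec(v):
        # dense O(K^2) Fisher term + sparse O(|E|) Laplacian term
        return n_target * (F0 @ v) + tau_q * (L_sparse @ v)

    H = LinearOperator((K, K), matvec=hess_vec)
    direction, _ = cg(H, -grad, maxiter=cg_iters)   # few CG iterations in practice
    return direction
```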

Next, we report the wall-clock runtimes (single NVIDIA T4, PyMC 5.10, float32).

| dataset   | $K$ | BBSE   | MLLS   | ours   |
|-----------|-----|--------|--------|--------|
| MNIST     | 10  | 0.80 s | 1.12 s | 12.2 s |
| CIFAR-100 | 100 | 2.88 s | 5.37 s | 54.5 s |

As you suggested, we believe these results help users weigh the accuracy–complexity trade-off.

All times are under 1 min on a single T4; in practice the wall-clock is dominated by the CNN forward pass when $n' \approx 10\text{k}$.

Comment

Thank you very much for your detailed response. While it has addressed some of my questions, I feel there remains a degree of ambiguity that I hope to clarify further.

Regarding the answer to "A principled diagnostic test for label shift," it demonstrates how to test for shifts in $P(Y)$. However, my original concern centers on how to distinguish between "label shift" and other distribution shift scenarios.

Please allow me to elaborate on this concern from a practical perspective. In practice, model updates are necessitated when predictive performance deteriorates (via metrics such as AUC/GAUC or business indicators like click-through rates). In my understanding, if the performance decay is caused solely by "covariate shift" while $P(Y \mid X)$ remains intact, the model does not require updating. Conversely, when $P(Y \mid X)$ changes, it becomes critical to determine whether the underlying cause is "label shift" or "concept shift."

Comment

Our work assumes the label-shift model is a reasonable abstraction and focuses on how to estimate $q$ robustly once $P(Y \mid X)$ is plausibly stable. Determining the dominant shift mechanism in a live system is an orthogonal problem (out of scope); still, we can offer a lightweight triage that uses quantities already computed by BBSE/GS-B³SE plus a small labeled audit set if available.

Recall the inputs we already have (unlabeled target):

  • $\hat{r} = \tilde{n} / n'$ (target prediction histogram),
  • $\hat{C}$ (source confusion, from validation),
  • $\hat{p}$ (source prior).

Use the BBSE $\chi^2$ test (derived in our rebuttal):

$$\chi^2_{\text{BBSE}} = n'(\hat{r} - \hat{C}\hat{p})^\top \bigl[\hat{C}\,\mathrm{diag}(\hat{p})\,\hat{C}^\top\bigr]^{-1}(\hat{r} - \hat{C}\hat{p}) \overset{H_0: q=p}{\sim} \chi^2_{K-1}.$$

A large p-value indicates no evidence of label-prior drift, so label-shift correction is unnecessary; with a small p-value we proceed to the next step.

Compute $\hat{q}$ with BBSE/GS-B³SE, then the residual $\epsilon = \hat{r} - \hat{C}\hat{q}$. Under exact label shift $\mathbb{E}[\epsilon] = 0$ and, asymptotically,

$$n'\epsilon^\top \bigl[\hat{C}\,\mathrm{diag}(\hat{q})\,\hat{C}^\top\bigr]^{-1}\epsilon \overset{H_0}{\sim} \chi^2_{K-1}.$$

Here,

  • Insignificant residual → evidence is consistent with label shift; apply label shift adaptation.
  • Significant residual → label shift alone cannot explain the drift; go to the next step.

With $m$ labeled target points, estimate $\hat{C}^{(Q)}$ on the audit set and test $H_0 : \hat{C}^{(Q)} = \hat{C}^{(P)}$. If rejected, we have concept shift (the posterior changes); if not rejected but performance still drops, covariate shift is likely.

One-line algebraic decomposition (why the residual diagnoses violations)

Let $\hat{h}$ be the frozen classifier and $C^{(P)}$ its source confusion. On the target,

$$\hat{r} = C^{(Q)}q = C^{(P)}q + (C^{(Q)} - C^{(P)})q,$$

so the ideal label shift fit $C^{(P)}q$ misses exactly the concept drift term $(C^{(Q)} - C^{(P)})q$. The residual test above is a direct probe of this term without needing target labels.
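
A minimal sketch of this residual probe, mirroring the earlier $\chi^2$ test; names and inputs are illustrative.

```python
# Residual diagnostic: after fitting q_hat (BBSE or GS-B3SE), test whether
# eps = r_hat - C_hat q_hat is consistent with pure label shift.
import numpy as np
from scipy.stats import chi2

def residual_label_shift_check(r_hat, C_hat, q_hat, n_target):
    K = len(q_hat)
    eps = r_hat - C_hat @ q_hat
    Sigma = C_hat @ np.diag(q_hat) @ C_hat.T
    stat = n_target * eps @ np.linalg.pinv(Sigma) @ eps
    # small p-value: label shift alone cannot explain the drift
    return chi2.sf(stat, df=K - 1)
```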

Comment

Thank you once again for your thoughtful feedback. We would be grateful for any additional insights you could share so that we may resolve the remaining points promptly.

In particular, we would like to emphasize that label shift is not merely an observed phenomenon but an explicit working assumption (as stated in L90 of the manuscript). Clarifying this perspective is crucial for accurately positioning the contribution of the manuscript.

Could you kindly let us know if there are specific sections or arguments that still feel unclear from your standpoint? Your further comments will greatly help us refine the revision and ensure it meets the review standards.

Thank you very much for your time and consideration. We look forward to your response.

Review (Rating: 4)

This paper focuses on the label shift problem, where the source and target domains share the same conditional distribution, $P_s(X \mid Y) = P_t(X \mid Y)$, but differ in class priors, $P_s(Y) \neq P_t(Y)$. A common approach to this problem is the Black-Box Shift Estimator (BBSE) method, which estimates a confusion matrix $C$ and an empirical target prior $\hat{P}_t(Y)$ from finite samples to recover the true target prior $P_t(Y)$, as $P_t(Y) = C^{-1}\hat{P}_t(Y)$. However, BBSE ignores finite-sample noise and semantic structure information during estimation. Therefore, this paper proposes a hierarchical Bayesian model (Lines 133-134) to improve the BBSE method.

In this model, the authors introduce a similarity graph on labels and use the graph Laplacian as the covariance matrix in the distributions of latent variables, like $\theta$ and $\phi_i$. With these latent variables $\theta$ and $\phi_i$, the model can estimate both the empirical target prior $\hat{P}_t(Y)$ and the confusion matrix $C$, and then the BBSE method can be applied to recover the true target prior.

Strengths and Weaknesses

  • Strengths

    1. This paper proposes a novel method to improve BBSE, which couples the target prior and confusion matrix through a hierarchical Bayesian model.
    2. The underlying idea of replacing the point estimate of BBSE with a Bayesian estimate to calculate the confusion matrix and empirical target prior is reasonable.
    3. The proposed method is theoretically solid. In Proposition 2, the authors give an upper bound on the estimation error for $\theta$.
  • Weaknesses

    1. The main weakness lies in the experimental section. (1) Lack of ablation study. In Table 3, the proposed GS-B$^3$SE outperforms other methods. However, it relies on CLIP embeddings to construct the similarity graph, and there is no ablation comparing alternative graph construction methods. Since CLIP [1] demonstrates strong zero-shot performance on datasets like CIFAR-10 (zero-shot acc: 91.3) and CIFAR-100 (zero-shot acc: 65.1), the authors should provide additional experiments to clarify whether the observed improvements stem from the proposed method itself rather than the use of CLIP features. (2) Inconsistent graph construction. In Lines 274–278, the method for constructing the similarity graph differs across datasets. For MNIST, image embeddings are used, whereas for CIFAR-10 and CIFAR-100, text embeddings are employed. This inconsistency raises important questions: How do these different types of features impact the final performance? For a given dataset, how should one decide which type of embedding to use for graph construction?
    2. In Lines 110-112, the authors assume the number of predicted samples per class, $n_i^S$, follows a multinomial distribution. However, since $n_i^S$ is actually non-negative, this assumption may not hold.
    3. In Proposition 2, the notation $\theta_0$ is not defined.
    4. Proposition 2 illustrates how the minimum eigenvalue of the graph Laplacian influences the estimation error. However, it seems that label similarity (or edge weights) does not directly impact the estimation of variables. Can we directly optimize the graph Laplacian with specific constraints?

Questions

Please refer to the weaknesses part.

Limitations

The proposed method is novel and theoretically solid. However, the empirical evaluation is not convincing. And the presentations have some flaws.

Final Justification

The proposed method is novel and theoretically solid. Some concerns are addressed in the rebuttal. However, the evaluation and utilization aspects are still problematic. I raise my rating to slightly positive.

Formatting Issues

No paper formatting concerns.

Author Response

We would like to thank you for your very constructive comments.

Ablations on graph construction and the role of CLIP

We provide the additional ablations on graph construction.

In this experiment, we consider the following strategies.

  • CLIP-text (already in the current manuscript): cosine $k$-NN on text prompts.
  • Penultimate-image: Euclidean $k$-NN on frozen ResNet-18 penultimate features (identical backbone as the classifier).
  • Random-embeddings: same sparsity pattern as CLIP-text but edge weights i.i.d. $\mathcal{U}(0, 1)$ (destroys semantics, preserves Laplacian spectrum scale).
  • Identity graph: $L = 0$, switches the hierarchy off (reduces to an independent logistic-Normal prior and hence Bayesian BBSE with no smoothing).

The results are as follows.

| dataset   | graph        | $\|\hat{q} - q\|_1 \downarrow$ | acc $\uparrow$    |
|-----------|--------------|--------------------------------|-------------------|
| CIFAR-10  | CLIP-text    | 0.025 $\pm$ 0.004              | 0.844 $\pm$ 0.003 |
|           | penult-img   | 0.029 $\pm$ 0.005              | 0.839 $\pm$ 0.004 |
|           | random       | 0.043 $\pm$ 0.009              | 0.820 $\pm$ 0.006 |
|           | identity     | 0.051 $\pm$ 0.008              | 0.813 $\pm$ 0.004 |
|           | BBSE (freq.) | 0.112 $\pm$ 0.015              | 0.781 $\pm$ 0.004 |
| CIFAR-100 | CLIP-text    | 0.22 $\pm$ 0.02                | 0.783 $\pm$ 0.005 |
|           | penult-img   | 0.27 $\pm$ 0.03                | 0.771 $\pm$ 0.006 |
|           | random       | 0.46 $\pm$ 0.05                | 0.735 $\pm$ 0.009 |
|           | identity     | 0.70 $\pm$ 0.07                | 0.730 $\pm$ 0.010 |
|           | BBSE (freq.) | 1.62 $\pm$ 0.05                | 0.690 $\pm$ 0.006 |

Remarks

  • Any connected graph ($\lambda_2(L) > 0$) already helps over the frequentist point estimate; cf. the “random” line.
  • The more task-aligned the similarity (CLIP > penult-img > random), the lower the bias term.
  • The identity graph reproduces Bayesian BBSE with no smoothing and is almost identical to MLLS. Its gap to CLIP shows the specific benefit of our graph-coupled hierarchy.

Why do the graph ablations behave as the above?

Recall the posterior variance bound we already proved:

$$\mathrm{Var}(q_i \mid \text{data}) \leq \frac{C}{\lambda_2(L)N},$$

which holds for any connected graph.

If we write the mean-squared error as

$$\mathbb{E}[(\tilde{q}_i - q_i)^2] = (\mathbb{E}[\tilde{q}_i] - q_i)^2 + \mathrm{Var}(\tilde{q}_i),$$

then it shows that the variance term shrinks by a factor $1 / \lambda_2(L)$.

Even a $k$-NN graph whose edge weights are i.i.d. $\mathcal{U}(0, 1)$ has, with high probability,

$$\lambda_2(L_{\text{random}}) \gtrsim c\,\frac{k}{K},$$

for $k \geq c' \ln K$. Hence a purely random graph already cuts the class-wise standard error by $\sqrt{K / (ckN)}$ compared with the identity graph ($\lambda_2 = 0$). This explains why the “random” row in the ablation table still beats the deterministic frequentist BBSE (it keeps all the variance relief while paying only a bias penalty).
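
For reference, here is a minimal sketch (not our exact pipeline) of how such a label graph can be built: a cosine $k$-NN graph over class embeddings (e.g., CLIP text features), from which the algebraic connectivity $\lambda_2(L)$ in the bound above can be read off; `embeddings` and `k` are illustrative inputs.

```python
# Build a cosine k-NN label graph and compute lambda_2(L). Illustrative only.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

def knn_laplacian(embeddings, k=4):
    # k-NN graph with cosine distances, symmetrized by taking the max weight
    D = kneighbors_graph(embeddings, n_neighbors=k, metric="cosine",
                         mode="distance", include_self=False)
    D = D.maximum(D.T)
    D.data = 1.0 - D.data                 # cosine distance -> similarity weight
    L = laplacian(D, normed=False)
    eigvals = np.sort(np.linalg.eigvalsh(L.toarray()))
    return L, eigvals[1]                  # second-smallest eigenvalue = lambda_2
```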

Inconsistent graph construction (text vs. image embeddings)

The reviewer’s concern is “Why text on CIFAR but image on MNIST?”

  • On MNIST the class names are single digits (0, 1, 2, …), so CLIP-text gives almost zero variance (every digit is equally dissimilar). Image-space features do capture visual adjacency (e.g., 3 ↔ 8 ↔ 9).
  • On CIFAR the opposite holds: CLIP’s joint vision-language pre-training encodes high-level semantics missing from raw pixel features (e.g., “truck” closer to “bus” than to “cat”).

Multinomial assumption

Recap the notation:

  • $n_i^S$: # source-validation examples whose true class is $i$.
  • $N_i^S$: # predicted labels produced by the frozen classifier $\hat{h}$ among those $n_i^S$ examples.
  • $\tilde{N}$: # predicted labels on the $n'$ unlabeled target points.

These are integer-valued random vectors by construction. The standard sampling model is therefore

$$N_i^S \mid C \sim \mathrm{Mult}(n_i^S, C_{\cdot, i}), \quad \tilde{N} \mid C, q \sim \mathrm{Mult}(n', Cq),$$

where each realization of a multinomial distribution is, of course, a vector in $\mathbb{N}^K$.

The assumption is therefore fully consistent with the data’s discrete nature.

Later, we normalize these integers, and the normalized quantities are vectors in the simplex (non-negative reals summing to 1), not counts. The reviewer’s remark likely stems from seeing these normalized quantities and mistaking them for the random objects modeled in the manuscript. We will add an explicit sentence to avoid ambiguity.

For completeness, if one wishes to work directly with the normalized vectors, two rigorously equivalent formulations are available:

  • Dirichlet-multinomial posterior: conditioning a Dirichlet prior on the multinomial counts yields a Dirichlet posterior for the probability vector.
  • Gaussian approximation (large-$n$ delta method): $\sqrt{n}(\hat{r} - r) \to \mathcal{N}(0, \mathrm{diag}(r) - rr^\top)$.

Our Laplacian-Gaussian hierarchy can be re-expressed in either language.

Missing notation

Throughout the manuscript, we use

$$\theta_i = \log q_i - \frac{1}{K}\sum^K_{k=1}\log q_k,$$

i.e., the centered log-odds of a prior vector $q$.

Therefore, $\theta_0 = \theta(q_0)$ is simply the centered log-odds of the true target prior $q_0$.
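
A tiny sketch of this map and its inverse (a softmax back onto the simplex); names are illustrative.

```python
# Centered log-odds theta(q) and its inverse. Illustrative only.
import numpy as np

def to_theta(q):
    logq = np.log(q)
    return logq - logq.mean()            # centered log-odds theta(q)

def to_prior(theta):
    w = np.exp(theta - theta.max())      # numerically stable softmax
    return w / w.sum()
```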

Edge weights don’t seem to matter; can we optimize $L$?

Why the weights do matter

The upper bound we proved is

$$\frac{\tau_q}{N \lambda_{\min}(F_0) + \tau_q \lambda_2(L)}\|(L - L_0)\theta_0\|_2 + O_P(N^{-1}).$$

Here,

  • Algebraic connectivity $\lambda_2(L)$ (the smallest strictly-positive eigenvalue) appears in the denominator because it is the worst-case curvature of the quadratic penalty along any tangent direction.
  • The numerator contains the full matrix difference $(L - L_0)\theta_0$. This encodes the signed, edge-wise mismatch between our chosen weights and the unknown ground-truth graph $L_0$. If we perturb a single edge weight, the bound changes in direct proportion to that entry. Hence the pattern of weights does affect the bias term, and $\lambda_2(L)$ is merely the concise way of writing the spectral part of the denominator.

Can we optimize $L$?

Yes; nothing in the hierarchy forbids putting a prior on the weights.

Two concrete extensions are feasible:

  • Empirical Bayes (plug-in refinement): treat $L$ as a hyper-parameter and maximize the marginal likelihood $\mathcal{L}(L) = \log p(\text{data} \mid L)$. Because the posterior over $(q, C)$ is log-concave given $L$, the Laplace approximation $\mathcal{L}(L) \approx \log p(\text{data} \mid \hat{q}(L), \hat{C}(L), L) - \frac{1}{2}\log \det H^{-1}$ is cheap, and the Hessian $H$ is block-diagonal with Laplacian plus Fisher terms.
  • Fully Bayesian weight optimization: place independent Gamma or log-normal priors on each edge weight $w_{ij}$, and the quadratic form becomes $\frac{\tau}{2}\sum_{(i,j)\in E}w_{ij}(\theta_i - \theta_j)^2$. The conditional for $w_{ij}$ is conditionally log-Gaussian given $\theta$, and a single Gibbs slice or HMC dimension is enough (dimensionality equals the number of edges $\ll$ the number of parameters in $C$). The resulting chain mixes well because $w_{ij}$ appears only linearly in the quadratic exponent.

With random-weight priors identifiability still holds because the confusion-matrix block stays full-rank as before, $q$ remains on the simplex, and the new edge-weight parameters enter only through the positive-definite quadratic term, so the joint Fisher information expands to

$$\begin{pmatrix} N F_0 + \tau L(w) & 0 \\ 0 & \mathrm{diag}\bigl((\theta_i - \theta_j)^2\bigr) \end{pmatrix},$$

which is full-rank provided at least one $(\theta_i - \theta_j) \neq 0$.

When does optimizing $L$ help?

Using our upper bound, we can give a crisp sufficient condition:

$$\text{If}\quad \frac{\|(L^{\text{new}} - L_0)\theta_0\|}{\lambda_2(L^{\text{new}})} < \frac{\|(L^{\text{old}} - L_0)\theta_0\|}{\lambda_2(L^{\text{old}})},\quad \text{then the asymptotic MSE bound tightens}.$$

Hence fine-tuning is valuable whenever a small decrease in mismatch can be achieved without collapsing algebraic connectivity.

Comment

Thanks for the detailed rebuttal. Some minor points are fully addressed and I can understand this paper better. The evaluation part is also strengthened by the additional ablation study. However, I still have concerns about the practical aspect of this method. Based on the overall quality and the rebuttal, I will raise my rating.

Review (Rating: 4)

This paper proposes a new method called GS-B³SE to address a practical problem in machine learning: label shift. This problem refers to a scenario where the class proportions (P(y)) change between the training and test sets, while the class-conditional feature distributions (P(x|y)) remain the same.

Existing black-box methods (like BBSE) typically provide only a single point estimate, which makes them sensitive to sampling noise and completely ignores the potential semantic similarity between different classes (e.g., 'cat' and 'dog' are more similar than 'cat' and 'car').

The authors' core contribution is a fully Bayesian framework to address these issues. They don't just estimate a single target distribution q; instead, they introduce probabilistic priors for both the target distribution q and the classifier's confusion matrix C. The most crucial step is their use of a pre-constructed "class similarity graph" to impose a smoothing constraint on these priors. This graph structure allows semantically similar classes to "borrow" information from each other during parameter estimation. The benefit of this approach is that the model can better handle classes with sparse data, and the resulting estimates are more robust due to the consideration of uncertainty.

Strengths and Weaknesses

Strengths

  • The idea of introducing a graph structure into label shift estimation and innovatively using it to simultaneously smooth the priors for both the target distribution q and the confusion matrix C is very clever.

  • The theoretical analysis is robust. Furthermore, the paper interprets the method from an information geometry perspective, which undoubtedly adds depth to the work.

  • The paper is written clearly, making it easy for readers to follow the authors' line of thought.

Weaknesses

  • The entire model's core advantage is built upon the pre-given class similarity graph, L. This graph is constructed using embeddings from a pre-trained model (like CLIP) and remains fixed. This raises a significant question: To what extent does the quality of this graph determine the model's performance ceiling? What happens if the semantic similarity captured by the embeddings does not align with the classifier's confusion patterns? Although the paper analyzes the model's robustness to an inaccurate graph in Proposition 2, the bias term $\|(L - L_0)\theta_0\|$ remains, and it's unclear to the reader how large this term might be in practice. Positing graph learning as future work is reasonable, but this is indeed a core limitation of the current method.

  • Model inference relies on HMC sampling or a block Newton-CG optimizer. HMC is notoriously slow, especially for high-dimensional posteriors. While the authors mention a "fast" Newton-CG scheme, they provide no comparison of model runtimes in the experimental section. For a module positioned as a "pure post-processing" step, its practicality diminishes if its computational overhead is orders of magnitude higher than baselines. Its scalability, especially as the number of classes K becomes large (e.g., on CIFAR-100), is a question that needs to be answered.

  • The model assumes that the columns of the confusion matrix C are similar because "columns corresponding to neighbouring predicted classes exhibit similar shape." Column i of C, C_:,i, represents the probability distribution over predicted labels for a sample whose true class is i. Why should one assume that the predicted distribution for a true 'truck' is similar in shape to the predicted distribution for a true 'bus'? While plausible, this is not self-evident. An alternative approach, such as smoothing the rows of C, seems equally plausible (i.e., for a given predicted label like 'truck', the distribution of its true source labels is similar to that for 'bus'). A more detailed discussion and justification for this specific design choice would be beneficial.

Questions

  • Regarding Graph Quality and Sensitivity: The model's performance appears to be highly dependent on the pre-constructed graph. To better understand this dependency, could you conduct some "destructive" experiments? For instance, how much does performance degrade when using a lower-quality graph (e.g., a k-NN graph generated from random embeddings)? This would help us more intuitively understand the practical significance of the robustness bound in Proposition 2.

  • Regarding Computational Efficiency: Thank you for the detailed experiments. Could you supplement them with the actual wall-clock runtimes for GS-B³SE and the main baselines (e.g., MLLS, BBSE)? This comparison would be particularly valuable for the CIFAR-100 dataset (with K=100) and would allow potential users to better weigh the trade-off between improved accuracy and computational cost.

  • Regarding Model Design Choices: As mentioned earlier, you chose to smooth the column vectors of the confusion matrix. Could you elaborate on the intuition behind this design? Did you consider or experiment with other smoothing approaches, such as smoothing the row vectors? Is there theoretical or empirical evidence to suggest that smoothing the columns is the superior choice?

Limitations

The limitations are discussed in the conclusion section.

Final Justification

Overall, the rebuttal has strengthened the paper by clarifying theoretical underpinnings and providing new data. While the practical trade-offs regarding the dependency on high-quality graphs for peak performance and the increased computational cost remain relevant points for discussion, the core contributions of the paper are sound and well-defended. The rebuttal has successfully addressed the most critical theoretical questions, even if some practical limitations persist.

Formatting Issues

N/A

Author Response

We would like to express our utmost appreciation for your very detailed comments on our manuscript.

Graph quality & sensitivity

Recall Prop. 2; the following table summarizes the bias and variance of vanilla BBSE versus Bayesian shrinkage.

| estimator    | bias | variance (per class) |
|--------------|------|----------------------|
| vanilla BBSE | 0 (in expectation) | $\mathrm{Var}(\hat{q}_i) = O(\kappa_i^{-2} / N)$ |
| ours         | $b_i = \frac{\tau_q}{N \lambda_{\min}(F_0) + \tau_q \lambda_2(L)}[(L - L_0)\theta_0]_i$ | $\mathrm{Var}(\bar{q}_i) \leq \frac{C}{\lambda_2(L)N}$ |

Here, $\kappa_i$ is the $i$-th sensitivity of the matrix inverse, which can be tiny for rare classes, hence the variance blows up. Thus

  • Vanilla BBSE is unbiased but pays no attention to sampling noise, and its variance grows like the squared condition number of $\tilde{C}$.
  • The Bayesian shrinkage version trades a bit of bias $b_i$ for a potentially huge reduction in variance. Specifically, the amplitude factor is zero when the graph is correct, and otherwise it measures how much the misspecified graph distorts $\theta_0$ along non-null Laplacian directions. In addition, the damping $N^{-1}$ and $\lambda_2(L)$ ensure the bias decays with i) more data and ii) greater algebraic connectivity. That is, even for a badly misspecified graph the leading term cannot exceed $\tau_q / (N \lambda_{\min}(F_0))\,\|\theta_0\|_2$, i.e., the same order as the frequentist standard error.

Below, we report the additional experimental results on the edge-corruption study.

| Dataset   | $K$ | Corruption rate $\rho$ | Saerens-corrected acc |
|-----------|-----|------------------------|-----------------------|
| MNIST     | 10  | 0.00                   | 0.986                 |
|           |     | 0.25                   | 0.982                 |
|           |     | 0.50                   | 0.977                 |
|           |     | 0.75                   | 0.971                 |
|           |     | 1.00                   | 0.961                 |
| CIFAR-100 | 100 | 0.00                   | 0.783                 |
|           |     | 0.25                   | 0.778                 |
|           |     | 0.50                   | 0.769                 |
|           |     | 0.75                   | 0.752                 |
|           |     | 1.00                   | 0.735                 |

Protocol. We start from the CLIP-derived $k$-NN graph $L_0$ ($k$ = 4 for $K$ = 10, $k$ = 8 for $K$ = 100). A corruption level $\rho$ means we randomly re-wire $\rho \cdot |E|$ undirected edges (preserving degrees); the resulting Laplacian is $L$. All other hyper-parameters are kept fixed.
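
A hedged sketch of this re-wiring step, assuming the graph is held as a networkx Graph; the unit edge weights are an illustrative simplification (our graphs are weighted).

```python
# Rewire a fraction rho of the undirected edges while preserving node degrees,
# then rebuild the (unweighted, illustrative) Laplacian. Not the paper's code.
import networkx as nx

def corrupt_graph(G, rho, seed=0):
    H = G.copy()
    n_swaps = int(rho * H.number_of_edges())
    if n_swaps > 0:
        nx.double_edge_swap(H, nswap=n_swaps, max_tries=100 * n_swaps, seed=seed)
    return nx.laplacian_matrix(H).astype(float)   # corrupted Laplacian L
```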

What does a “completely random” graph do?

In the above table, $\rho = 1.00$ means a completely random graph, yet the resulting performance is still competitive against MLLS and other methods. To understand this observation, we can give the following discussion.

Take the Laplacian $L$ obtained by connecting each label to $k$ random neighbors (uniform among the other $K - 1$ labels). Then,

  1. For an Erdős–Rényi-like graph the second eigenvalue satisfies $\lambda_2(L) \approx k$ with high probability, so the variance bound $\mathrm{Var}(\bar{q}_i) \leq C / \lambda_2(L)N$ is smaller than for the identity Laplacian ($\lambda_2 = 0$). In other words, random edges provide extra shrinkage.
  2. Because the edge directions are symmetric, each row of $L - L_0$ has positive and negative entries that tend to cancel when dotted with the centered log-odds $\theta_0$: $\mathbb{E}[(L - L_0)\theta_0] = 0$, and the random fluctuation has magnitude $O_P(1)$. Therefore, $\|b_i\| = O_P(N^{-1})$, i.e., the bias is the same order as the stochastic error we cannot avoid anyway.
  3. Putting the pieces together, we have the following, which holds precisely in the rare-class / ill-conditioned regime where BBSE collapses:

$$b_i^2 + \mathrm{Var}(\bar{q}_i) \ll \mathrm{Var}(\hat{q}_i) \quad \text{when } \lambda_2(L) \gg 1 \text{ or } \kappa_i^{2} \ll 1.$$

Therefore, even though the random graph carries no semantic signal, its connectivity slashes variance enough to compensate for the negligible $N^{-1}$ bias, so the total error remains well below BBSE.

In addition, we provide the additional ablations on graph construction.

  • CLIP-text (already in the current manuscript): cosine $k$-NN on text prompts.
  • Penultimate-image: Euclidean $k$-NN on frozen ResNet-18 penultimate features (identical backbone as the classifier).
  • Random-embeddings: same sparsity pattern as CLIP-text but edge weights i.i.d. $\mathcal{U}(0, 1)$ (destroys semantics, preserves Laplacian spectrum scale).
  • Identity graph: $L = 0$, switches the hierarchy off (reduces to an independent logistic-Normal prior and hence Bayesian BBSE with no smoothing).
| dataset   | graph        | $\|\hat{q} - q\|_1 \downarrow$ | acc $\uparrow$    |
|-----------|--------------|--------------------------------|-------------------|
| CIFAR-10  | CLIP-text    | 0.025 $\pm$ 0.004              | 0.844 $\pm$ 0.003 |
|           | penult-img   | 0.029 $\pm$ 0.005              | 0.839 $\pm$ 0.004 |
|           | random       | 0.043 $\pm$ 0.009              | 0.820 $\pm$ 0.006 |
|           | identity     | 0.051 $\pm$ 0.008              | 0.813 $\pm$ 0.004 |
|           | BBSE (freq.) | 0.112 $\pm$ 0.015              | 0.781 $\pm$ 0.004 |
| CIFAR-100 | CLIP-text    | 0.22 $\pm$ 0.02                | 0.783 $\pm$ 0.005 |
|           | penult-img   | 0.27 $\pm$ 0.03                | 0.771 $\pm$ 0.006 |
|           | random       | 0.46 $\pm$ 0.05                | 0.735 $\pm$ 0.009 |
|           | identity     | 0.70 $\pm$ 0.07                | 0.730 $\pm$ 0.010 |
|           | BBSE (freq.) | 1.62 $\pm$ 0.05                | 0.690 $\pm$ 0.006 |

Remarks

  • Any connected graph ($\lambda_2(L) > 0$) already helps over the frequentist point estimate; cf. the “random” line.
  • The more task-aligned the similarity (CLIP > penult-img > random), the lower the bias term.
  • The identity graph reproduces Bayesian BBSE with no smoothing and is almost identical to MLLS. Its gap to CLIP shows the specific benefit of our graph-coupled hierarchy.

Why do the graph ablations behave as the above?

Recall the posterior variance bound we already proved:

$$\mathrm{Var}(q_i \mid \text{data}) \leq \frac{C}{\lambda_2(L)N},$$

which holds for any connected graph.

If we write the mean-squared error as

$$\mathbb{E}[(\tilde{q}_i - q_i)^2] = (\mathbb{E}[\tilde{q}_i] - q_i)^2 + \mathrm{Var}(\tilde{q}_i),$$

then it shows that the variance term shrinks by a factor $1 / \lambda_2(L)$.

Even a $k$-NN graph whose edge weights are i.i.d. $\mathcal{U}(0, 1)$ has, with high probability,

$$\lambda_2(L_{\text{random}}) \gtrsim c\,\frac{k}{K},$$

for $k \geq c' \ln K$. Hence a purely random graph already cuts the class-wise standard error by $\sqrt{K / (ckN)}$ compared with the identity graph ($\lambda_2 = 0$). This explains why the “random” row in the ablation table still beats the deterministic frequentist BBSE (it keeps all the variance relief while paying only a bias penalty).

Summarizing the above discussion, we obtain the following remark.

Shrinkage beats instability. BBSE pays no price for bias but suffers large sampling variance, especially when classes are imbalanced and $\tilde{C}$ is noisy. Graph-based Bayesian pooling, even with arbitrary edges, acts like a data-dependent ridge penalty that stabilises all estimates.

Computational cost

To address this point, we first summarize the per-iteration algebraic cost as follows.

| operation             | cost on $k$-NN graph                | remarks |
|-----------------------|-------------------------------------|---------|
| vanilla BBSE          | $O(K^3)$ once                       | dense LU factorisation of the $K \times K$ confusion matrix |
| Newton–CG step (ours) | $O(|E|) + O(K^2)$                   | gradient is linear in $|E|$; a Hessian-vector product needs one sparse Laplacian matvec $O(|E|)$ plus one dense Jacobian matvec $O(K^2)$; conjugate gradient converges in $O(\sqrt{\kappa})$ iterations, empirically $\leq 5$ |
| HMC leap-frog         | same $O(|E|) + O(K^2)$ per gradient | trajectories of length 10–20 give effective sample size $> 200$ for every $q_i$ |

Therefore the dominant term grows only quadratically in the class count $K$ (the sparse Laplacian never densifies), whereas BBSE’s one-off inverse is cubic.

Next, we report the wall-clock runtimes (single NVIDIA T4, PyMC 5.10, float32).

| dataset   | $K$ | BBSE   | MLLS   | ours   |
|-----------|-----|--------|--------|--------|
| MNIST     | 10  | 0.80 s | 1.12 s | 12.2 s |
| CIFAR-100 | 100 | 2.88 s | 5.37 s | 54.5 s |

As you suggested, we believe these results help users weigh the accuracy–complexity trade-off.

Design choice

The identifiability relation used by all BBSE-type estimators is $\tilde{q} = Cq$, where $\tilde{q} \in \Delta^{K-1}$ is the prediction histogram on the target domain, the unknown parameter $q \in \Delta^{K-1}$ is the target prior, and the $i$-th column $c_i = C_{\cdot,i}$ is $c_i = [p(\hat{h}(X) = j \mid Y = i)]^K_{j=1}$. Here, only the columns $c_i$ multiply the unknown vector $q$; the rows $r_j = C_{j,\cdot}^\top$ never appear in the above relation, so they matter only if one later inverts $C$. Hence, any prior that tries to stabilize the estimation of $q$ should regularize the $c_i$'s, not the $r_j$'s.

Next, write a first-order delta expansion of the posterior mean

$$\tilde{q} = q_0 + \sum^K_{i=1}(F_0^{-1}e_i)\langle \delta c_i, e_i \rangle + O_P(N^{-1}),$$

where $F_0 = \mathrm{diag}(r_0) - r_0 r_0^\top$ is the Fisher matrix of the multinomial likelihood for $\tilde{q}$, $\delta c_i = c_i - c_{0,i}$ denotes the estimation noise on the $i$-th column, and $e_i$ is the $i$-th standard basis vector. Taking variances and using independence of source and target samples,

$$\mathrm{Var}(\bar{q}_k) = \sum^K_{i=1} [F_0^{-1}]^2_{ki}\,\mathrm{Var}(\langle \delta c_i, e_i \rangle) + O(N^{-1}).$$

Thus,

  • if we Laplacian-shrink the columns $c_i$, we get a direct variance reduction by a factor $(\tau_C\lambda_2(L))^{-1}$, exactly the bound proved in Cor. 1 of the manuscript, and
  • if we instead Laplacian-shrink the rows $r_j$, the random variable $\langle \delta c_i, e_i \rangle = \delta C_{i,i}$ does not get penalized, so the leading variance term is unaffected. In other words, row smoothing hardly helps for the quantity we ultimately care about, $q$.

Semantic interpretation

  • Predictive columns $c_i$ capture how much class $i$ is confused with every other class by the frozen classifier. Two semantically close source classes (e.g., truck and bus) tend to produce similar confusion fingerprints, so a Laplacian on the $i$-index is natural.
  • Rows $r_j$ describe where each predicted label actually comes from in the source distribution, a statistic that never appears in the identifiability relation above and is only weakly correlated with the target-prior calculation.

Comment

Overall, the rebuttal has strengthened the paper by clarifying theoretical underpinnings and providing new data. While the practical trade-offs regarding the dependency on high-quality graphs for peak performance and the increased computational cost remain relevant points for discussion, the core contributions of the paper are sound and well-defended. The rebuttal has successfully addressed the most critical theoretical questions, even if some practical limitations persist.

Review (Rating: 4)

By treating the confusion matrix as latent variables, this manuscript considers a fully Bayesian alternative to classical black-box label-shift estimators in the standard label-shift problem. The estimator can be computed via HMC or a Newton–conjugate-gradient optimizer. Numerical simulations and real-data experiments are run to test the proposed algorithm.

Strengths and Weaknesses

I'm not an expert in label shift or ML with graphs, so I am probably not able to comment on the strengths and weaknesses. Nevertheless, I do have a few questions regarding the submission. Please see the next section.

Questions

  1. In practice, how is the similarity graph constructed? I suppose such graph is usually noisy in real-world data. How will the noise affect the performance of the proposed algorithm?

  2. How is the computational complexity of the algorithm compared with the standard BBSE? I suppose sampling from and inferring with the posterior distribution can be extremely time consuming in such Bayesian setting?

  3. One of the motivations of the algorithm is to deal with situations with rare classes. How does the algorithm perform in such cases? I suppose in this case, the choice of prior distribution should affect the performance severely? Will it make the proposed algorithm less useful in such extreme conditions?

Limitations

Please see the questions above, especially on the question for the rare classes case.

Final Justification

I would like to thank the author for their response. In this case, I would like to keep the score: 4 borderline accept.

Formatting Issues

None.

Author Response

First of all, we would like to thank you for your very constructive comments. The reviewer listed three questions with numbers, and we will respond to each of them.

1. “How is the similarity graph built, and what if it is noisy?”

We build a $k$-NN graph once from off-the-shelf embeddings (CLIP for CIFAR, averaged penultimate-layer features for MNIST). Noisy edges are inevitable, and that is exactly why the prior places a continuous Laplacian precision $\tau_q L$ instead of a hard constraint.

Specifically, Prop. 2 (Robustness to Laplacian misspecification) quantifies the bias when $L \neq L_0$:

$$\|\bar{\theta}_N - \theta_0\|_2 \leq \frac{\tau_q}{N \lambda_{\min}(F_0) + \tau_q \lambda_2(L)}\|(L - L_0)\theta_0\|_2 + O_P(N^{-1}),$$

so the bias shrinks as $N^{-1}$ and is damped when the graph is well-connected ($\lambda_2(L)\uparrow$). Thus moderate graph noise merely slows convergence; it does not break consistency.

Based on your feedback, we realized that we could better clarify the motivation for our formulation by adding the following direct corollary.

Corollary. From Prop. 2, set $N \to \infty$ to bound the asymptotic bias by $\tau_q\lambda_2(L)^{-1}\|(L - L_0)\theta_0\|$.

In addition, we provide the additional ablations on graph construction. In this experiment, we consider the following strategies.

  • CLIP-text (already in the current manuscript): cosine $k$-NN on text prompts.
  • Penultimate-image: Euclidean $k$-NN on frozen ResNet-18 penultimate features (identical backbone as the classifier).
  • Random-embeddings: same sparsity pattern as CLIP-text but edge weights i.i.d. $\mathcal{U}(0, 1)$ (destroys semantics, preserves Laplacian spectrum scale).
  • Identity graph: $L = 0$, switches the hierarchy off (reduces to an independent logistic-Normal prior and hence Bayesian BBSE with no smoothing).

The results are as follows.

| dataset   | graph        | $\|\hat{q} - q\|_1 \downarrow$ | acc $\uparrow$    |
|-----------|--------------|--------------------------------|-------------------|
| CIFAR-10  | CLIP-text    | 0.025 $\pm$ 0.004              | 0.844 $\pm$ 0.003 |
|           | penult-img   | 0.029 $\pm$ 0.005              | 0.839 $\pm$ 0.004 |
|           | random       | 0.043 $\pm$ 0.009              | 0.820 $\pm$ 0.006 |
|           | identity     | 0.051 $\pm$ 0.008              | 0.813 $\pm$ 0.004 |
|           | BBSE (freq.) | 0.112 $\pm$ 0.015              | 0.781 $\pm$ 0.004 |
| CIFAR-100 | CLIP-text    | 0.22 $\pm$ 0.02                | 0.783 $\pm$ 0.005 |
|           | penult-img   | 0.27 $\pm$ 0.03                | 0.771 $\pm$ 0.006 |
|           | random       | 0.46 $\pm$ 0.05                | 0.735 $\pm$ 0.009 |
|           | identity     | 0.70 $\pm$ 0.07                | 0.730 $\pm$ 0.010 |
|           | BBSE (freq.) | 1.62 $\pm$ 0.05                | 0.690 $\pm$ 0.006 |

Remarks

  • Any connected graph ($\lambda_2(L) > 0$) already helps over the frequentist point estimate; cf. the “random” line.
  • The more task-aligned the similarity (CLIP > penult-img > random), the lower the bias term.
  • The identity graph reproduces Bayesian BBSE with no smoothing and is almost identical to MLLS. Its gap to CLIP shows the specific benefit of our graph-coupled hierarchy.

Why do the graph ablations behave as the above?

Recall the posterior variance bound we already proved:

$$\mathrm{Var}(q_i \mid \text{data}) \leq \frac{C}{\lambda_2(L)N},$$

which holds for any connected graph.

If we write the mean-squared error as

$$\mathbb{E}[(\tilde{q}_i - q_i)^2] = (\mathbb{E}[\tilde{q}_i] - q_i)^2 + \mathrm{Var}(\tilde{q}_i),$$

then it shows that the variance term shrinks by a factor $1 / \lambda_2(L)$.

Even a $k$-NN graph whose edge weights are i.i.d. $\mathcal{U}(0, 1)$ has, with high probability,

$$\lambda_2(L_{\text{random}}) \gtrsim c\,\frac{k}{K},$$

for $k \geq c' \ln K$. Hence a purely random graph already cuts the class-wise standard error by $\sqrt{K / (ckN)}$ compared with the identity graph ($\lambda_2 = 0$). This explains why the “random” row in the ablation table still beats the deterministic frequentist BBSE (it keeps all the variance relief while paying only a bias penalty).

2. “Computational complexity vs. vanilla BBSE?”

We measure work in floating-point operations (flops) and write $K$ = number of classes and $|E|$ = number of graph edges.

For the $k$-nearest neighbor graphs we use, $|E| = kK$ with a small, $K$-independent constant $k$ (4 for CIFAR-10, 8 for CIFAR-100).

  • Vanilla BBSE: The point estimator solves a single linear system $\hat{q} = \tilde{C}^{-1}\tilde{r}$, where $\tilde{C} \in \mathbb{R}^{K \times K}$ is dense once any amount of Laplace or Tikhonov regularisation is applied. Gaussian elimination therefore costs $\mathcal{O}(K^3)$ flops and $\mathcal{O}(K^2)$ memory.
  • GS-B$^3$SE (one Newton–CG sweep): Write $\Phi = (\phi_1,\dots,\phi_K)^\top \in \mathbb{R}^{K\times (K-1)}$ and $\theta \in \mathbb{R}^{K-1}$, both constrained to the centered subspace $\{v : v^\top \mathbf{1} = 0\}$. At each sweep we perform three algebraic operations:
    • update each column $\Phi \leftarrow \Phi - \mathcal{H}_{\Phi}^{-1}\nabla\mathcal{L}$,
    • update $\theta$ analogously, and
    • Gamma shape/scale updates for $\tau_q$, $\tau_C$.

Adding them and using $|E| = kK$, one sweep costs $\mathcal{O}(|E|K + K^2) = \mathcal{O}(kK^2)$, and memory usage is $\mathcal{O}(|E| + K^2)$.

Overall complexity.

Thm. 2 establishes that $F(q)$ is $\alpha$-strongly geodesically convex with $\alpha = n'\lambda_{\min}(F_0) + \tau_q \lambda_2(L) > 0$. Consequently a back-tracking Newton method requires $T = \mathcal{O}(\ln 1/\epsilon)$ sweeps to reach tolerance $\epsilon$.

Hence the total complexity is

$$\mathcal{O}(kK^2 \ln 1/\epsilon) \quad \text{vs.} \quad \mathcal{O}(K^3)\ (\text{vanilla BBSE}).$$

Because $k$ is at most 8 and $\ln 1/\epsilon \le 10$ in all our runs, the crossover happens already at $K \ge 25$.

Here, we report the wall-clock runtimes (single NVIDIA T4, PyMC 5.10, float32).

| dataset   | $K$ | BBSE   | MLLS   | ours   |
|-----------|-----|--------|--------|--------|
| MNIST     | 10  | 0.80 s | 1.12 s | 12.2 s |
| CIFAR-100 | 100 | 2.88 s | 5.37 s | 54.5 s |

3. “Behavior with very rare classes; isn’t everything prior-driven?”

For BBSE every class $i$ is treated in isolation: the plug-in variance satisfies

$$\mathrm{Var}_{\text{BBSE}}(\hat{q}_i) = \mathcal{O}\bigl((n_i^S)^{-1}\bigr),$$

so as soon as the validation set contains $n_i^{S} = O(1)$ points the error explodes.

In our hierarchy the Gaussian Markov random field ties all columns of the confusion matrix and the prior vector together through the graph Laplacian; information can therefore flow from frequent to infrequent neighbors. Algebraically this shows up in the curvature of the log-posterior:

$$-\nabla^2\log p(q, C \mid \text{data}) = N F_0 + \tau_q L + \tau_C\,\mathrm{blockdiag}(L,\dots,L),$$

with the data Fisher block $N F_0$ pooling global target counts, and the prior blocks injecting $\lambda_2(L)$-weighted curvature even when $n_i^{S} = 0$.

Corollary 1 states, for any class $i$,

$$\mathrm{Var}(q_i) \leq \frac{C}{\lambda_2(L)N'}.$$

Thus the worst-case posterior variance still decays like $N^{-1}$, and the only penalty for a vanishing $n_i^{S}$ is the $1/\lambda_2(L)$ factor, i.e., how well the rare label is connected to the rest of the ontology.

Specifically, we can analyze the algorithms under a Zipf-like class-frequency law.

Suppose that source class priors follow a Zipf law

$$p_i = \frac{i^{-b}}{H_{K,b}}, \quad H_{K,b} \coloneqq \sum^K_{j=1}j^{-b},\ b > 1.$$

The labelled-source sample sizes are therefore

$$n_i^S = N_S p_i = N_S \frac{i^{-b}}{H_{K,b}}, \quad N_S = \sum_j n_j^S.$$

Write $N = N_S + n'$ with $N_S/N \to \rho \in (0,1)$. Target priors $q$ are arbitrary but $\min_i q_i > 0$, and the frozen classifier's confusion matrix $C$ is assumed invertible with condition number $\kappa := \|C^{-1}\|_2 \|C\|_2 < \infty$.

Denote by $\hat{C}$ and $\hat{r}$ the empirical confusion matrix and target prediction histogram. Linearizing the BBSE estimator around the truth gives the usual delta-method expansion:

$$\hat{q} - q = - C^{-1}(\hat{C} - C) q + C^{-1}(\hat{r} - r) + R_N,$$

with $\|R_N\|_2 = o(N^{-1/2})$.

Here, $\hat{r} \sim \mathrm{Mult}(n', r)$ with $r = Cq$, and hence

$$\mathrm{Var}\bigl([C^{-1}(\hat{r} - r)]_i\bigr) = \frac{1}{n'}\, e_i^\top C^{-1}\bigl[\mathrm{diag}(r) - r r^\top\bigr](C^{-1})^\top e_i \leq \frac{\kappa^2}{4n'}.$$

For label $i$ the column $\hat{C}_{\cdot,i}$ is a multinomial vector with size $n_i^S$ and mean $C_{\cdot,i}$. Let

$$\Delta_i \coloneqq \hat{C}_{\cdot,i} - C_{\cdot,i}, \quad \mathrm{Cov}(\Delta_i) = \frac{1}{n_i^S}\bigl[\mathrm{diag}(C_{\cdot,i}) - C_{\cdot,i}C_{\cdot,i}^\top\bigr].$$

Because only column $i$ depends on $n_i^{S}$, its contribution to the target-prior error is

$$C^{-1}\Delta_i q_i \;\Longrightarrow\; \mathrm{Var}\bigl([C^{-1}\Delta_i q_i]_j\bigr) \leq \frac{\kappa^2 q_i^2}{4 n_i^S}.$$

For a rare label $i$ we typically have $n_i^S \ll n'$, and we obtain

$$\mathbb{E}[(\hat{q}_i - q_i)^2] = \mathcal{O}(i^b / N) \quad (\text{vanilla BBSE}).$$

Thus the MSE deteriorates polynomially in the rank $i$.

For our graph-smoothed Bayesian estimator, on the other hand, from Cor. 1 and Prop. 2 we have

$$\mathbb{E}[(\tilde{q}_i - q_i)^2] = \frac{C}{\lambda_2(L)N} + O(N^{-2}).$$

Thus, the risk no longer depends on the tail exponent $b$ or the rank $i$.
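
An illustrative numeric check of these two rates; the constants kappa, C_bound and lambda_2 below are placeholders, not values from the paper.

```python
# Per-class error rates under a Zipf source prior: the BBSE variance bound
# grows like i^b / N with the class rank i, while the graph-smoothed bound
# C / (lambda_2(L) N) is rank-independent. Illustrative constants only.
import numpy as np

def per_class_error_rates(K=100, b=1.2, N=50_000, rho=0.5,
                          kappa=5.0, C_bound=1.0, lambda_2=4.0):
    i = np.arange(1, K + 1)
    p = i ** (-b)
    p /= p.sum()                                   # Zipf source prior p_i
    n_src = rho * N * p                            # expected source counts n_i^S
    q = np.full(K, 1.0 / K)                        # any target prior with q_i > 0
    bbse_var = kappa**2 * q**2 / (4 * n_src)       # column-noise term, O(i^b / N)
    gs_var = np.full(K, C_bound / (lambda_2 * N))  # rank-independent bound
    return bbse_var, gs_var
```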

Comment

Hello again,

While I understand that you are not an expert in this topic, please read the Authors' feedback and prepare a suitable short reply to them. It will be really helpful.

Regards,

AC

Final Decision

Summary:

This submission addresses the problem of label shift; a problem that arises in practice after deploying ML prediction models. This work extends the vanilla model for detecting label shift, namely the Black Box Shift Estimator (BBSE), to a fully Bayesian framework to address two limitations in BBSE: (i) its sensitivity to sampling noise, and (ii) its ignorance to potential semantic similarity between different classes, especially when the number of classes is large.

The newly proposed method, GS-B$^3$SE, is a fully probabilistic alternative to BBSE. In particular, it introduces probabilistic Laplacian-Gaussian priors on both the target distribution log-priors ($p$) and the columns of the confusion matrix $C$, tying them together on a label similarity graph that captures the semantic information between classes. Thus, the class similarity graph acts as a smoothing constraint on these priors. In particular, the graph Laplacian is used as the covariance matrix in the distribution of latent variables. The graph structure allows semantically similar classes to borrow some information from each other during parameter estimation. In this way, the model can handle classes with sparse data, and the resulting estimates are more robust thanks to the proper quantification (and handling) of uncertainty. In some sense, vanilla BBSE can be seen as a special case of GS-B$^3$SE.

Comments

  • All reviewers agree on the following strengths of the submission:

    1. The paper is well-written, easy to follow, and proposes a novel, mathematically sound formulation for label shift estimation. The use of the graph structure to simultaneously smooth the priors (for the target distribution and the confusion matrix) is a clever idea.
    2. The theoretical analysis is robust: complete proofs for identifiability, optimal $N^{-1/2}$ convergence, and variance reduction scaling with algebraic connectivity $\lambda_2(L)$.
    3. The information-geometric analysis provided by the authors (Sec. 6.1) reveals how graph priors act as $e$-convex barriers, linking optimization to Riemannian dynamics.
  • All concerns, questions, and confusions raised by the reviewers were properly and thoroughly addressed by the authors, and the final result is unanimous agreement among all reviewers on a weak-accept rating. All reviewers shared a common concern: the computational overhead of GS-B$^3$SE is above moderate. While this might be true, especially while the method is still under development, this concern is secondary when compared to the other strengths of this submission and the merit of the idea in general. Indeed, after the authors' rebuttal, none of the reviewers decreased their ratings due to this concern. In fact, Rev. 3 (S2qe), who initially assessed the paper as weak reject, increased the rating after the rebuttal to weak accept.

  • To summarize: this is a solid submission for a highly relevant problem to the ML community, with a novel idea, sufficient empirical validation, and strong theoretical justification. The AC recommends acceptance for this submission.