PaperHub
Rating: 5.8/10 · Poster · 4 reviewers
Individual ratings: 7, 5, 5, 6 (min 5, max 7, std 0.8)
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Dissecting the Failure of Invariant Learning on Graphs

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-03

Abstract

Keywords
Machine Learning, Out-of-distribution Generalization, Invariant Learning, Graph Machine Learning

Reviews & Discussion

Review
Rating: 7

The authors propose a novel regularization and alignment mechanism for invariant graph learning with the goal of obtaining better out-of-distribution (OOD) generalization. Besides proving that existing methods tackling OOD generalization in other domains do not transfer to graphs, they propose CIA (environment-label dependent) and CIA-LRA (environment-label independent) to tackle the challenge of OOD generalization on graph data. Both theoretical and empirical evidence is provided showing that the proposed methods provide benefits in OOD generalization.

Strengths

  • theoretical derivation of the failure of IRM and VREx on graph data
  • theoretical analysis of CIA and CIA-LRA
  • empirical evidence that the proposed methods achieve OOD generalization significantly better than IRM and VREx
  • detailed discussion of further findings and proofs in appendix

Weaknesses

Methodology

  • in Fig. 1 it is not clearly described what the variables exactly refer to (e.g., $G$ is not mentioned in the text)
  • while I appreciate the theoretical findings, the GNN defined in Eq. 3 is a linear model, hence the theoretical analysis is somewhat limited. However, one should keep in mind that the analysis of non-linear and highly complex networks is hard
  • line 204: it is a bit counter-intuitive why invariant node features of nodes with the same class should significantly diverge when they are far apart in the graph. Also the empirical evidence in Fig. 7/8 does not really support this statement.
  • a brief definition/discussion of concept shift and covariate shift would be beneficial for clarity
  • in line 232: why is a sum operator employed here instead of a mean? Won't a simple sum be highly influenced by the graph size and connectivity/density?
  • line 248: inconsistent naming ($x_{cau}$ vs. $x_{inv}$)
  • in Theorem 4.2 it is unclear what $\alpha$ is and what the inequality $0 < \alpha < \frac{1}{4}$ means. Further, a bit of discussion on the goodness/tightness of the given bound would help assess its usefulness
  • proof G 1.1: the proof only shows that $\theta_2 = 0$ is a possible solution, but not a unique solution, isn't it? (Didn't check the proofs too deeply though)
  • Eq. 34: the entire term does not depend on $\theta_2$, so it is unclear to me why $\theta_2 = 0$ is the solution to that equation. Wouldn't any arbitrary value be a valid solution then?
  • Eq. 35: in App. F.1 it is stated that the expectation of $\epsilon^e$ is 0. Looking at Eq. 35, won't all multiplicative terms entailing $\epsilon^e$ cancel out in the expectation and thus in the IRM loss? This would make the IRM loss independent of the environment in expectation (though not for a specific environment). Thus, I don't see how the conclusion is reached that the IRM loss depends on $\epsilon^e$ if, in expectation, it is 0.
  • line 995: which property are you referring to exactly?

Experiments

  • since the authors present an alignment-based invariant learning strategy for graph data, it would be good to see some more recent baselines taking a similar approach such as [1] and [2]
  • line 296: why does CIA-LRA improve upon CIA? To me it is counter-intuitive since CIA-LRA has no access to the environment labels, thus I'd expect this task to be harder

General notes

  • in the Introduction there should be more on related methods based on representation alignment such as [1] and [2]

References

[1] Shift-Robust GNNs: Overcoming the Limitations of Localized Graph Training Data. Zhu et al. NeurIPS 2021.

[2] Stable Prediction on Graphs with Agnostic Distribution Shifts. Zhang et al. KDD 2023.

Questions

  • how often did you repeat the experiments? (Maybe it's written somewhere and I didn't see it)

Limitations

  • major parts of the analysis are rather limited, as they rely on the definition of the GNN in Eq. 3, which is a linear model
  • a comparison to some more recent baselines would increase the value of the paper
Author Response

We thank Reviewer nx4e for your careful reading and detailed comments! We'd like to address your concerns in the following points:


Q1: $G$ in Fig. 1 is not mentioned in the text

A1: Sorry for the lack of clarity. $G$ refers to the graph data.

Q2: Limitation: the GNN defined in Eq. 3 is a linear model

A2: Justifications for analyzing a linear model: 1) Some recent works [8] [9] observed that linear GNNs achieve performance comparable to nonlinear GNNs, and [5] theoretically proved that SGC can outperform GCN under some mild assumptions. 2) Many recent works on the theoretical analysis of graphs/OOD generalization adopt linear networks ([4] [10] [11]). 3) Our theory matches the experimental results on the nonlinear GCN and GAT, where CIA outperforms IRM and VREx.

Q3: line 204: it is a bit counter-intuitive why invariant node features of nodes with the same class should significantly diverge when they are far apart in the graph.

A3: Although the invariant features of the same class are basically invariant across environments, they still show slight intra-class deviation (see Fig. 7-10). Within the same class, the invariant-feature difference between nodes that are farther apart is larger than that between nodes that are closer together. In Fig. 7/8, despite slight fluctuations of the curves, the invariant-feature difference shows a clear positive correlation with the distance from the starting point.

Q4: a brief definition/discussion of concept and covariate shift would be beneficial for clarity

A4: Due to the word limit, we invite you to refer to A1 in our response to reviewer yNy3.

Q5: in line 232: why is a sum operator employed here instead of a mean?

A5: We employ a 'sum' in line 232 because $r_i^c$ is the ratio of nodes of class $c$ in the $L$-hop neighborhood, which has already been normalized by the size of the neighborhood. It won't be affected by the graph size/connectivity.
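To make the normalization concrete, here is a minimal sketch of computing such neighborhood class ratios (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def neighborhood_class_ratios(adj, labels, num_classes, L):
    """Hypothetical sketch: r_i^c = fraction of class-c nodes in the L-hop
    neighborhood of node i. Each row of the result sums to 1, so r_i^c is
    already normalized by neighborhood size."""
    n = len(adj)
    hop = (adj > 0).astype(int) + np.eye(n, dtype=int)   # include the node itself
    reach = np.linalg.matrix_power(hop, L) > 0           # L-hop reachability
    ratios = np.zeros((n, num_classes))
    for i in range(n):
        neigh = np.flatnonzero(reach[i])
        counts = np.bincount(labels[neigh], minlength=num_classes)
        ratios[i] = counts / counts.sum()                # bounded in [0, 1]
    return ratios
```

Since each $r_i^c$ lies in $[0, 1]$, a sum of such ratios over classes stays bounded regardless of graph size or density.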

Q6: In Theorem 4.2 it is unclear what $\alpha$ is and what the inequality $0<\alpha<\frac{1}{4}$ means. Discussion on the goodness/tightness of the given bound?

A6: Sorry again for the lack of clarity. $\alpha$ is defined in Assumption 3 of [6]: "... Assume that there exists some $0<\alpha<\frac{1}{4}$ satisfying $\operatorname{Pr}_{h \sim P}\left(\mathcal{L}_m^{\gamma/4}(h)-\mathcal{L}_0^{\gamma/2}(h)>N_0^{-\alpha}+cK\epsilon_m \,\middle|\, T_h^L \epsilon_m>\frac{\gamma}{8}\right) \leq e^{-N_0^{2\alpha}}$". This assumption is needed for proving Lemma 6 in [6], which we rely on to prove Lemma G.11 in our paper.

  • Tightness of Theorem 4.2: when there are no distributional shifts in spurious node features and heterophilic neighborhood distribution between training and test environments, the terms (a)-(d) in Eq. 109 become zero, and the upper bound becomes $\widehat{\mathcal{L}}_{e^{\text{tr}}}^\gamma(\tilde{h})+\mathrm{const}$, where $\mathrm{const}=\frac{1}{N_0^{1-2\alpha}}+\frac{1}{N_0^{2\alpha}} \ln \frac{LC\left(2B_{e^{\text{te}}}\right)^{1/L}}{\gamma^{1/L}\delta}$, i.e., our bound is only larger than the ideal error $\widehat{\mathcal{L}}_{e^{\text{tr}}}^\gamma(\tilde{h})$ by the constant $\mathrm{const}$. When the number of training samples $N_0$ is large, $\mathrm{const}$ becomes small enough to be negligible. Hence, the tightness of our bound is guaranteed.

Q7: proof G 1.1: the proof only shows that $\theta_2=0$ is a possible solution, but not a unique solution, isn't it?

A7: No, $\theta_2=0$ is the unique solution. Please see lines 855-861 and line 868.

Q8: Eq. 34: the entire term does not depend on $\theta_2$, so it is unclear why $\theta_2=0$ is the solution to that equation.

A8: Eq. (34) is the result of plugging $\theta_2=0$ into the expression $\frac{\partial\mathbb{V}_e[R(e)]}{\partial \theta_2}$, which is why $\theta_2$ disappears.

Q9: won't all multiplicative terms entailing $\epsilon^e$ cancel out in the expectation?

A9: Although $\mathbb{E}_e[\epsilon^e]=0$, we have $\mathbb{E}_e[{\epsilon^e}^\top\epsilon^e]=N_e\sigma^2>0$, where $\sigma^2=\mathbb{V}_e[\epsilon^e_i],~i=1,\dots,N_e$. So the terms multiplied by ${\epsilon^e}^\top\epsilon^e$ won't cancel out.
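This is easy to verify numerically; the following quick Monte Carlo sketch is ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
N_e, sigma, trials = 100, 0.5, 20000

# Draw the noise vector eps^e of many environments: each entry has mean 0, variance sigma^2.
eps = rng.normal(0.0, sigma, size=(trials, N_e))

print(eps.mean())                      # ~0:  E_e[eps^e] = 0
print((eps * eps).sum(axis=1).mean())  # ~25: E_e[eps^T eps] = N_e * sigma^2 > 0
```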

Q10: line 995: which property are you referring to exactly?

A10: The Cauchy–Schwarz inequality: for two vectors $a$ and $b$ ($a\neq\lambda b$ for any scalar $\lambda$), we have $|a\cdot b|^2 \leq \|a\|^2\|b\|^2$.

Q11: it would be good to see some more recent baselines such as [1] and [2]

A11: We have already included [1] (SRGNN) in our main experiment in Table 2 of the original paper. We didn't compare with [2] since it requires the graphs of different environments to have the same nodes, but our datasets don't satisfy this. However, we added two more recent graph OOD methods, CIT ([7], suggested by reviewer LzeD) and CaNet ([12], suggested by reviewer empj), as baselines. Please see Table B of the rebuttal PDF. CIA-LRA outperforms CIT and CaNet on all splits.

Q12: line 296: why does CIA-LRA improve upon CIA? It is counter-intuitive since CIA-LRA has no access to the environment labels

A12: We also invite you to refer to A6 in our response to reviewer yNy3.

Q13: missing related works

A13: We've already discussed [1] in Appendix A.3. Here we discuss [2]. [2] proposed to align the aggregation weights of the same edge across environments ("locally stable learning") and align the cross-environment loss ("global stable learning"). There are some drawbacks: first, "locally stable learning" requires the graphs from different environments to have the same nodes and topological structures, which is impractical; second, "global stable learning" is essentially a VREx loss, which we proved to have failure cases in node-level OOD tasks.

Q14: repeating experiments & typos

A14: We repeat our experiments with three different random seeds. We have fixed all typos you mentioned.

Comment

Thank you for the detailed rebuttal!

Although I did not check A6 in detail, it explains where $\alpha$ comes from.

The rebuttal clarified all other points, and the results of the additional experiments look good.

Thus, I raised my score from 5 to 7.

Comment

Thank you for appreciating our work and raising the score! Thank you for your time, effort, and valuable feedback during the review process!

Review
Rating: 5

The paper investigates the failure cases of IRMv1 and VREx on graph data and develops a novel approach, CIA, for node-level OOD generalization. To adapt to scenarios without environment labels, the authors further propose CIA-LRA.

Strengths

  1. The paper is well-written and provides clear motivation.

  2. The paper is solid in both theoretical and empirical aspects.

  3. The evaluations are fair and convincing.

Weaknesses

  1. The author may want to clarify the reasoning behind the two SCMs corresponding to the two types of distribution shifts, as this connection is not entirely clear.

  2. In Equation 9, it should be $\phi_\theta(j)$.

  3. Assuming that the same label corresponds to the same invariant factors, it is unclear why $r_{\text{same}}$ is necessary for reweighting.

  4. It appears that CIA-LRA requires all labels for reweighting alignment. However, the typical node-level classification setting is semi-supervised, which limits the applicability of this method.

  5. Line 231: "to pairs with" should be corrected to "to pair with".

  6. Line 854: In the proof of the non-graph case, $(A^e)^k$ appears, which seems to be an oversight.

  7. The general idea of sampling nodes from the same target label and different environments is similar to CIGA [1], which performs supervised contrastive learning to identify invariant subgraphs. The objective in CIGA may be a specific form of $d(\cdot)$. The authors may want to compare and contrast their approach with CIGA.

References

[1] Chen et al., Learning Causally Invariant Representations for Out-of-Distribution Generalization on Graphs. NeurIPS 2022.

Questions

  1. CIA-LRA learns a masked subgraph; is the experiment inductive or transductive?

  2. In lines 318-321, why would CIA be less performant than CIA-LRA? This seems unreasonable, as CIA uses environment labels while CIA-LRA doesn't.

  3. $\theta_m$ is separated from $\theta$; how do you ensure it will learn the invariant subgraph, and under what assumptions?

Limitations

N/A

Author Response

We thank Reviewer yNy3 for your careful reading and detailed comments! We'd like to address your concerns in the following points:


Q1: The author may want to clarify the reasoning behind the two SCMs corresponding to the two types of distribution shifts, as this connection is not entirely clear

A1: Here we'd like to clarify the connection between the SCMs and the two types of distributional shifts.

Definitions of the two shifts:

Concept shift: $p_{Y|S}(y|s)$ changes across environments, while $p_X(x)$ is invariant across environments. Covariate shift: $p_X(x)$ changes across environments, while $p_{Y|S}(y|s)$ is invariant across environments.

Concept shift and Fig. 1(a): Suppose $X=f(C, S)$. In Fig. 1(a), $S$ is the direct cause of $Y$, and the mapping from $Y$ to $S$ is affected by the environment $E$. This means $p_{S|Y}(s|y)$ changes across environments. We now show that this makes $p_{Y|X}(y|x)$ vary with environments:

$$p_{Y|X}(y|x)=\frac{p_{X,Y}(x, y)}{p_X(x)}=\frac{\sum_{c, s:\, x=f(c, s)} p_{Y \mid C, S}(y \mid c, s) \cdot p_{C, S}(c, s)}{p_X(x)}.$$

Expanding the factor $p_{Y \mid C, S}(y \mid c, s)$ in the numerator, we have $p_{Y \mid C, S}(y \mid c, s)=\frac{p_{S|Y}(s|y)\,p_{C|Y}(c|y)\,p_Y(y)}{p_{C,S}(c,s)}$. Since $p_{S|Y}(s|y)$ varies with environments, we conclude that $p_{Y|X}(y|x)$ is also variant, which matches the definition of concept shift.

Covariate shift and Fig. 1(b): In Fig. 1(b), $S$ is independent of $Y$ and depends only on $E$, which means $p_{Y|X}(y|x)$ is invariant across environments. Additionally, $p_X(x)=\sum_{c,s}p_{X|C,S}(x|c,s)\,p_{C,S}(c, s)=\sum_{c,s:\,x=f(c,s)}p_{C,S}(c,s)=\sum_{c,s:\,x=f(c,s)}p_{C|S}(c|s)\,p_{S}(s)$. According to Fig. 1(b), $E$ is the direct cause of $S$, so $p_{S}(s)$ changes with environments. Hence, we conclude that $p_X(x)$ changes with environments, which matches the definition of covariate shift.
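To make this deduction concrete, the following toy enumeration (our own sketch with made-up probabilities, not from the paper) instantiates the SCM of Fig. 1(a) and shows $p_{Y|X}(y|x)$ shifting with the environment:

```python
import numpy as np

def p_y1_given_x(q_e, c, s):
    """Toy SCM in the spirit of Fig. 1(a): Y ~ Bernoulli(0.5); C = Y w.p. 0.9
    (invariant); S = Y w.p. q_e (environment-dependent); X = (C, S).
    Returns p(Y=1 | X=(c, s)) by exact enumeration."""
    joint = {}
    for y in (0, 1):
        for cc in (0, 1):
            for ss in (0, 1):
                p = 0.5                            # p(Y = y)
                p *= 0.9 if cc == y else 0.1       # p(C = cc | Y = y), invariant
                p *= q_e if ss == y else 1 - q_e   # p(S = ss | Y = y), varies with e
                joint[(y, cc, ss)] = p
    return joint[(1, c, s)] / (joint[(0, c, s)] + joint[(1, c, s)])

print(p_y1_given_x(0.8, 1, 1))  # environment 1 (q_e = 0.8): ~0.973
print(p_y1_given_x(0.2, 1, 1))  # environment 2 (q_e = 0.2): ~0.692
```

The same $x=(1,1)$ receives a different posterior in the two environments, which is exactly the concept shift described above.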

Q2: Assuming the same label corresponds to the same invariant factors, it is unclear why $r^{same}$ is necessary for reweighting.

A2: In practice, although the causal features of the same class are basically invariant across environments, they still show slight deviation among the samples within a class (shown in Fig. 7-10), and this intra-class discrepancy in invariant features has a positive correlation with the difference in same-class neighborhood label distribution (evidence in Fig. 15). Hence, we design $r^{same}$ to prevent the over-alignment that may cause the collapse of the invariant features. Table 4 and Fig. 2 reveal the positive role of $r^{same}$.

Note that the main purpose of this assumption is ease of theoretical analysis. However, the assumption is also acceptable in practice, supported by two pieces of evidence: 1) CIA improves over most non-graph-specific baselines; 2) in Table 4, merely removing $r^{same}$ from CIA-LRA (63.91) still outperforms IRM (61.14) and VREx (61.32).

Q3: the typical node-level classification setting is semi-supervised, which limits the applicability of this method.

A3: Our experiments are conducted in exactly such a semi-supervised setting, in which only 30%-40% of nodes are labeled. CIA-LRA operates only on the labeled data to gain improvements.

Q4: The authors may want to compare and contrast their approach with CIGA.

A4: Differences between CIA-LRA and CIGA:

  • CIGA cannot solve covariate shifts. CIGAv2 maximizes the mutual information between the estimated spurious subgraph $\hat{G}_s$ and the labels, $I(\hat{G}_s;Y)$, to avoid including part of $\hat{G}_s$ in the true invariant graph $G_c$. For covariate shifts, $G_s$ contains no information about $Y$, so $\max I(\hat{G}_s;Y)$ may not capture the correct $G_s$. Therefore, the estimate of the invariant subgraph $\hat{G}_c=G-\hat{G}_s$ will also be inaccurate.
  • CIA-LRA works for both shifts. Our strategy doesn't rely on the concept-shift assumption to minimize the OOD error, for two main reasons. First, under covariate shifts, CIA-LRA removes environment-related spurious node features when the selected pairs potentially come from different environments. Second, CIA-LRA can simultaneously reduce the error caused by structural shifts of the heterophilic neighborhood label distribution (term (c) in the error bound of Theorem 4.2 can be minimized by CIA-LRA regardless of the type of distributional shift).

Q5: CIA-LRA learns a masked subgraph; is the experiment inductive or transductive?

A5: All four datasets of our experiments are transductive.

Q6: In lines 318-321, why would CIA be less performant than CIA-LRA? It seems unreasonable, as CIA uses environment labels while CIA-LRA doesn't.

A6: The role of environment labels is to distinguish spurious features so that they can be eliminated by cross-environment alignment. However, even if spurious features are removed (Fig. 2 Right, CIA), learning a collapsed representation of invariant features can also hurt generalization (shown in Fig. 2 Left and Mid, CIA). Although CIA-LRA doesn't use environment labels, it can remove spurious features by intra-class alignment and the proposed $r^{diff}$ (see Fig. 2 Right, CIA-LRA, and terms (b) and (c) in Theorem 4.2). Moreover, it avoids the collapse of invariant features caused by over-alignment by aligning only local pairs and using $r^{same}$ (Fig. 2 Mid, CIA-LRA). Therefore, CIA-LRA outperforms CIA even without environment labels.
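For intuition, here is a minimal sketch of what such a cross-environment intra-class alignment term could look like (our own simplification; names are illustrative, and this is not the authors' released code):

```python
import torch

def cia_alignment_loss(h, y, env):
    """Minimal sketch: average squared distance between representations of
    same-class nodes drawn from different environments.
    h: [N, d] node representations; y: [N] class labels; env: [N] environment ids."""
    same_class = y.unsqueeze(0) == y.unsqueeze(1)    # [N, N] pair mask
    diff_env = env.unsqueeze(0) != env.unsqueeze(1)  # [N, N] pair mask
    mask = (same_class & diff_env).float()
    dist = torch.cdist(h, h, p=2) ** 2               # pairwise squared L2 distances
    return (dist * mask).sum() / mask.sum().clamp(min=1.0)
```

Driving this term to exactly zero for all pairs is precisely the over-alignment discussed above; CIA-LRA's localized pairing and $r^{same}$ reweighting temper it.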

Q7: $\theta_m$ is separated from $\theta$; how do you ensure it will learn the invariant subgraph, and under what assumptions?

A7: We are sorry, but we could not work out which parameter in Eq. (3) the "$\theta_m$" you mention refers to. We'd be happy to address any further questions.

Q8: typos

A8: Thank you for carefully identifying the typos! We have fixed all of them:

  • Eq. (9): $\phi(i)\rightarrow\phi(j)$
  • Line 231: "to pairs with" $\rightarrow$ "to pair with"
  • Line 854: removed ${\tilde{A}^e}^k$
Comment

Thank you for the response. While your reply addresses some of my questions, I have a few points of disagreement with certain aspects of your response.

SCM and the types of distribution shifts

  • For concept shift, the authors aim to show $p_X(x)$ is invariant across environments; however, $X$ depends on $S$, which is affected by $E$. Hence this claim appears to be invalid.

  • Similarly for covariate shift, $P(y|x)$ won't be invariant across environments, as $X$ depends on $E$.

Comparison with CIGA

Figure 1(a) is equivalent to PIIF in CIGA, and Figure 1(b) is equivalent to FIIF; hence CIGA can solve both situations in Figures 1(a) and 1(b).

Collapsed learning

I'm also not convinced that with environment labels, the learned representations will collapse. With environment labels, the method could facilitate better differentiation among various groups, enhancing alignment.

Similarity with previous work

I still think this approach is quite similar to CIGA, which is an adaptation for node-level tasks, with some reweighting scheme for a single graph.

Comment

Thank you for your further comments. We'd like to address your concerns as follows:

Q9: SCM and the types of distribution shifts

A9: Thank you for pointing out this careless mistake. The accurate claim should be: under concept shift, $p_C(c)$ is invariant; under covariate shift, $p_{Y|C}(y|c)$ is invariant. However, this misstatement doesn't affect the correctness of the derivation in A1, because the derivation doesn't rely on the assumption that a certain distribution is invariant across environments.

Q10: Comparison with CIGA

A10: This is not the case. Although PIIF is equivalent to Fig. 1(a) in our paper, FIIF is not equivalent to Fig. 1(b), because $C\rightarrow S$ exists in FIIF, causing $C$ to correlate with $S$; in Fig. 1(b), in contrast, $C$ and $S$ are independent. In summary, both FIIF and PIIF represent concept shifts where invariant features correlate with spurious ones, as highlighted in Section 2.2 of the CIGA paper [1]: "$S$ is directly controlled by $C$ in FIIF and indirectly controlled by $C$ through $Y$ in PIIF". However, our paper also considers the covariate shift, where $C$ and $S$ are independent.

Q11: Collapsed learning

A11: We kindly remind you that we are not claiming that "with environment labels, the learned representations will collapse"; we actually mean that excessive alignment will lead to the collapse of the invariant features. We use the intra-class variance of the representation dimensions corresponding to invariant features (averaged over all classes) to measure the degree of collapse of invariant features.

Based on this measurement, excessive alignment can be caused by:

  1. Using a $\lambda$ that is too large. Evidence: on the toy dataset of Fig. 2, a larger $\lambda$ leads to a smaller intra-class variance of invariant representations.

    Table D: intra-class variance of invariant representations at epoch 50

    | CIA | $\lambda=0.05$ | $\lambda=0.1$ | $\lambda=0.5$ |
    |---|---|---|---|
    | variance | 0.061 | 0.039 | 0.011 |
  2. Aligning the representations of too many nodes.

    Evidence 1: we add an experiment to show that aligning fewer node pairs can prevent the collapse of invariant representations, even with environment labels. Modifying CIA to align only local pairs (same-class, different-environment nodes within 1 hop), termed "CIA-local", the results in Table E show that by aligning local pairs instead of all pairs, CIA-local avoids the collapse that CIA suffers and achieves better performance.

    Table E: accuracy and variance of the invariant representations on the toy dataset of Fig. 2 at epoch 200

    |  | CIA | CIA-local |
    |---|---|---|
    | Concept, Acc. | 0.253 | 0.354 |
    | Concept, variance | 0.0003 | 0.2327 |
    | Covariate, Acc. | 0.250 | 0.312 |
    | Covariate, variance | 0.0002 | 0.1699 |

Q12: Similarity with previous work

A12: Additional differences between CIA-LRA and CIGA:

  • CIA-LRA solves both concept and covariate shift, but CIGA is only guaranteed to solve the concept shift (PIIF and FIIF). As we pointed out in A10 and A4, CIA-LRA can solve both concept (Theorems 3.1, 4.2) and covariate shift (Proposition B.3, Theorem 4.2). CIGA relies on the assumption that the spurious subgraph $G_s$ and the invariant subgraph $G_c$ "can share certain overlapped information about $Y$" to identify them, which means it only works well under concept shift. Their theoretical guarantee is under the PIIF and FIIF assumptions (Theorem 3.1 in [1]).

  • CIA-LRA utilizes fine-grained graph-specific features (the neighborhood label distribution, NLD) to better eliminate spurious features, whereas CIGA ignores such information and only considers loss-level regularization. CIA-LRA uses the NLD to identify node pairs with large discrepancies in spurious features and similar invariant features, so as to better eliminate the former and avoid the collapse of the latter. CIGA, in contrast, ignores such fine-grained information and solely minimizes the loss of using a spurious subgraph $\hat{G}_s$ to predict $Y$ ($\max I(\hat{G}_s;Y)$) while enforcing the loss of $\hat{G}_c$ to be smaller than that of $\hat{G}_s$.

  • CIA-LRA additionally considers the collapse of invariant features, while CIGA doesn't. We reveal the collapse of invariant features caused by excessive alignment (Fig. 2 Mid and Tables D, E in A11) and propose the localized alignment and $r^{same}$ to address it. CIGA focuses solely on identifying the invariant subgraph. Our work provides the new insight that excessive alignment can lead to the collapse of invariant representations and hurt generalization, which can help the community better understand the dual role of representation alignment in OOD generalization.

Comment

I appreciate the authors' great efforts in the response. However, I still have some concerns.

Q9

I understand the author’s explanation regarding concept shift and covariate shift. However, I believe that the SCM presented in Figure 1 merely represents an underlying data generation process, which may not directly correspond to the types of distribution shifts. This could be potentially misleading.

Q10,12

  • Although there is no edge from C to S in Figure 1b, according to the definition of FIIF, the condition $I(S;Y|C)=0$ still holds.
  • FIIF and PIIF represent data-generating processes, and therefore do not directly correspond to concept shift.
  • CIGA primarily addresses covariate shift, where the conditional distribution $P(Y|G)$, i.e., the underlying mechanism, remains unchanged, while $P(G)$ shifts due to spurious features.
  • I recognize the contribution of this work in addressing excessive alignment. However, the basic idea aligns with supervised contrastive learning, which is why I perceive similarities with previous studies like CIGA.
  • The utilization of techniques such as neighborhood label distribution is indeed novel, but they are not directly applicable to graph-level OOD. This is why I consider CIA-LRA to be an adaptation of supervised contrastive learning for node-level OOD.
  • Overall, I remain positive about this work due to its high-quality writing and rigor. However, I also have concerns about the level of novelty in the proposed method. Additionally, some design choices in the methodology appear heuristic (in Sec. 3.2), which also raises some concerns.
Comment

We thank reviewer yNy3 for carefully reading our response. We want to address your further concerns as follows:


Q13: However, I believe that the SCM presented in Fig. 1 merely represents an underlying data generation process, which may not directly correspond to the types of distribution shifts.

A13: Thanks for your advice! We acknowledge that the SCMs in Fig. 1 indeed represent underlying data generation processes, but our deduction in A1 has demonstrated that under the data generation process of Fig. 1(a)/(b), the concept/covariate shift will occur. The potential confusion could be that the current captions of Fig. 1 are 'concept shift' and 'covariate shift'. We'll modify them to "underlying data generation processes leading to concept/covariate shift" and add the derivation of A1 to our paper for clarity.

Q14: Although there is no edge from C to S in Figure 1b, according to the definition of FIIF, the condition $I(S;Y|C)=0$ still holds.

A14: We acknowledge that $I(S;Y|C)=0$ holds for both Fig. 1(b) in our paper and FIIF. This means $S\perp Y \mid C$ holds for both FIIF and our Fig. 1(b), but it doesn't mean $S\perp C$ holds for both of them. In fact, according to the SCM of FIIF, $S$ correlates with $C$, whereas they are independent in Fig. 1(b). Hence, Fig. 1(b) and FIIF are fundamentally different.

Q15: FIIF and PIIF represent data-generating processes and do not directly correspond to concept shift.

A15: FIIF and PIIF are indeed data-generating processes, but we can still deduce that concept shift occurs under such data-generating processes.

FIIF: under FIIF, $S=f_{spu}(C, E)$, so $p_{S|C}(s|c)$ changes across environments. Following the derivation in A1, to prove that $p_{Y|X}(y|x)$ changes with the environment (i.e., concept shift happens), we only need to prove that $p_{Y|C,S}(y|c,s)$ is variant. $p_{Y | C, S}(y | c, s)=\frac{p_{S|C}(s|c)\,p_{Y|C}(y|c)\,p_C(c)}{p_{C,S}(c,s)}$ contains $p_{S|C}(s|c)$ in its numerator, so it also changes with environments.

PIIF: The structure of PIIF is exactly the same as Fig. 1(a), so the analysis in A1 can be directly applied to PIIF to show that it encounters a concept shift.

Q16: CIGA primarily addresses covariate shift, where the conditional distribution $P(Y|G)$, i.e., the underlying mechanism, remains unchanged.

A16: From the analysis in A15 we can infer that $p_{Y|X}(y|x)$ (which is equivalent to the $P(Y|G)$ you refer to here) changes with environments under the FIIF and PIIF assumptions considered by CIGA, so CIGA is not guaranteed to solve covariate shift.

Q17: However, the basic idea aligns with supervised contrastive learning, which is why I perceive similarities with previous studies like CIGA. The utilization of techniques such as neighborhood label distribution is indeed novel, but they are not directly applicable to graph-level OOD. This is why I consider CIA-LRA to be an adaptation of supervised contrastive learning for node-level OOD.

A17: Although the representation-alignment methodology of CIA-LRA is somewhat similar to the supervised contrastive loss of CIGA, adapting such an idea to the node-level task is non-trivial: there may not exist a subgraph containing information about $Y$ for each node, so $\max I(\hat{G}_s;Y)$ may not identify the spurious subgraph. Therefore, we propose to utilize the neighborhood label distribution to better distinguish spurious features and eliminate them. In this work, we focus on the challenging node-level OOD task, and we aim to improve graph-level OOD generalization in the future.

Comment

Dear reviewer yNy3, we note that you have lowered your score from 6 to 5. If you have any questions please let us know and we will be happy to answer them!

Comment

Thanks for the authors' efforts in constructing the response. Some of my concerns cannot be addressed in the rebuttal stage, such as:

  1. In Q15, under FIIF, $I(S;Y|C)=0$, which implies that $P(Y|C,S)=P(Y|C)$ when conditioning on $C$.

  2. The claim that the approach would solve concept shift is also problematic. Most invariant learning methods since IRM assume a stable relation between semantic objects and $Y$; hence they aim to solve covariate shift, not concept shift. One example would be that a cow in a picture is always mapped to $Y=1$, hence it is invariant; under concept shift, by contrast, a cow could be mapped to a different label, indicating that there are no stable features; instead, the underlying response mechanism changes, rather than the spurious features. The claim that this approach, which is a refinement of supervised contrastive learning, can solve concept shift is not convincing.

  3. Some designs in the approach remain heuristic.

For the above reasons, I won't raise my score, and won't lower my score given the strength of this work. However, I strongly suggest that the authors reconsider the claim that CIA would solve concept shifts, which is my biggest concern for this work.

Comment

We sincerely thank reviewer yNy3 for the effort and time during the review and discussion! We are happy to continue clarifying the following issues:


Q18: In Q15, under FIIF, $I(S;Y|C)=0$, which implies that $P(Y|C,S)=P(Y|C)$ when conditioning on $C$.

A18: We acknowledge that $I(S;Y|C)=0$ holds and that it leads to $P(Y|C,S)=P(Y|C)$ in both FIIF and Fig. 1(b). However, this doesn't affect the fact that $C\rightarrow S$ holds in FIIF. Since $C\rightarrow S$ and $E\rightarrow S$, the spurious features $S$ are determined by the causal features $C$, and this mapping from $C$ to $S$ changes across environments (this can also be seen from Assumption 2.2 in CIGA: $S:=f_{spu}(C, E)$). Therefore, in FIIF a specific causal feature will correlate with (for example, emerge simultaneously with) a specific spurious feature, and this correlation changes across environments.

However, in Fig. 1(b) of our paper, $C$ and $S$ are independent. As a result, FIIF and Fig. 1(b) represent different kinds of distribution shifts.

Q19: The claim that the approach would solve concept shift is also problematic.

A19: Sorry, it seems we may have had some misunderstanding regarding the definitions of the two types of distributional shifts. For concept shift and covariate shift, we follow the definitions commonly used in the OOD community (please refer to Sections 3.1 and 3.2 of [13]).

At a high level, covariate shift refers to a situation where the spurious features have different support sets in the training and test environments, and there is no association between $Y$ and the spurious features. For example, in a binary classification problem involving cows and sheep, where the training environment has a grass background and the test environment has a desert background, the probability of cows and sheep appearing in both environments is 50%, i.e., there is no correlation between the classes and the background.

Concept shift, on the other hand, refers to a situation where there is an association between $Y$ and the spurious features, and this association changes with the environment, while the spurious features have the same support set in the training and test environments. For instance, suppose there are two training environments, one with a grass background and the other with a desert background. The probability of cows appearing against the grass and desert backgrounds is 80% and 20%, respectively, and for sheep it is 20% and 80%. In the test environment, the probability of cows and sheep appearing against both backgrounds is 50%, meaning that the spurious association between background and class changes at test time.

According to the above definitions, IRM is capable of addressing concept shift (as seen in the theoretical and experimental results of the IRM paper). Moreover, our Eq. (2) also aligns with the aforementioned definition of concept shift: there exists a stable correlation between $Y$ and the invariant features $X_1$, and there exist spurious correlations between $Y$ and the spurious features $X_2$. We theoretically proved that CIA can solve this type of shift in Theorem 3.1.

[13] GOOD: A Graph Out-of-Distribution Benchmark, NeurIPS 2022

Review
Rating: 5

This paper addresses node-level OOD generalization for graph neural networks. The authors theoretically analyze why popular invariant learning methods like IRM and VREx fail on graph data, and propose two novel solutions: CIA and its environment-label-free variant, CIA-LRA. These methods enforce intra-class similarity to improve OOD generalization. The authors provide theoretical guarantees, including an OOD generalization error bound under the PAC-Bayesian framework. Experimental results on graph OOD benchmarks demonstrate the effectiveness of the proposed methods, particularly CIA-LRA, in improving node-level OOD generalization performance.

Strengths

  1. The paper provides a comprehensive theoretical investigation into why existing invariant learning methods struggle with graph data, supported by formal proofs.
  2. The proposed CIA and CIA-LRA methods offer innovative approaches to tackle node-level OOD generalization, addressing limitations of previous methods.
  3. The authors conduct comprehensive experiments on multiple datasets and compare against various baselines, demonstrating consistent performance improvements.

Weaknesses

  1. Limited comparison to recent node-level OOD method [1].
  2. Lack of visualization of the overall method architecture to help readers understand intuitively.
  3. While Appendix J briefly mentions GPU requirements, a more detailed analysis of the computational complexity and runtime comparisons of CIA and CIA-LRA against baselines would be valuable.
  4. Insufficient citations of recent methods, such as [2] [3].

[1] Learning invariant representations of graph neural networks via cluster generalization

[2] Joint Learning of Label and Environment Causal Independence for Graph Out-of-Distribution Generalization

[3] Individual and structural graph information bottlenecks for out-of-distribution generalization

Questions

N/A

Limitations

N/A

Author Response

We thank Reviewer LzeD for your careful reading and detailed comments! We'd like to address your concerns in the following points:


Q1: Limited comparison to recent node-level OOD method [1].

A1: We've added the evaluation of CIT [1]; the results are in Table B of the rebuttal PDF. We use the default hyperparameter setting for CIT from the code provided by the authors of [1]. CIA-LRA outperforms CIT on all splits.

Q2: Lack of visualization of the overall method architecture to help readers understand intuitively.

A2: We've added a figure to better illustrate the overall framework of CIA-LRA; please refer to Fig. A in the rebuttal PDF file.

Q3: a more detailed analysis of the computational complexity and runtime comparisons of CIA and CIA-LRA against baselines would be valuable.

A3: We show the time cost to reach the best test accuracy on our largest dataset, Arxiv (with 50k-60k nodes), in Table C below. On smaller datasets (Cora, WebKB, CBAS), the running-time gap is small and negligible (all methods, including CIA-LRA, take less than 60s to reach their best performance). Table C shows that the time costs of CIA-LRA and CIA are comparable to the baseline methods.

Table C: Time cost (seconds) to achieve optimal test performance on Arxiv using GAT on a single RTX 3090 GPU.

|  | ERM | Coral | Mixup | EERM | SRGNN | GTrans | CIT | CIA | CIA-LRA (hop=6) |
|---|---|---|---|---|---|---|---|---|---|
| Arxiv degree covariate | 74 | 551 | 758 | OOM | 34887 | OOM | OOM | 1409 | 1248 |
| Arxiv degree concept | 30 | 360 | 747 | OOM | 3960 | OOM | OOM | 155 | 1132 |
| Arxiv time covariate | 46 | 246 | 1207 | OOM | 1993 | OOM | OOM | 1428 | 292 |
| Arxiv time concept | 440 | 148 | 1272 | OOM | 11628 | OOM | OOM | 221 | 989 |

Q4: Insufficient citations of recent methods, such as [2] [3].

A4: Thank you for pointing out the missing related works.

[2] addresses graph-level OOD generalization. Their task setting differs from ours in two aspects: 1) they focus on covariate shift, so they enforce $Y\perp G_s$, which doesn't hold for concept shifts; our proposed CIA and CIA-LRA, however, work for both covariate and concept shifts. 2) They assume the availability of environment labels, while our work tackles invariant learning without environment labels.

[3] addresses both graph-level and node-level OOD generalization. They propose to minimize the mutual information (MI) between the input graph and its representations to eliminate spurious features, and to maximize the MI between the representations and the corresponding labels to learn invariant features. However, their method lacks a rigorous analysis of how it achieves these two goals. Additionally, their setting also differs from ours: [3] assumes the availability of multiple graphs, while our work focuses on learning on a single graph.

Comment

Dear reviewer LzeD,

I hope this message finds you well. As the discussion deadline is approaching in a few hours, I kindly wanted to remind you of the pending feedback for our submission. We have tried our best to address the concerns and revised our paper accordingly. Your insights are invaluable to us, and we eagerly await your comments to improve our work further.

Thank you for your time and consideration!

Review
Rating: 6

This paper analyzes the failures of standard invariant learning techniques for node classification. Through theoretical analysis, they argue that methods such as IRM and VREx will learn spurious features on graph data. Using this as motivation, they design a new method, CIA, that introduces additional invariance regularization to better guide the GNN to learn better representations. They verify the effectiveness of their method on multiple benchmark datasets.

Strengths

  1. I think the authors do a good job of showing the failures of methods such as IRM and VREx on graph data. In particular, the theoretical analysis is interesting.

  2. The proposed method is straightforward and follows naturally from the theoretical analysis. Furthermore, the theoretical justification in Section 4 is welcome.

  3. The performance of the proposed method is good compared to baselines.

Weaknesses

  1. The authors seem to leave out some baselines in their experiments. For example, see [Wu et al., 2024] and [Li et al., 2023b]. From Footnote 1 at the bottom of page 2, I surmise that this is because they are not "general solutions". However, I find this to be a weak reason for not including them as baselines. Why should that be a disqualifying measure? The authors should better motivate which baselines they compare against and which they don't.

  2. Similarly, the authors should include the performance when using MatchDG. In the paper, the authors note the similarity of their method to MatchDG. As such, it stands to reason that it should be included in the experiments as a baseline. If not, the authors should provide sufficient reasons for not including it.

  3. There's a lack of discussion on the efficiency (time complexity) of the method. This is important, as the authors note that the performance of CIA-LRA peaks at around 6-10 hops (see line 674). I'm curious how efficient it is to consider such a large number of hops, especially on larger graphs. Naively, this seems like it could be a drawback of applying this method to larger graphs.

Questions

  1. What's IRMv1? You begin by mentioning the failure of IRM but then start calling it IRMv1. What's the difference?
  2. On line 135, the authors mention that the GNN used in their analysis resembles SGC. I'm curious how this affects the analysis. How different would the conclusions be if we considered a more standard GNN like GCN? To be clear, I'm not asking for a full analysis here or anything; I'm just curious about the practical limitations of considering SGC in the analysis.
  3. Could you share the time complexity of your method?

Limitations

Yes.

Author Response

We thank Reviewer empj for your careful reading and detailed comments! We'd like to address your concerns in the following points:


Q1: The authors seem to leave out some baselines in their experiments. The authors should better motivate which baselines they compare against and which they don't.

A1: As you advised, we've added experiments with CaNet [Wu et al., 2024]. We didn't compare with [Li et al., 2023b] since their method requires the graphs of different environments to have the same nodes and edges so that their proposed locally stable regularizer can be applied; however, the datasets used in our experiments don't have this property. To include more baselines, we further implemented CIT ([7], NeurIPS 2023, suggested by reviewer LzeD), a recent node-level graph OOD method. The results are included in Table B of the global author rebuttal PDF. We use the default hyperparameters for CaNet and CIT. We observe that CIA-LRA outperforms CIT and CaNet on all splits.

Q2: As such, it reasons that MatchDG should be included in the experiments as a baseline.

A2: We've added the experiments with MatchDG in Table B of the rebuttal PDF. The results show that: 1) MatchDG outperforms IRM and VREx on 12 out of 16 splits; 2) MatchDG underperforms CIA on average (averaged over the 16 splits excluding Arxiv: CIA 57.56, MatchDG 56.73), and MatchDG ran out of memory on Arxiv. We emphasize that although CIA is similar to MatchDG, we are the first to extend the idea of MatchDG to the node-level OOD task, providing a theoretical characterization of CIA's working mechanism on graphs (Theorem 3.1) and revealing its superiority in node-level OOD scenarios. More importantly, our main contribution is further extending CIA to scenarios where environment labels are unavailable by proposing CIA-LRA, which shows significant empirical gains (Table B) with theoretical guarantees (Theorem 4.2).

Q3: There's a lack of discussion on the efficiency (time complexity) of the method.

A3: To show the running time of CIA-LRA, we show the time cost to reach the best test accuracy on our largest dataset Arxiv (with 50k~60k nodes). The results are in Table C below. The time cost of CIA-LRA is comparable to baseline methods.

Table C: Time cost (seconds) to achieve optimal test performance on Arxiv using GAT on a single RTX 3090 GPU.

|  | ERM | Coral | Mixup | EERM | SRGNN | GTrans | CIT | CIA-LRA (hop=6) |
|---|---|---|---|---|---|---|---|---|
| Arxiv degree covariate | 74 | 551 | 758 | OOM | 34887 | OOM | OOM | 1248 |
| Arxiv degree concept | 30 | 360 | 747 | OOM | 3960 | OOM | OOM | 1132 |
| Arxiv time covariate | 46 | 246 | 1207 | OOM | 1993 | OOM | OOM | 292 |
| Arxiv time concept | 440 | 148 | 1272 | OOM | 11628 | OOM | OOM | 989 |

Q4: What's IRMv1? You begin by mentioning the failure of IRM but then start calling it IRMv1

A4: Sorry for the lack of clarity. IRMv1 is the practical implementation of the original, challenging IRM objective, proposed by the authors of IRM themselves. IRMv1 relaxes the bi-level optimization constraint of IRM, that $w$ should be the optimal classifier in all training environments, into a gradient penalty term:

$$\text{(IRM)}\quad \min_{w, \phi}\ \mathbb{E}_e\left[\mathcal{L}\left(w \circ \phi\left(X^e\right), Y^e\right)\right] \quad \text{s.t.}\ w\in \arg\min_{\bar{w}}\ \mathcal{L}\left(\bar{w} \circ \phi\left(X^e\right), Y^e\right) \ \ \text{for all}\ e \in \mathcal{E}^{tr}$$

$$\text{(IRMv1)}\quad \min_{\phi}\ \mathbb{E}_e\left[\mathcal{L}\left(w \circ \phi\left(X^e\right), Y^e\right)+\beta\left\|\nabla_{w \mid w=1.0}\,\mathcal{L}\left(w \circ \phi\left(X^e\right), Y^e\right)\right\|_2^2\right]$$
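For completeness, the IRMv1 gradient penalty is commonly implemented along the following lines (a generic PyTorch sketch in the spirit of the original IRM code release, not code from this paper):

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, y):
    """Squared gradient of the risk w.r.t. a scalar dummy classifier w fixed at 1.0."""
    w = torch.ones(1, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * w, y)
    grad = torch.autograd.grad(loss, [w], create_graph=True)[0]
    return (grad ** 2).sum()

# Overall objective (schematic): mean over environments e of
#   risk_e + beta * irmv1_penalty(logits_e, y_e)
```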

Q5: How different would the conclusions be if we considered a more standard GNN like GCN?

A5: When we consider a GCN rather than an SGC, the main difference is that a weight matrix is applied at each layer rather than just at the final layer. Still, considering the simple 2-dimensional data case, for a GCN (without activation functions) we can write the inference of one layer as:

$$\left(\begin{array}{cc} H_1^{(l)} & H_2^{(l)} \end{array}\right)= \bar{A} \left(\begin{array}{cc} H_1^{(l-1)} & H_2^{(l-1)} \end{array}\right) \left(\begin{array}{c} \theta_1^{(l)}\\ \theta_2^{(l)} \end{array}\right)$$

where $H_1^{(l)}, H_2^{(l)}\in\mathbb{R}^{N\times1}$ follow the definition in Eq. (3) of the original paper, and $\theta_1^{(l)}, \theta_2^{(l)}\in \mathbb{R}^{1\times2}$. For the GCN, the first-layer weights $\theta_1^{(1)}$ and $\theta_2^{(1)}$ correspond specifically to the invariant (first dimension of $x$) and spurious (second dimension of $x$) features, respectively. From the second layer upwards ($l\geq 2$), both dimensions of the representation at layer $l$ are linear combinations of the two dimensions of the representation at layer $l-1$. Hence, the GCN removes spurious features only when the first-layer parameter $\theta_2^{(1)}=0$. In conclusion, we only need to focus on the first layer to analyze whether the GCN learns spurious node features. This resembles the case of SGC, where we only need to focus on the weights at the last layer. So the conclusions for SGC can easily be extended to the GCN.
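A tiny numerical check of this argument (our own sketch, not from the paper): in a linear two-layer GCN, zeroing the first-layer weight on the spurious dimension makes the output invariant to that dimension, regardless of the later layers.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
A_bar = np.ones((N, N)) / N                  # toy normalized adjacency
X = rng.normal(size=(N, 2))                  # column 0: invariant, column 1: spurious
Theta1 = np.array([[1.0, 0.5],
                   [0.0, 0.0]])              # theta_2^{(1)} = 0: spurious dim blocked
Theta2 = rng.normal(size=(2, 2))             # arbitrary second-layer weights

out = A_bar @ (A_bar @ X @ Theta1) @ Theta2

X_shift = X.copy()
X_shift[:, 1] += 10.0                        # perturb only the spurious features
out_shift = A_bar @ (A_bar @ X_shift @ Theta1) @ Theta2
print(np.allclose(out, out_shift))           # True: output ignores the spurious dim
```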

Comment

| Additional Baselines + MatchDG

Thanks for the new experiments. These definitely enhance the results in the paper.

| To show the running time of CIA-LRA, we show the time cost

Interesting, thank you.

| IRMv1 refers to the practical implementation of the original challenging IRM objective

I appreciate the clarification and formulation.

| When we consider a GCN rather than an SGC...

Thanks for the discussion. If you have time, I think it would be worth the effort to include these results in the appendix.

I appreciate the detailed response. I've raised my score to a 6.

Comment

Thank you for raising the score! We'll add all new results from the rebuttal to our paper. Thank you again for the effort you spent and your constructive comments!

Author Response

We list the references of our rebuttal here (We start numbering from [4] to avoid conflicts with the numbering used by some reviewers):

[4] Spurious Feature Diversification Improves Out-of-distribution Generalization (ICLR 2024)

[5] Towards Understanding Generalization of Graph Neural Networks (ICML 2023)

[6] Subgroup Generalization and Fairness of Graph Neural Networks (NeurIPS 2021)

[7] Learning invariant representations of graph neural networks via cluster generalization (NeurIPS 2023)

[8] Simple spectral graph convolution (ICLR 2021)

[9] Dissecting the diffusion process in linear graph convolutional networks. (NeurIPS 2021)

[10] A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks (ICLR 2023)

[11] Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All? (NeurIPS 2023)

[12] Graph Out-of-Distribution Generalization via Causal Intervention (WWW 2024)

The contents of the rebuttal PDF:

  • add an illustration figure of the proposed CIA-LRA framework
  • add MatchDG (suggested by reviewer empj), CIT ([7], suggested by reviewer LzeD), and CaNet ([12], suggested by reviewer empj) as new baselines.
Final Decision

This paper studies the graph OOD problem and proposes a Cross-environment Intra-class Alignment (CIA) method to learn invariant representations. The paper provides a theoretical analysis of the failure of IRM and VREx on graph data, and the proposed method is derived from the theoretical results. Extensive results clearly validate the effectiveness of the proposed method on node-level graph OOD tasks. Most of the reviewers' concerns have been well addressed through the rebuttal. I believe this manuscript is ready for publication.