PaperHub
6.4/10 · Poster · 4 reviewers
Ratings: 2, 5, 4, 5 (min 2, max 5, std 1.2)
Confidence: 3.3
Novelty: 2.8 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.0
NeurIPS 2025

Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

We propose Equivariance by Contrast (EbC) to learn equivariant embeddings from observation pairs $(\mathbf{y}, g \cdot \mathbf{y})$, where $g$ is drawn from a finite group acting on the data. Our method jointly learns a latent space and a group representation in which group actions correspond to invertible linear maps—without relying on group-specific inductive biases. We validate our approach on the infinite dSprites dataset with structured transformations defined by the finite group $G:= (R_m \times \mathbb{Z}_n \times \mathbb{Z}_n)$, combining discrete rotations and periodic translations. The resulting embeddings exhibit high-fidelity equivariance, with group operations faithfully reproduced in latent space. On synthetic data, we further validate the approach on the non-abelian orthogonal group $O(n)$ and the general linear group $GL(n)$. We also provide a theoretical proof for identifiability. While broad evaluation across diverse group types on real-world data remains future work, our results constitute the first successful demonstration of general-purpose encoder-only equivariant learning from group action observations alone, including non-trivial non-abelian groups and a product group motivated by modeling affine equivariances in computer vision.
Keywords
equivariance, identifiability, representation learning, group theory

Reviews and Discussion

Review (Rating: 2)

The paper presents a framework for learning embeddings that are equivariant with respect to the action of a finite group on the data. This is achieved by leveraging observed pairs consisting of a data point and its transformation under a fixed group element.

Strengths and Weaknesses

Strengths. The supplementary material is provided and includes both the theoretical proofs and additional experimental details for the conducted studies. However, at this stage, I am unable to offer a substantive evaluation of the strengths of the work, due to the concerns outlined in the Weaknesses section below.

Weaknesses.

1. The paper contains numerous typographical errors, unclear notations, and ambiguous conceptual explanations, which significantly reduce the overall clarity and hinder the readability of the work. It is essential that the authors carefully address these issues in a revised version. Below, I list several examples from the first two sections - though this list is not exhaustive - and many more issues are readily apparent in the experimental sections.

  • Line 5-7: The notation $R_m, \mathbb{Z}_n$ is not clear. What are periodic translations?
  • Line 9-10: non-Abelian -> nonabelian or non-abelian
  • Line 10: generalized linear group -> general linear group
  • Line 18: allows to study -> allows us to study
  • Line 19-22: Only when a group acts on a vector space (of finite dimensions), we have the group representation.
  • The caption of Figure 2 is non-informative since it has too many typos in crucial equations. What does "y'_i = gy", or "x_i = R(g)x_i" mean?
  • Line 21, 62, 64, etc.: \mapsto -> \rightarrow
  • Line 34: questions -> question
  • Line 38: perturbations[4] -> perturbations [4]
  • Line 58-59: "which are grouped 59 into n + 1 pairs undergoing": What does "pairs" mean here?
  • Line 66: Information of -> Information about
  • Line 66: what is "u"? Is it "c"?
  • Line 70: process Assume -> process. Assume
  • Line 77: the second \phi(y_i') should be \phi'(y_i')
  • Line 79: What is R(\Phi, \Phi')? The paper only defines R(g) or \hat{R}.
  • Equation (3): What is S here?
  • Line 84: where used -> where we use
  • Line 97: What is Eq. (X)?
  • Line 98: GL should not be in italic.
  • Line 99: a -> an
  • Line 112: Does "Assume that p_\phi = p_{id}" mean you assume that \phi is the identity map?
  • Line 114: I am not sure why the original vector relates to the equation right after.

2. The discussion of related work is insufficiently thorough and lacks critical engagement with relevant prior literature.

3. The experiments are conducted solely on the dSprites dataset. Given that the authors reference a wide range of real-world applications - including biology, neuroscience, drug discovery, and computer vision - the paper should include at least one task or dataset from these domains to substantiate these claims.

Questions

The current state of the paper makes it very difficult to evaluate the work effectively. I understand the time and content constraints of the rebuttal phase; however, I believe it is both appropriate and necessary to request that the authors revise Section 2 - which outlines the proposed problem and methodology - as this revision is essential for enabling a meaningful re-evaluation of the paper.

It is important to note that this does not imply that only Section 2 requires revision. Rather, gaining a clearer understanding of the proposed method would facilitate a more informed and constructive assessment of the rest of the paper.

Limitations

.

Final Justification

I remain concerned about the numerous typographical errors and the lack of clarity in several parts of the manuscript. Many of the clarifications provided during the rebuttal phase should be integrated directly into the paper to enhance its readability and comprehensiveness. These unresolved issues continue to detract from the overall presentation. Although I have revisited my evaluation in light of the improvements, my recommendation still leans toward rejection at this point.

Formatting Issues

There is no issue with paper formatting.

Author Response

Dear reviewer, thanks a lot for your critical assessment of our work which raised several important points.

1. Typos and Questions.

L 5-7, periodic translations: We mean translations with periodic boundaries.

Figure 2 caption: What does $y'_i = g y$, or $x_i = R(g) x_i$ mean?

Apologies for the confusion, the new caption reads: $n+1$ samples are transformed using the same group action $g$, yielding samples $y$ and $y'$ in observation space. An encoder $\phi$ maps these samples into latent space ($\hat{x}, \hat{x}'$), where $n$ samples are used to estimate a linear representation $R$ of the action. The minimizer of the contrastive loss ensures that $\hat{x}_i' = R \hat{x}_i$.

L 58-59: What does "pairs" mean here?

Consider two data points with latent representations $(x, x')$ such that $x' = g x$; we observe $(f(x), f(x'))$. We leverage $n+1$ such pairs to jointly discover the latent space and a representation of the group. Note, this is directly defined in the definition following the sentence.

L 66: u/c

Typo: we mean $c$ as used in Eq. 1 -- apologies.

L 79: What is $R(\Phi, \Phi')$?

Typo: we mean $\hat R(\Phi, \Phi')$, as defined in Eq. 2.

Eq (3): What is $S$?

Negative examples as implicitly described in L. 89; we added a sentence. These are uniformly drawn from the dataset.

L 97: Eq. (X)?

Eq. 4

Line 112: Does "Assume that $p_\phi = p_{id}$" mean you assume that $\phi$ is the identity map?

Yes, exactly, but this is a typo that slipped through from an earlier formulation. Given Eq 3, this should read "Assume that $p_{\phi} = p_{f^{-1}}$ [...]". Our goal is that if $\phi \circ f$ recovers a feature space in which we match the ground truth distribution, $\phi \circ f$ becomes linear (see Thm 1 in response to Reviewer bGkb).

Line 114: I am not sure why the original vector relates to the equation right after.

The original vector $x$ is mapped to observed data $f(x)$ and then mapped back to a vector $\hat{x} = \phi(f(x))$. EbC ensures that $\hat{x} = L x$, even though $f$ can be an arbitrary non-linear function (as long as it is bijective).

To your question about Section 2, we would be happy to discuss more specifics of the method. We believe that beyond the typos, the level of detail in Sec. 2 is self-sufficient. Could you let us know which parts of the Methods need clarification?

2. Related work

We agree, and substantially extended our review of related work. See below.

3. Real-world data

Great suggestion. We added a real-world use case from neuroscience, where a rat is running on a linear track and hippocampal place cells are recorded. Please see our reply to reviewer zh3E.

Extended discussion of related work

Equivariant representation learning has been a prominent topic in ML research from different angles. Below we denote observed data as $y \in Y$, with $y' = g \circ y$ as short-hand for the transformed data. Embeddings of the data are denoted as $x, x' \in X$, where $X$ is a vector space. They are related via the representation $R: G \rightarrow GL(n, X)$ via $x' = g \cdot x = R(g) x$. We focus on methods for learning vector embeddings $x$ of the data $y$ which are equivariant and/or invariant to group actions $g$. In contrast, orthogonal related work focuses on learning representations of the group given $X$, $Y$ (Yang et al., 2023; Laird et al., 2024). Another rich area of literature focuses on neural network architectures which are invariant/equivariant to specific predefined groups (Cohen et al., 2016; Satorras et al., 2021; Kondor et al., 2018; Finzi et al., 2020; Dehmamy et al., 2021).

Non-Invariant Methods. Invariance to all possible group actions may hinder downstream tasks (e.g. color-invariance for flower-type classification). Xiao et al. (2021) propose a contrastive learning framework to learn subspaces invariant to different types of augmentation. Eastwood et al. (2023) extend this by introducing a second entropy loss to encourage disentangled subspaces.

Equivariant & Invariant Representation with known group action $g$ (explicitly observing $g$ in a parameterized form). E-SSL (Dangovski et al., 2021) uses contrastive learning to embed a reference sample $y$ and multiple transformed samples $y_i' = g_i \circ y$ into a latent space $x$, $x_i'$, which is split into an invariant and an equivariant subspace. Invariant parts are learned via SimCLR (Chen et al., 2020) and equivariant parts through an auxiliary task that predicts the parametrized group action $g_i$ from $x_i'$. EquiMod (Devillers et al., 2022) achieves equivariance by embedding pairs of $y$, $g \circ y$ and modeling the group action via a neural network $u_\psi(x, g) = \hat{x}'$ to approximate $x'$. Park et al. (2022) leverage G-equivariant neural networks. Garrido et al. (2023) propose a variant using non-contrastive SSL losses.

Auto-encoder approaches form an alternative: Qi et al. (2022) encode pairs $y, y'$ into latent space and decode $g$ from the concatenated embeddings $x, x'$. Jin et al. (2024) embed $y$ into latent space, model the group action as $x' = R(g) x$, and decode $y'$ from the predicted $x'$, assuming full knowledge of $R(g)$. Keurti et al. (2023) use a linear prediction $x' = R(g) x$ where $R(g)$ is predicted by a neural network from the observed and parametrized $g$, essentially an auto-encoding variant of Garrido et al., 2023.

Equivariant & Invariant Representation with unknown group action $g$ (implicitly observing $g$). Although the aforementioned approaches share conceptual similarity in their goal of learning invariant and equivariant embeddings of data by modeling linear relations in latent space, EbC does not assume knowledge of the parametrized form of group actions $g$. Instead of observing information about $g$ directly, a second class of methods assumes that two or more pairs of data share the same underlying action, $y_i$, $g \circ y_i$. This is a key assumption we also require for EbC.

Encoder-only methods include CARE (Gupta et al., 2023), a contrastive learning framework to learn invariant and locally equivariant representations. CARE encodes two pairs of observations $(y_1, g \circ y_1)$ and $(y_2, g \circ y_2)$ with the same action $g$ such that the embeddings are related by the same matrix $R_g \in O(d)$ via $x_1' = R_g x_1$ and $x_2' = R_g x_2$. CARE extends the InfoNCE loss with an equivariant term based on $x_1^T x_1'$ and $x_2^T x_2'$. The invariant loss term of InfoNCE and the new equivariant loss are weighted and applied to the full embedding space. Yerxa et al. (2024) propose a variation of CARE in which the embedding space is split into an invariant and an equivariant subspace, and group actions are encoded as $x_1' = x_1 + b_g$ and $x_2' = x_2 + b_g$. STL (Yu et al., 2024) learns representations of $y, y'$ such that $x, x'$ are equivariant to the group action $g$, and additionally learns a representation of $g$ itself, again from two pairs of data. Similarly to EquiMod, the group action is parameterized by a neural network. However, instead of assuming knowledge about $g$, they use a learned representation $R_g$ as the second input to the neural network: $x_1' = u_\psi(x_1, R_g)$, where $R_g$ is predicted by another network $R_\theta(x_2, x_2') = R_g$ from the latent representations of the second observed pair $(y_2, g \circ y_2)$. To learn the correct representation $R_g$, EquiMod is extended by a third loss term that maximizes the similarity of $R_\theta(x_1, x_1')$ and $R_\theta(x_2, x_2')$. Like CARE, they do not learn separate subspaces and instead use weighting factors for the invariant and equivariant loss terms on a joint latent space.

The problem can also be approached with auto-encoding approaches: Winter et al. (2022) learn embeddings of $y$ and $g$ in which they factorize $y' = g \circ y$ into a representation $\hat{x}$ of $y$ which is invariant to $g$ and a representation $R_Y(g)$ that represents the action $g$ in the data space $Y$, such that $y' = R_Y(g)\,\delta(\hat{x})$ with $\delta$ being the decoder.

The Unsupervised Neural Fourier Transform (U-NFT; Koyama et al., 2023; Miyato et al., 2022) is an auto-encoder framework with a linear group action model in latent space. For their learning setup, Koyama et al. assume access to $N$ sequences of data points $(y^0, \dots, y^i, \dots, y^N)$ with $y^i = (y^i_0, \dots, y^i_T)$ and $y^i_{t+1} = g_i \circ y^i_t$, one sequence for each implicitly observed, but unknown group action $g_i \in G$. An autoencoder reconstructs $y^i_{t+1}$ from the predicted $\hat{x}^i_{t+1}$, which in turn is predicted via $\hat{x}^i_{t+1} = R(g_i) x^i_t$. The representation is estimated from these examples using least squares, followed by a post-hoc basis transformation on the set of $R(g_i)$ to find a block-diagonal representation $B(g_i)$ that facilitates disentanglement of the irreducible components of $R(g_i)$. Mitchel et al. (2024) propose a variation of NFT, where the learning setup is restricted to directly produce a block-diagonal representation $R(g_i)$, avoiding the need for a post-hoc basis transformation. However, to achieve this, they also restrict $R(g_i)$ to be orthogonal.

Winter et al. (2022) and Allingham et al. (2024) only recover an invariant representation and model the group action separately in the data space. All the other methods in this section try to solve the same general task of learning equivariant and invariant representations of the data without explicit knowledge of the underlying group actions. CARE (Gupta et al., 2023) is the closest encoder-only (contrastive) approach to EbC, but constrains the embedding to the hypersphere, which imposes additional structure on the learned representation, while EbC can learn different topologies (e.g., torus in Fig. 4, hypersphere in Fig. 5). In terms of function and data requirements, U-NFT (Miyato et al., 2022; Koyama et al., 2023) is the conceptually closest work, but requires learning a full generative model of the data.

Comment

I thank the authors for their response. The revisions have improved the clarity of the paper, and the core contributions are now more accessible. Notably, the expanded discussion of related work provides valuable context and more clearly positions the paper within the existing literature.

The inclusion of the new experiment based on a real-world neuroscience scenario—where hippocampal place cell activity is recorded as a rat traverses a linear track—adds empirical strength and highlights the practical relevance of the proposed method.

Nonetheless, I remain concerned about the numerous typographical errors and the lack of clarity in several parts of the manuscript. Many of the clarifications provided during the rebuttal phase should be integrated directly into the paper to enhance its readability and comprehensiveness. These unresolved issues continue to detract from the overall presentation. Although I have revisited my evaluation in light of the improvements, my recommendation still leans toward rejection at this point.

Comment

Dear reviewer,

Thank you very much for following up and acknowledging the improvements through the provided clarifications, the expanded related work section, and the real world experiments. Of course, the content we posted during the rebuttal went into the updated manuscript.

We apologize again about typos (which we clarified above), but want to respectfully push back on your assessment that this prohibits understanding our method, which is also mirrored in other reviews. From your suggestions on how to correct the flagged typos, it is apparent that the intended meaning was clear from the context given in the paper.

What are your remaining concerns about Section 2 and other sections? We believe this section rather comprehensively outlines the steps for implementing EbC and its theoretical foundation, but we would be happy to revise unclear parts of it and post them here.

Comment

Dear reviewer, in the meantime, we wanted to share our revised Section 2; most suggested changes were applied following the paragraph on "Implicit group representations". We hope that the revised section clarifies remaining concerns about our approach.

Please note, two equations were too complex to render on OpenReview, marked with * below; in the paper, these equations are typeset in a single line as in our submission.


Implicit group representations. We model the group representation via a non-parametric approach. Assume that for each group element $g \in G$, we are given two matrices $\mathbf{X}, \mathbf{X}' \in \mathbb{R}^{M \times d}$, $M > d$, where the row vectors $(\mathbf{x}_i, \mathbf{x}_i')$ are related via the group element as $\mathbf{x}_i' = g\,\mathbf{x}_i$, $i \in \{1, \dots, M\}$. As a shorthand, we write $\mathbf{X}' = g\mathbf{X}$. Then, the expression

$$\hat{\mathbf{R}}(\mathbf{X},\mathbf{X}') = \arg\min_{\mathbf{R}\in\mathrm{GL}(d)} \|\mathbf{X}' - \mathbf{X}\mathbf{R}^\top\|_F^2 \quad\Leftrightarrow\quad \hat{\mathbf{R}}(\mathbf{X},\mathbf{X}') = (\mathbf{X}^\top\mathbf{X})^{-1}(\mathbf{X}^\top\mathbf{X}')$$

is a representation of $G$ with $\mathbf{R}(g) = \hat{\mathbf{R}}(\mathbf{X}, g\mathbf{X})$ for each $g \in G$. In practice, we do not have access to $(\mathbf{X}, \mathbf{X}')$ directly, but only to a nonlinear projection of these points via a mixing function $\mathbf{f}$, denoted $(\mathbf{f}(\mathbf{X}), \mathbf{f}(\mathbf{X}')) = (\mathbf{Y}, \mathbf{Y}')$. We map these observed samples to a feature space using a learnable encoder $\phi: \mathbb{R}^D \to \mathbb{R}^d$ and insert the resulting matrices into the expression above. Our goal is to optimize $\phi$ such that $\hat{\mathbf{R}}\bigl(\phi(\mathbf{Y}), \phi(\mathbf{Y}')\bigr)$ becomes a representation of $G$. An advantage of this approach is that both the feature space and the group representation are fully defined via the feature encoder $\phi$.
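For concreteness, here is a minimal NumPy sketch of this non-parametric estimator under the closed-form expression above; the function name and the toy check are our own illustration, not the paper's code.

```python
import numpy as np

def estimate_group_representation(X: np.ndarray, X_prime: np.ndarray) -> np.ndarray:
    """Least-squares estimate of the linear map relating paired latents.

    X, X_prime: arrays of shape (M, d) whose rows satisfy x'_i ≈ R @ x_i.
    Returns the (d, d) matrix minimizing ||X' - X R^T||_F^2.
    """
    # Solve the normal equations (X^T X) R^T = X^T X'; lstsq is more
    # numerically stable than forming the inverse explicitly.
    R_T, *_ = np.linalg.lstsq(X, X_prime, rcond=None)
    return R_T.T

# Toy check: recover a random invertible map from noiseless pairs.
rng = np.random.default_rng(0)
d, M = 3, 12
R_true = rng.normal(size=(d, d))
X = rng.normal(size=(M, d))
X_prime = X @ R_true.T
R_hat = estimate_group_representation(X, X_prime)
assert np.allclose(R_hat, R_true, atol=1e-8)
```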

Equivariance by Contrast. Given this structure, we propose our model, called Equivariance by Contrast (EbC). The intuition is depicted in Fig. 3: the model gets access to two observed sample sets related by the group action ($\mathbf{Y}$ and $\mathbf{Y}'$), along with a query sample $\mathbf{y}$. The objective is to infer the group action from $(\mathbf{Y}, \mathbf{Y}')$, apply it to the query, and select the correct answer $\mathbf{y}'$ among a set of options $S$ that includes $\mathbf{y}'$ and negative samples randomly drawn from the dataset.

Formally, this objective is encoded via the likelihood*

$$p_\phi(\mathbf{y}' \mid \mathbf{y}, \mathbf{Y}, \mathbf{Y}', S) = \frac{\exp\!\left(-\|\mathbf{u}_\phi(\mathbf{y}, \mathbf{Y}, \mathbf{Y}') - \phi(\mathbf{y}')\|^2\right)}{\sum_{\mathbf{y}'' \in S} \exp\!\left(-\|\mathbf{u}_\phi(\mathbf{y}, \mathbf{Y}, \mathbf{Y}') - \phi(\mathbf{y}'')\|^2\right)}$$

We used the shorthand $\mathbf{u}_\phi$ to denote the operation of inferring the linear representation of the group element, $\hat{\mathbf{R}}\bigl(\phi(\mathbf{Y}), \phi(\mathbf{Y}')\bigr)$, and applying it to the feature vector of the query $\phi(\mathbf{y})$:

$$\mathbf{u}_\phi(\mathbf{y}, \mathbf{Y}, \mathbf{Y}') = \hat{\mathbf{R}}\bigl(\phi(\mathbf{Y}), \phi(\mathbf{Y}')\bigr)\, \phi(\mathbf{y})$$

To find the optimal feature encoder $\phi$, we minimize the negative log-likelihood across all pairs of samples and uniformly sampled negative examples*,

$$\min_{\phi} \mathcal{L}[\phi] = -\mathbb{E}_{\mathbf{y}, \mathbf{y}', \mathbf{Y}, \mathbf{Y}', S}\bigl[\log p_\phi(\mathbf{y}' \mid \mathbf{y}, \mathbf{Y}, \mathbf{Y}', S)\bigr],$$

which is closely related to the InfoNCE loss [30] augmented with the group-structure constraint.
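To make the training objective concrete, the following is a minimal PyTorch sketch of one EbC step on a single action group. It assumes an encoder `phi` that maps a single observation to $\mathbb{R}^d$ and a batch of observations to an $(M, d)$ matrix; the names (`ebc_loss`, `y_pos`, `y_neg`) are ours, and this is a sketch of the loss above, not the released implementation.

```python
import torch
import torch.nn.functional as F

def ebc_loss(phi, y, y_pos, Y, Y_prime, y_neg):
    """One EbC training step on a single action group (sketch).

    phi        : encoder mapping observations to R^d
    y, y_pos   : query observation and its transformed counterpart g·y
    Y, Y_prime : (M, ...) observation batches related row-wise by the same g
    y_neg      : (K, ...) negative observations drawn uniformly from the dataset
    """
    X, X_prime = phi(Y), phi(Y_prime)                    # (M, d) latents
    # Infer the linear representation of g by least squares in latent space.
    R_T = torch.linalg.lstsq(X, X_prime).solution        # (d, d), equals R̂^T
    u = phi(y) @ R_T                                      # apply R̂ to the query (row vector)
    # Candidate set S = {positive} ∪ negatives; InfoNCE-style classification.
    candidates = torch.cat([phi(y_pos).unsqueeze(0), phi(y_neg)], dim=0)   # (1+K, d)
    logits = -((u.unsqueeze(0) - candidates) ** 2).sum(dim=-1)             # negative squared distances
    target = torch.zeros(1, dtype=torch.long)             # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```

The softmax over negative squared distances inside `cross_entropy` reproduces the likelihood in Eq. (3) for the positive sample.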

Review (Rating: 5)

The authors present an algorithm to learn a representation of the group acting on a dataset as well as the latent space of the dataset. The authors assume that the data $x_i$ is only observable through a transformation $y_i = f(x_i)$. The authors address the problem by learning a mapping $\phi \approx f^{-1}$ (up to a linear transformation on $x$) and a matrix $R$ being a representation of the group action. They use a likelihood model as objective function to find $\phi$ and $R$. $\phi$ is represented with a neural network.

Strengths and Weaknesses

Strengths

  • The introduction is well-written.
  • The concepts of section 2 are introduced with enough clarity to be understood.

Weaknesses

  • The proof of Theorem 1 is given in Appendix A.3 while the main text presents an "informal" version of the theorem. Unfortunately, the appendix does not present a formal version of the theorem.
  • The empirical evidence of Theorem 1 and Corollary 2 on a synthetic dataset, without noise, in section 4 is not convincingly explained.
  • In a context where one seeks to learn the group acting on a dataset, it does not seem realistic to assume that both $Y$ and $Y'$ are known.
  • In the subsection "Metrics", $\hat{h}$ is not defined.
  • Experiments on $SO(n), O(n), GL(n)$ are limited to $n=3$.

Questions

  • The model of Eq. (2) assumes that the same group action (i.e., the same matrix R) is applied to the complete dataset X. Why do the authors make this assumption? It would seem more realistic that different actions are applied on different data points.
  • Why use the terms "content" and "style" in section 2?
  • The authors do not detail how they deal with different (1000) group actions in their experiments. Indeed, the method on section 2 is only described in the case where a single group action is applied on the whole dataset.
  • In the section "Metrics", the quantities of Eq. (8)-(10) use $h$, which should be unknown since $h = \phi \cdot f$ and $f$ is unknown. How are these quantities computed in practice?

Limitations

Yes.

Final Justification

I thank the authors for answering my concerns. I have raised my rating accordingly.

The authors have carefully and satisfyingly answered the points I made in my review.

The clarifications made by the authors in their nine responses should be included in the final version.

Formatting Issues

No.

Author Response

We thank the reviewer for their feedback. We appreciate the opportunity to clarify the points raised and answer open questions.


Re: Formal Statement of Theorem 1

You are correct that the manuscript would benefit from a more formal statement of the theorem in the appendix. We will update the appendix accordingly. Here is the full version we will include:

Theorem 1 (Identifiability of the Group Representation): Let $\phi$ be a model parameterizing the distribution $p_\phi$ (Eq. 3), and let the model satisfy the diversity conditions in Def. 4. Let $f$ be an injective mixing function, and $g \in G$ be a group element according to Def. 3. Let $R$ be the representation of $G$, and $\hat R$ be the implicit representation of the group according to Eq. (2), such that $\hat R(X, gX) = R(g)$ for pairs of transformed samples $(X, gX)$ and group actions $g \in G$.

Then, for matching conditional distributions $p_\phi = p_{f^{-1}}$, the following holds for all points $x$ and group actions $g$ in the dataset:

  1. We recover the original vector space: $\phi(f(x)) = L x$, up to an ambiguity $L \in GL(d)$.
  2. We recover a representation of the group in the dataset: $\hat R(f(X), f(gX)) = L R(g) L^{-1}$.

Re: Empirical Evidence on Synthetic Data (Noise)

We believe this is a potential misunderstanding. Our empirical validation exactly matches the assumptions of Theorem 1 and Corollary 2.

To clarify the potential source of this misunderstanding: we indeed require the Diversity Condition (Appendix A, Def. 4). But the Diversity Condition does not imply the presence of noise in the datasets. Instead, it requires that there are enough group actions $g_i \in G$ with sufficient variation to induce the required diversity in the set of positive and negative pairs.

Nonetheless, we do see the value in systematically studying the effect of noise on the data generating process.

To facilitate this, we adjusted our data generating process to include a Gaussian noise term in the group action relationship in latent space: $x' = R(g_i) x + \epsilon$ with $\epsilon \sim \mathcal{N}(\mu = 0, \sigma^2)$. We set $\sigma^2 \in \{0.0, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ and additionally run with different numbers of samples per action ($2n$, $3n$, $4n$, $5n$) for $GL(n)$ and $n=3$. The samples per action are the number of pairs $(x, x')$ we use to fit $\hat{R}(g_i)$ for any given $g_i$.

Except for the described update to the group action relationship, we follow the setup of Table 1 described in our paper.
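As an illustration of this setup, here is a minimal sketch of the noisy latent-pair generation; the function name and the exact sampling details are our own assumptions, not the paper's code.

```python
import numpy as np

def noisy_pairs(R_g: np.ndarray, n_samples: int, noise_std: float,
                rng: np.random.Generator) -> tuple[np.ndarray, np.ndarray]:
    """Draw latent pairs (x, x') related by x' = R(g) x + eps, eps ~ N(0, noise_std^2 I)."""
    d = R_g.shape[0]
    X = rng.normal(size=(n_samples, d))                 # ground-truth latents x
    eps = noise_std * rng.normal(size=(n_samples, d))   # additive Gaussian noise
    X_prime = X @ R_g.T + eps
    return X, X_prime

# Example: one random GL(3) action, 2n = 6 samples per action.
rng = np.random.default_rng(0)
R_g = rng.normal(size=(3, 3))
X, X_prime = noisy_pairs(R_g, n_samples=6, noise_std=1e-2, rng=rng)
```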

Table 3.1: EbC on noisy data

| Samples Per Action | Noise Std. | Acc(C) | R²(G) | R²(x) |
|---|---|---|---|---|
| 6 | 0e+00 | 98.7±0.6 | 93.1±1.7 | 99.6±0.0 |
| 6 | 1e-05 | 98.7±0.8 | 93.2±1.3 | 99.6±0.1 |
| 6 | 1e-04 | 98.7±0.7 | 93.7±2.2 | 99.6±0.1 |
| 6 | 1e-03 | 98.6±0.6 | 93.3±1.9 | 99.6±0.1 |
| 6 | 1e-02 | 98.7±0.7 | 93.0±1.4 | 99.6±0.0 |
| 6 | 1e-01 | 99.1±0.5 | 91.6±0.4 | 99.6±0.1 |
| 9 | 0e+00 | 98.4±0.5 | 99.5±0.2 | 99.8±0.1 |
| 9 | 1e-05 | 98.7±0.6 | 99.6±0.1 | 99.8±0.0 |
| 9 | 1e-04 | 98.6±0.5 | 99.6±0.1 | 99.8±0.0 |
| 9 | 1e-03 | 98.6±0.6 | 99.6±0.1 | 99.8±0.1 |
| 9 | 1e-02 | 98.5±0.8 | 99.6±0.1 | 99.8±0.0 |
| 9 | 1e-01 | 99.1±0.4 | 96.9±0.4 | 99.8±0.0 |
| 12 | 0e+00 | 98.5±0.7 | 99.7±0.0 | 99.8±0.0 |
| 12 | 1e-05 | 98.6±0.6 | 99.7±0.1 | 99.8±0.0 |
| 12 | 1e-04 | 98.7±0.6 | 99.7±0.1 | 99.8±0.0 |
| 12 | 1e-03 | 98.5±0.7 | 99.7±0.1 | 99.8±0.0 |
| 12 | 1e-02 | 98.6±0.8 | 99.7±0.1 | 99.8±0.0 |
| 12 | 1e-01 | 99.0±0.7 | 97.9±0.2 | 99.8±0.0 |
| 15 | 0e+00 | 98.7±0.5 | 99.7±0.0 | 99.8±0.0 |
| 15 | 1e-05 | 98.5±0.5 | 99.7±0.0 | 99.8±0.0 |
| 15 | 1e-04 | 98.5±0.5 | 99.7±0.0 | 99.8±0.0 |
| 15 | 1e-03 | 98.5±0.6 | 99.7±0.0 | 99.8±0.0 |
| 15 | 1e-02 | 98.6±0.7 | 99.7±0.0 | 99.8±0.0 |
| 15 | 1e-01 | 99.0±0.5 | 98.3±0.2 | 99.8±0.1 |

Summary: Introducing Gaussian noise results in a drop in performance in terms of identifying a GL(n)-equivariant representation. This can be observed via the $R^2(G)$ metric, which drops from ~93% to ~91% for EbC models with $2n$ sample pairs per action used for fitting $\hat{R}(G)$. But as we would expect for this noisy setting, increasing the number of samples per action rectifies this problem. As we increase to $3n$ samples per action, we get ~99.5% $R^2(G)$ for the noiseless setting and any noisy setting with $\sigma^2 \le 0.01$, while the largest amount of noise tested ($\sigma^2 = 0.1$) results in ~96.9%. Finally, setting samples per action to $5n$, we get ~98.3% for the largest amount of noise tested and ~99.8% $R^2(G)$ for the less noisy settings.


Re: Assumption of Known Data Pairs $(Y, Y')$

We appreciate you raising this concern, which Reviewer 4zKK raised as well.

We'd like to refer you to our answer there. But for your convenience, TL;DR: This is a standard assumption that all recent related work tackling the same problem setting makes. See CARE (Gupta et al., 2023), STL (Yu et al., 2024), and U-NFT (Koyama et al., 2023). For a more detailed discussion of related work, we refer to our response to Reviewer s1fj.


Re: Definition of $\hat{h}$

We apologize for this oversight. What we mean here is $h$, not $\hat{h}$.


Re: Experiments Limited to $n = 3$

We'd like to refer you to the appendix, where we already show results for $SO(n), O(n), GL(n)$ with $n \in \{3, 5, 7, 9\}$.


Re: Question on a Single Group Action

We feel there may be a misunderstanding. Of course we don't make the assumption that there exists only a single group action (i.e., the same matrix $R(g)$) that is applied to the complete dataset.

To clarify: The observed variables are $Y, Y'$, not $X$. $X$ instead denotes the embeddings $X = \phi(Y)$. The data generating process used for the theory and the empirical validation on synthetic data is defined via Eq. (1). Eq. (2), on the other hand, simply describes how, given latent representations $X \in \mathbb{R}^{M \times d}$ and $X' \in \mathbb{R}^{M \times d}$ of $M$ observed sample pairs $Y, Y'$, we can compute an estimated representation $\hat{R}(X, X') = \hat{R}(g_i)$ of the group action $g_i$ these sample pairs share.

Indeed, for our experiments on the synthetic group dataset, we always generate 1000 random $R(g_i)$ from the respective group ($O(n)$, $SO(n)$, $GL(n)$). And in the case of dSprites, the $g_i = g_i^{(T)} \circ g_i^{(R)}$ observed during training are actually compositions of translation $g_i^{(T)}$ and rotation $g_i^{(R)}$ group actions.


Re: Question on "Content" and "Style"

With these terms we wanted to refer intuitively to the unchanging components (the invariant part of the representation) and the changing components (the equivariant part of the representation). Imagine the objects (content) of an image and the color, rotation, and scale (style) of these objects. However, if these terms are confusing or easily interpreted in a misleading way, we're happy to discuss alternatives.


Re: Question on Dealing with 1000 Group Actions

As referenced earlier, the way we do this is defined via Eq. (2) and, by extension, Eqs. (3-5). To clarify: $Y, Y'$ or their embeddings $X, X'$ are not the whole dataset. Instead, they represent a group of paired samples that are all related by the same group action $g_i$. Think of them as $Y_i, Y_i'$. Note that Eq. (5) defines that these are sampled from $p_{data}$, which represents the actual distribution over the full dataset.


Re: Question on Computing Metrics

Yes, this is absolutely correct: $f$ is unknown. However, $f(x)$ is not, because $f(x) = y$ represents the observed variable, see Eq. (1). On the other hand, $x$ in this case denotes the ground truth latents, which usually are unknown. This is why we required a purely synthetic data setting to verify the theoretical claims exactly and empirically.

In practice, for real-world datasets or benchmark datasets like dSprites where the ground truth latents are unknown, these metrics cannot be computed. This is why we introduce Acc(G, k) and Acc(C, k) as proxy metrics.
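For illustration, here is a minimal sketch of how a score such as $R^2(x)$ can be computed when ground-truth latents are available: fit the linear ambiguity $L$ by least squares and report the explained variance. The function name and details are our own and may differ from the paper's exact evaluation protocol.

```python
import numpy as np

def r2_latent(x_true: np.ndarray, x_hat: np.ndarray) -> float:
    """R^2 between ground-truth latents and embeddings, up to a linear map L.

    x_true: (N, d) ground-truth latents x
    x_hat : (N, d) embeddings phi(f(x)); identifiability predicts x_hat = L x_true
    """
    # Fit L (plus intercept) by ordinary least squares, then score the fit.
    X = np.concatenate([x_true, np.ones((len(x_true), 1))], axis=1)
    W, *_ = np.linalg.lstsq(X, x_hat, rcond=None)
    pred = X @ W
    ss_res = ((x_hat - pred) ** 2).sum()
    ss_tot = ((x_hat - x_hat.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```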

Comment

I thank the authors for answering my concerns. I have raised my rating accordingly.

The authors have carefully and satisfyingly answered the points I made in my review.

The clarifications made by the authors in their nine responses should be included in the final version.

Comment

Thank you for your positive assessment following our discussion. We have incorporated the suggested changes into our paper and shared a summary as a global comment for all reviewers and the AC. We appreciate your time and suggestions.

Review (Rating: 4)

This paper considers the learning of equivariance by unsupervised/contrastive means. The key mathematical result is on identifiability in this setting, by extending results of Roeder et al. a little bit. The algorithmic approach follows the mathematical development and is demonstrated to work in one exemplary setting.

Strengths and Weaknesses

STRONG: The problem of symmetry discovery is important in numerous settings, and the findings here for such a problem are intriguing. Note that the setting is where there are batches of samples under the same transformation. The paper hits all the points of formulation, theory, algorithm, and basic experimental demonstration.

WEAK: The seemingly non-native English phrasing, e.g. "Group theory allows to study this structure" -> "Group theory allows studying this structure" could be improved.

Even though Thm 1 is the main result, the technical diversity conditions are omitted in the main text: to this reviewer, this feels like a significant shortcoming that is easily rectified by moving them up from Appendix A.

The mathematical advance over Roeder et al [14] seems marginal.

I am not an expert in the motivating examples, but the batched data needed for this approach seems like a very strong assumption.

Questions

Clarifying the differences between the present work and other past work on learning group-theoretic structure would be very helpful in clarifying the contributions here. For example [Dehmamy et al., "Automatic symmetry discovery with lie algebra convolutional network," NeurIPS 2021], [Yang, et al., "Generative adversarial symmetry discovery," ICML 2023], or [Yu et al., "Information lattice learning," JAIR 2023]. See also references therein and thereto. Can you do that? It would help with your claim about "group representation learning from unlabeled observational data is feasible at all".

Limitations

yes, they have addressed limitations. no discussion of potential negative societal impact.

Final Justification

revised up, given the clarifications w.r.t. past literature

Formatting Issues

formatting seems fine

Author Response

We thank the reviewer for recognizing the importance of the problem and for their valuable feedback. We are happy to clarify the points raised.

Regarding the English phrasing, we appreciate you pointing this out and will perform a thorough proofread for the camera-ready version.


Re: Placement of Diversity Conditions

Thank you for raising this concern; we agree that the theory section would benefit from highlighting the full set of assumptions more prominently and concisely in the revised version of the main text.

We would like to point out, though, that the diversity condition has been (informally) introduced in the main text (lines 107-111) right before stating Theorem 1. We will move the concise but complete definition of all conditions from Appendix A into the main theory section to make Theorem 1 and its assumptions clearer.


Re: Mathematical Advance Over Roeder et al. [14]

Our Theorem 1 shows that group representations can be learned via the implicit representation in Eq. 2 and provides an identifiability proof for jointly learning the latent space and the group action using EbC. This is orthogonal to the key result in Roeder et al. [14].

We believe that you refer to the fact that, as part of our proof, we indeed leverage prior results from Roeder et al. [14]; specifically, we extend the canonical discriminative form towards Euclidean spaces (Theorem 5, see appendix) and use this form as part of our proof. This extension is indeed marginal, but it is a technical detail of our proof strategy.

Does this clarify the concern? We would be happy to discuss more details.


Re: Batched Data Assumption

We appreciate the sentiment and are happy to elaborate on this:

  • Common Assumption: All of the recent related works known to us, which tackle this unsupervised problem, such as CARE (Gupta et al., 2023), STL (Yu et al., 2024), and U-NFT (Koyama et al., 2023), all require observing multiple pairs of data transformed by the same unknown action. Our data requirement is therefore in line with the current state of the art for this specific problem setting.

  • Practical Scenarios: This data structure naturally arises in many scientific and systems-identification domains where one observes a system before and after an intervention/perturbation/transformation. For example:

    • Genomics (CRISPR Screens): Measuring gene expression in cells before and after a specific gene knockout is applied to a batch of them.
    • Structural Biology: Observing a protein's structure before and after a specific ligand binds.
    • Neuroscience (fMRI): Recording brain activity at rest and then again after presenting a specific stimulus to a subject.

For the most straightforward application of EbC, this requires discrete actions.

However, as we demonstrate in our new real-world experiment (see response to Reviewer zh3E), our method can be easily adapted to work effectively even with continuous auxiliary variables by discretizing the action space, showing its flexibility beyond these examples.


Re: Clarifying Contributions vs. Other Past Work

Thank you for suggesting these references. They allow us to better position our work. For a broader and more detailed comparison, we highly recommend checking out our extended related work discussion in the response to Reviewer s1fj.

  • Dehmamy et al. (NeurIPS 2021): This work on Lie algebra convolutional networks is an excellent example of building equivariant network architectures. This line of work is complementary to ours. Such methods use prior knowledge of a symmetry group to build a specific inductive bias into the model. In contrast, our work aims to discover the unknown symmetry transformation from data without pre-specifying the network architecture to be equivariant to a specific type of groups.

  • Yang et al. (ICML 2023): This work also learns a matrix representation $R(g)$ of a group action. However, a key difference is that their representation acts directly on the observed data (i.e., $y' = R(g) y$). Our method solves a different problem: learning a latent representation $\phi(y)$ and a group action that applies in that latent space. This allows our method to handle complex, high-dimensional observations where the corresponding group action in data space may not be linear.

  • Yu et al. (JAIR 2023): Thank you for this reference. Our understanding of "Information Lattice Learning" is that it learns transformations between data points in the data space, without an explicit connection to group theory or the goal of recovering a linear group representation. The "information lattice" provides a different, non-algebraic structure for understanding these relationships. We therefore see the goals and technical approaches as distinct from ours.

Comment

Thanks to the authors for clarifying the standardness of the problem formulation of batched data, and how it arises in several practical scenarios. Also for clarifying the mathematical advance over Roeder et al. in terms of implicit representations. This, together with the neuroscience example added in response to other reviewers, gives me greater confidence in the work.

If the paper is accepted, I would recommend including the longer discussion of other literature not just in the rebuttal but also in the manuscript itself, since it helps contextualize. (As far as I know, the information lattice is very much group-theoretic and algebraic: in the finite group setting, it is equivalent to the subgroup lattice. Please check.)

Comment

Dear reviewer,

Thank you for the follow-up. We're glad we could address your questions.

To clarify, we updated our manuscript with the additional literature suggestions, both in the introduction to the paper, and as a comprehensive review in the supplementary material (due to space constraints). We believe that this positions EbC much better in the existing literature, and thank you again for your suggestions.

If you have further questions and suggestions for improvements we can address, please let us know.

Review (Rating: 5)

The paper proposes a method for learning a representation of a group from data through a contrastive learning approach, with theoretical guarantees, and verifies it on synthetic and 2D image datasets.

Strengths and Weaknesses

Strengths:

  • The paper is very well written, with clear motivation and exposition.
  • Figure 3 provides an intuitive visual explanation of the proposed method, making the main idea easy to understand.
  • Theoretical results (e.g., Theorem 1) demonstrate identifiability of the group action and latent space under mild assumptions, which is uncommon for self-supervised equivariant learning methods.
  • The approach is simple and efficient, relying on contrastive learning with an encoder-only pipeline and on-the-fly least-squares fitting, avoiding the complexity of generative models.
  • Strong empirical results: Experiments cover a diverse set of group structures (e.g., SO(3), O(3), GL(3), and product groups), demonstrating the generality of the method. Achieves near-perfect latent reconstruction and group prediction accuracy, outperforming relevant baselines on synthetic data.

Weaknesses:

  • Data efficiency is not systematically studied; there is no analysis on how performance scales with dataset size, batch size, or the number of negative samples, which is especially relevant given the theoretical assumptions on required samples. I would suggest including experiments sweeping dataset size and batch/negative counts to measure convergence and data efficiency.
  • Comparison to baselines is somewhat limited; recent related approaches such as Park et al. (ICML 2022), Homomorphism Auto-Encoder (Keurti et al., ICML 2023), and other group-equivariant learning methods are not included or discussed in detail.
  • The key assumption (that all n+1 samples in a mini-batch share the same, unknown group element) is not tested or validated on real-world datasets; it is unclear how well the method handles weaker or approximate group structure in practical settings.
  • Application to real-world data is limited; all experiments are on synthetic datasets (e.g., Infinite-dSprites), with no demonstration on realistic or noisy sequence data.

Additional References Suggested:

  • Park et al., “Learning Symmetric Embeddings for Equivariant World Models,” ICML 2022.
  • Keurti et al., “Homomorphism Auto-Encoder,” ICML 2023.
  • Winter et al., “Learning Invariant and Equivariant Representations in Unsupervised Group Settings,” 2022.

Questions

  • Figure 2: The encoder in the caption is $f$, and there is another encoder $\phi$ after the samples, which is a bit confusing. What is $Q$ in Figure 2?
  • I have less background in neuroscience and biology, and I’d appreciate some concrete examples on this.

Limitations

Yes

Final Justification

I read the author response and other reviews. I would like to acknowledge the very detailed response of the authors to my concerns and questions. The response addresses most of my questions, and I think my comment on e.g., data efficiency experiment should not be expected to be fully addressed during the short rebuttal period. I still hold my view that this paper should be accepted to the conference and am happy to defend. I encourage the authors to incorporate the new materials and further strengthen in the paper revision.

Formatting Issues

no

Author Response

We would like to thank the reviewer for their positive assessment and very helpful and constructive feedback.

Before addressing the specific points, we'd like to highlight that on synthetic group data our empirical results also extend beyond $n=3$ to $n \in \{5, 7, 9\}$; see Appendix C, Fig. 7.

Re: Data efficiency

We very much agree with your proposal. We have run new experiments to analyze data efficiency by varying dataset size, number of negative samples, and samples per action (for fitting $R(g)$), keeping all other parameters identical to our setup in Table 1. We will add the full results to the appendix.

Note: The following results also exist for SO(n) and O(n), but GL(n) has performed strictly worse. Happy to provide SO(n), O(n) results upon request.
Note 2: We have marked the hyperparameter setting corresponding to the replication of Table 1 in each of the following result tables with **.

Table 1.1: EbC on reduced dataset size, $n=3$.

| Group Type | Metric | 50k | 100k | 500k | 1000k ** |
|---|---|---|---|---|---|
| GL(n) | R²(x) | 99.8±0.0 | 99.8±0.0 | 99.8±0.0 | 99.8±0.0 |
| GL(n) | R²(G) | 99.7±0.1 | 99.7±0.0 | 99.7±0.0 | 99.7±0.0 |
| GL(n) | Acc(C) | 91.9±0.9 | 97.2±0.8 | 98.5±0.6 | 98.5±0.7 |

Summary: The results show that our method is highly robust to smaller dataset sizes. Even with 20x less data, the identifiability of the equivariant representation remains nearly perfect ($R^2(x)$ and $R^2(G)$ stay above 99.7%). The primary effect of smaller datasets is a moderate drop in accuracy for the invariant part of the embedding (Acc(C)).

Table 1.2: EbC on reduced number of negatives in a batch, $n=3$.

| Group Type | Metric | 1024 | 2048 | 4096 | 8192 | 16384 ** |
|---|---|---|---|---|---|---|
| GL(n) | R²(x) | 99.7±0.0 | 99.7±0.1 | 99.8±0.1 | 99.8±0.1 | 99.8±0.0 |
| GL(n) | R²(G) | 99.5±0.1 | 99.6±0.1 | 99.6±0.1 | 99.6±0.1 | 99.7±0.0 |
| GL(n) | Acc(C) | 98.9±0.6 | 99.1±0.6 | 99.1±0.5 | 98.8±0.6 | 98.5±0.7 |

Summary: Reducing the number of negative samples has a minimal effect on performance. While there is a slight downward trend in the metrics, the changes are very small and often close to the standard deviation across runs.

Table 1.3: EbC on reduced number of samples per action for fitting the linear least squares, $n=3$.

| Group Type | Metric | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 12 ** |
|---|---|---|---|---|---|---|---|---|---|
| GL(n) | R²(x) | 98.5±1.1 | 5.5±10.0 | 89.6±28.8 | 99.6±0.0 | 99.8±0.1 | 99.8±0.0 | 99.8±0.1 | 99.8±0.0 |
| GL(n) | R²(G) | -1.4±1.7 | -7.3±21.7 | 76.9±28.8 | 93.1±1.7 | 98.9±0.9 | 99.5±0.2 | 99.5±0.2 | 99.7±0.0 |
| GL(n) | Acc(C) | 99.1±0.3 | 35.4±27.1 | 98.4±0.3 | 98.7±0.6 | 98.4±0.8 | 98.6±0.8 | 98.4±0.5 | 98.5±0.7 |

Summary: These results confirm that the theoretical minimum of $n$ samples is insufficient in practice. This is expected, as achieving the theoretical limit would require the sampled pairs to form a full-rank system after being encoded into latent space, which is unlikely to occur consistently throughout model training. However, performance recovers to $>99\%$ with a modest number of samples (e.g., $> 2n$), demonstrating practical applicability.

Re: Comparison to baselines

We appreciate this point and have added an extended related work section to the paper (see the full discussion in the response to Reviewer s1fj). The primary distinction between our method and the suggested works (Park et al. '22, Keurti et al. '23) is that our method does not require ground-truth knowledge about group actions $g$.

  • Park et al. (ICML 2022): This method requires known group actions $g$ as input to its latent action model and requires the network architecture itself to be G-equivariant, presupposing knowledge of G.
  • Keurti et al. (ICML 2023): This work uses an autoencoder framework and also requires known parametrized $g$ to learn a matrix representation $\hat{R}(g)$. While its goal of finding a linear representation is similar, it relies on a stronger supervisory signal and incurs the overhead of a decoder.

In contrast, our approach only requires the weak assumption that sets of sample pairs underwent the same unknown transformation.

Re: Weak or approximate group structure & Validation on real-world dataset

We agree that these are crucial points and have addressed them both theoretically and empirically.

Theoretically, our identifiability proof (see Appendix) does not rely on a strict group structure. The core assumption about the data generating process that is used in the proof is that pairs of latent points $(x, x')$ are related by a linear map, $x' = A_i x$. While we frame this in the context of group representations ($A_i = R(g_i)$), the proof does not require the set of matrices $\{A_i\}$ to form a group. This inherent flexibility suggests our method is suited for data with approximate or learned symmetries as well.

To empirically validate this, we applied our method to a real-world neuroscience dataset (Grosmark & Buzsáki, Science 2016). Data: We used recordings from the hippocampus of rats running on a 1.6 m long linear track. This yields two time series: neural activity $\{y_t\}$ and behavior (position $p_t$ and direction $d_t$). Setup: We formed pairs $(y_t, y_{t+k})$ and clustered them based on the change in behavior $\Delta p_{t,k} = p_{t+k} - p_t$. Each cluster of pairs represents an approximate, unlabeled transformation of the same type (action); a sketch of this grouping step is shown below. We then applied our method (EbC) to learn a behavior-equivariant latent space from the neural data. We compare against CEBRA-Behavior (Schneider et al., Nature 2023), a state-of-the-art method for this task, using the same data splits and evaluation protocol. Evaluation: We decode position (k-NN regressor) and direction (logistic regression) from the learned embeddings.
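Below is a minimal sketch of the pair-grouping step described above (forming time-lagged pairs and binning them by displacement); the function name, binning strategy, and signature are our illustrative assumptions, not the authors' released pipeline.

```python
import numpy as np

def group_pairs_by_action(neural: np.ndarray, position: np.ndarray,
                          k: int, n_bins: int) -> dict[int, list[tuple[np.ndarray, np.ndarray]]]:
    """Group time-lagged pairs (y_t, y_{t+k}) by discretized behavior change.

    neural  : (T, n_neurons) activity time series {y_t}
    position: (T,) track position p_t
    Pairs whose displacement falls in the same bin are treated as sharing
    one approximate, unlabeled action.
    """
    delta = position[k:] - position[:-k]                     # Δp_{t,k} = p_{t+k} - p_t
    edges = np.quantile(delta, np.linspace(0, 1, n_bins + 1))
    labels = np.clip(np.digitize(delta, edges[1:-1]), 0, n_bins - 1)
    groups: dict[int, list[tuple[np.ndarray, np.ndarray]]] = {b: [] for b in range(n_bins)}
    for t, b in enumerate(labels):
        groups[b].append((neural[t], neural[t + k]))
    return groups
```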

Table 2.1: Position Decoding (Median Absolute Error in cm)

| Method | Rat Name | Train | Val | Test |
|---|---|---|---|---|
| CEBRA-Behavior | achilles | 1.67±0.04 | 6.93±0.42 | 5.57±0.51 |
| CEBRA-Behavior | buddy | 6.13±0.17 | 12.41±0.66 | 12.40±0.63 |
| CEBRA-Behavior | gatsby | 6.68±0.14 | 11.65±1.09 | 11.40±0.30 |
| EbC | achilles | 0.83±0.06 | 4.74±0.35 | 5.76±0.47 |
| EbC | buddy | 1.24±0.10 | 14.04±0.70 | 16.02±1.55 |
| EbC | gatsby | 0.89±0.02 | 8.76±0.24 | 14.28±0.87 |

Table 2.2: Direction Decoding (Accuracy %)

| Method | Rat Name | Train | Val | Test |
|---|---|---|---|---|
| CEBRA-Behavior | achilles | 60.15±0.32 | 55.95±0.98 | 59.62±0.79 |
| CEBRA-Behavior | buddy | 60.26±0.25 | 69.30±0.81 | 58.33±1.36 |
| CEBRA-Behavior | gatsby | 62.78±0.64 | 56.94±1.81 | 65.96±0.77 |
| EbC | achilles | 92.17±0.66 | 88.62±1.77 | 82.20±0.50 |
| EbC | buddy | 85.40±3.38 | 84.74±4.57 | 79.06±5.28 |
| EbC | gatsby | 83.62±2.70 | 84.58±1.24 | 79.18±0.68 |

Results: Our method (EbC) performs close to / on par with CEBRA-Behavior for position decoding (test set relative error of 3.5-8.9% for EbC vs. 3.5-7.7% for CEBRA). Crucially, EbC significantly outperforms the baseline in decoding the direction of movement, achieving 79-82% accuracy on the test set compared to CEBRA's 58-69%. This successful application demonstrates that EbC can learn meaningful representations from noisy, real-world data with only approximate symmetries.

Finally, we also ran a systematic study on the effect of noise within the data-generating process. We refer to our reply to bGkb for these results.

Answers to Questions

Q1 - Figure 2

We apologize for the confusion caused by the typo and notation. We will correct Figure 2 in the revision. To clarify:

  • $f$: Represents the unknown data-generating function, an injective map from a true latent space $x$ to the observed samples $y$.
  • $\phi$: The learnable encoder that maps observed samples $y$ to an embedding space.
  • In the caption, it says "encoder $f$", which is a very unfortunate typo and should read "encoder $\phi$".
  • $Q$: This was a typo and should be $\hat{R}(g)$. It represents the estimated linear group representation that is computed on-the-fly via linear least squares within each mini-batch.

Q2 - Example applications

We are happy to provide more concrete examples. Our framework is broadly applicable to any "intervention" setting where one observes a system's state before and after a common transformation is applied to a set of samples.

  • Medicine: One could study the effects of various drugs by observing patient physiology before and after treatment. The model could learn a "drug-effect-equivariant" representation, where the transformation corresponds to the physiological change induced by a specific drug.
  • Cell Biology: In single-cell RNA analysis, researchers measure gene counts, perform an intervention like a gene knockout, and measure again. Our method could identify representations equivariant to the effects of specific gene edits without prior knowledge of those effects.
  • Time-Series with Auxiliary Variable: As demonstrated in our new experiment, the framework can be creatively applied to time-series data where an auxiliary variable (like position) allows for grouping time-steps that represent a similar, unknown evolution of the system.
Comment

Dear reviewer,

thank you again for your positive assessment of our work. We would be happy to follow up on any remaining questions during the remainder of the discussion phase.

Comment

Dear AC, Dear Reviewers,

Thank you for the constructive feedback and the thoughtful discussion over the past days. Following the suggestions of the reviewers, we added new experiments (see rebuttal to zh3E), extended the discussion of related work (see rebuttal to s1fj), and improved the description of EbC (see comment to s1fj). A comprehensive summary of changes to the manuscript is posted in a separate comment below.

Following the discussion phase, reviewers acknowledged our edits as follows:

  • Reviewer 4zKK states the clarifications & additional results gives them "greater confidence in the work."
  • Reviewer bGkb states "the authors have carefully and satisfyingly answered the points" made in the review.
  • Reviewer s1fj acknowledges that "revisions have improved the clarity of the paper, and the core contributions are now more accessible" and that " inclusion of the new experiment based on a real-world neuroscience scenario [...] adds empirical strength and highlights the practical relevance of the proposed method".

We were also happy to see reviewers highlighting a large number of positive aspects and strengths:

  • Problem significance (Reviewer 4zKK): Symmetry discovery is recognized as important; the work is intriguing for this problem.
  • Clarity and presentation (Reviewer zh3E): The paper is described as very well-written with clear motivation and exposition; Figure 3 is noted for its intuitive visualization.
  • Clarity of concepts (Reviewer bGkb): The introduction is well-written and the concepts in Section 2 were presented with enough clarity to be understood.
  • Theory and guarantees (Reviewer zh3E): The theoretical results (e.g., Theorem 1) provide identifiability guarantees that are uncommon for self-supervised equivariant learning methods.
  • Simplicity and efficiency (Reviewer zh3E): The encoder-only contrastive pipeline is praised for being simple and efficient, avoiding generative-model complexity.
  • Completeness of approach (Reviewer 4zKK): The paper is viewed as hitting all the points of formulation, theory, algorithm, and basic experimental demonstration.
  • Completeness of submission (Reviewer s1fj): Reviewer s1fj acknowledges that "supplementary material is provided and includes both the theoretical proofs and additional experimental details."

In the next comment below, we detail the edits to the manuscript. We hope that they comprehensively address the reviewer's concerns.

Comment

Below we summarize how the discussion with the reviewers was incorporated into the manuscript. The additions to the text and experimental results are detailed in the respective rebuttals and official comments.

Edits by Section

  • Section 1 (Introduction)

    • Expanded and updated related work discussion (see rebuttal to s1fj for text); highlight difference in tasks (known vs unknown group actions) and method (encoder-only vs generative models, linear vs non-linear group representations) to better position EbC, clarifying our contribution to the field (4zKK, s1fj, zh3E)
  • Section 2 (Method / "Learning group structure with contrastive learning")

    • Revised (see comment to s1fj for text); clarified notation for $X, X', Y, Y', g, R(g), \hat R(X, X')$; explained "pairs"; defined $S$ (negatives). (s1fj)
    • Clarified use of multiple group actions across the dataset and how Eq. (2) estimates $\hat R(g)$ per action $g$. (bGkb)
    • Moved formal definition of diversity conditions into the main text. (4zKK)
    • Corrected Theorem 1 statement, fixing the typo $p_\phi = p_{f^{-1}}$; standardized symbols/notations. (s1fj)
    • Justified the batched-data assumption with recent work (CARE, STL, U-NFT). (4zKK, bGkb)
    • Clarified the use of the terms 'content' and 'style' for equivariant and invariant representations. (bGkb)
  • Section 3 ("Experiment Setup")

    • Stated that synthetic group datasets are noiseless by design matching the theory; pointed to noise experiments in the appendix. (bGkb)
    • Clarified which metrics require ground-truth latents; referenced Acc(C, k) and Acc(G, k) as proxy metrics which don't require ground-truth latents, and pointed to existing results in the appendix for empirical validation of these metrics as viable alternatives. (bGkb)
  • Section 4 ("Empirical Results")

    • More prominently highlighted that the existing results for $O(n)$, $SO(n)$, and $GL(n)$ with $n > 3$ are provided in the appendix. (bGkb, zh3E)
  • Section 5 ("Further Analysis, Ablations, and Modeling Choices")

    • We added a paragraph calling out the additional experiments (see below) and pointing to the respective sections in the appendix. (zh3E, bGkb, s1fj)
  • Appendix

    • Revised statement of Theorem 1 and diversity conditions (bGkb)
    • Clarified contributions relative to Roeder et al. (identifiability proof for jointly learning the latent space and the group actions). (4zKK)
    • Included the extensive related work discussion posted in the rebuttal. (4zKK, s1fj, bGkb, zh3E)
    • Added more detailed discussion of potential applications and types of datasets matching the required data structure. (zh3E, 4zKK, bGkb)

Additional experiments. We added the following additional experiments to the appendix:

  • Data efficiency hyperparameter sweeps over dataset size, negatives, and samples per action. (zh3E)
  • Noise robustness in latent dynamics with varying levels of additive Gaussian noise and varying samples per action. (bGkb, zh3E)
  • Real-world validation on time series data from neuroscience domain (recordings from rat hippocampus); applying EbC to achieve behavior-equivariant representations; comparison to CEBRA-Behavior. (s1fj, zh3E)

Presentation

  • Fixed typos, standardized notation, and improved non-native English phrasing. (s1fj, 4zKK, zh3E, bGkb)
  • Revised Figure 2 and caption (see s1fj rebuttal for text): encoder is $\phi$, $Q \to \hat R$; aligned notation with main text. (s1fj, zh3E)
Final Decision

The paper introduces a contrastive learning framework for group representation learning with theoretical guarantees of identifiability, supported by synthetic and 2D image experiments. Reviewers consistently highlight the clarity of the writing, strong motivation, and intuitive visuals. The reviewers noted the simplicity of the method: encoder-only with least-squares fitting, avoiding generative models, and also for achieving strong empirical results across diverse groups (SO(3), O(3), GL(3), product groups), often outperforming baselines. Theoretical contributions, especially Theorem 1, are noted as uncommon in equivariant self-supervised learning, though some reviewers find the novelty over prior work (e.g., Roeder et al.) marginal. Weaknesses center on missing details or limitations: (1) data efficiency and scaling behavior are untested, leaving unclear how performance depends on dataset size, batch size, or negative samples; (2) comparison to recent related baselines (e.g., Park et al. 2022, Keurti et al. 2023, Dehmamy et al. 2021, Yang et al. 2023, Yu et al. 2023) is incomplete; (3) the strong assumption that all samples in a batch share the same group action is unrealistic and unvalidated on real data; (4) experiments are restricted to synthetic datasets, without noisy or realistic applications.

Despite these critiques, the overall consensus is positive: the theoretical guarantees, simplicity of the framework, and strong synthetic performance outweigh the weaknesses, warranting acceptance. One reviewer had concerns about notation and clarity, but on following the responses and reading the paper, I find the critique to be outsized -- should be easily fixable. Given all these considerations, I would recommend acceptance. The authors are encouraged to incorporate various points that have come up during the discussion, and improve discussion of prior work.