seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
A self-supervised world model that jointly learns two architecturally distinct representations: one equivariant to specified transformations and another invariant to them
Abstract
Reviews and Discussion
This paper proposes seq-JEPA, a network architecture that simultaneously learns two representations, one equivariant and one invariant. This is achieved by encoding the relative transformation between two views and concatenating it with the original encoding. Numerical experiments on benchmark datasets show the proposed architecture outperforms other methods designed for either equivariance or invariance alone.
Strengths and Weaknesses
Strengths
- The paper proposes an interesting perspective of unifying invariance and equivariance in the network architecture, and proposes a novel architecture that achieves both goals.
- Experiments show decent performance on both invariance and equivariance tasks.
Weaknesses
- The relative transformations are fed as input to the model, but in practice this is unknown, which is a major limitation of the architecture.
- The experiments are conducted on synthetic or artificial datasets, and do not show the general applicability of the proposed architecture. This is related to the requirement of ground truth transformation info mentioned above.
- The paper does not provide new theoretical insights. While this is not a requirement for every paper, the limited practical applicability of the proposed architecture undermines the contribution of this paper.
Questions
- Can you clarify your architecture's reliance on having ground-truth transformations, and whether such information is provided in all experiments? How realistic is this in practical applications?
- It would be interesting to closely examine the two learned features for invariance/equivariance separately, to understand how well each feature is indeed trained to perform the corresponding task.
- Please provide the time and memory cost comparison between seq-JEPA and other baseline methods.
Limitations
The usage of ground-truth transformations needs to be clarified.
Justification for Final Rating
I raised my review score to 4, as the authors' response addresses the majority of my previous concerns, especially regarding the assumption of known transformations and synthetic experiments.
Formatting Concerns
None
Response to Reviewer 4
We thank the reviewer for their thoughtful comments, and for highlighting the importance of grounding our assumptions, clarifying applicability, and assessing computational overhead. We address each of your concerns below.
On Access to Relative Transformations
We respectfully disagree with the claim that requiring access to relative transformations (actions) is a “major limitation” of our method. Our design is both consistent with prior work and motivated by embodied learning principles.
First, the assumption of known or controlled transformations is standard in the equivariant SSL literature. Many works in this space assume access to transformation/action parameters [A–C], or to at least two pairs of views to which the same transformation/action is applied [D, E]. Our setting follows this precedent by modeling the transition $p(z_{t+1} \mid z_t, a_t)$, where $a_t$ is either known or can be approximated.
Second, as noted in the Introduction (see paragraph 3 in Intro), our design is grounded in the embodied learning perspective, where perception is inherently active. In such cases (e.g., robotic manipulation, foveated vision, embodied agents in simulation) the agent performs the actions itself and therefore has access to them. From a neuroscience perspective, this setup aligns with the concept of efference copy (or corollary discharge), i.e., an internal copy of motor commands sent to both the motor system and sensory areas to anticipate the consequences of self-generated actions. This mechanism enables animals to predict sensory inputs conditioned on their own actions, and we adopt the same inductive bias in seq-JEPA.
In domains where the agent does not execute the actions itself (e.g., passive observation or multi-agent settings), estimating the actions becomes necessary. In this case, the model can either be given approximate action estimates (e.g., via egomotion sensors or inverse modeling), or be extended to jointly infer latent actions along with representations by observing other agents. This opens a natural direction for future work: building theory-of-mind-capable agents that model others' actions and policies from observational data.
On Dataset Scope and Practical Applicability
While 3DIEBench involves synthetic renderings, our experiments are not limited to artificial domains. We also evaluate seq-JEPA on CIFAR-100, TinyImageNet, and PLS (Predictive Learning across Saccades on STL-10), all of which use natural images. In these settings, the "action" is either:
- the relative augmentation parameters (e.g., crop coordinates, jitter levels, blur kernel size), which are available in any augmentation pipeline; or
- the relative positions between foveated glances, sampled with biologically plausible saliency and inhibition-of-return (IoR) heuristics (in PLS), which simulate human-like visual behavior.
Our results on these settings confirm that our approach generalizes beyond synthetic 3D datasets, and operates effectively in low-resolution, weakly-structured, or biologically grounded regimes.
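For concreteness, here is a minimal sketch of the first case, assuming a crop parameterized by normalized (x, y, height, width); the image/crop sizes and helper names are illustrative rather than the exact pipeline used in the paper:

```python
import numpy as np

def sample_crop_params(rng, img_size=96, crop_size=64):
    """Sample a random crop and return its normalized (x, y, h, w) parameters."""
    x = rng.integers(0, img_size - crop_size + 1)
    y = rng.integers(0, img_size - crop_size + 1)
    return np.array([x, y, crop_size, crop_size], dtype=np.float32) / img_size

rng = np.random.default_rng(0)
params_a = sample_crop_params(rng)  # parameters that produced view x_i
params_b = sample_crop_params(rng)  # parameters that produced view x_{i+1}

# The "action" is simply the difference of the augmentation parameters,
# which any augmentation pipeline already has on hand.
action = params_b - params_a
print(action)
```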
On Separability and Interpretation of Learned Representations
We appreciate the request for a closer analysis of the learned equivariant and invariant features. Our probe-based results in Table 1 already reveal this division: the encoder (pre-aggregation) preserves action-relevant information (high R² in transformation prediction), while the aggregator (post-aggregation) supports invariant object classification (high top-1 accuracy).
Additionally, we include UMAP visualizations in Figure 5, which show this duality in latent space. The left and middle panels display structured orbits of transformed views around class centers in the encoder space—indicating equivariance. The right panel, which shows post-aggregation representations, demonstrates class-wise collapse and increased inter-class separability, confirming that the aggregator learns to filter transformation-specific variation. Finally, the middle panel, colored by rotation angle, confirms that encoder orbits are transformation-aware: latent trajectories evolve smoothly with rotation, and are consistent across object instances (e.g., compare the red cluster in the left panel with the corresponding cluster in the middle panel).
On Compute and Memory Overhead
We report pretraining compute time across methods (see table below). All experiments are run under the same configuration: a single A100 GPU, 128×128 resolution, batch size 512, for 1000 epochs on 3DIEBench. seq-JEPA with a sequence length of 1 has a similar runtime to the other baselines (e.g., BYOL, SimCLR), while seq-JEPA with a sequence length of 3 incurs a moderate increase in wall-clock time (15.1 GPU-hours) due to encoding additional action-view pairs.
GPU-Hour Comparison Table
| Method | A100 pretraining GPU hours |
|---|---|
| SimCLR | 11.4 |
| BYOL | 11.1 |
| SIE | 12.5 |
| VICReg | 10.9 |
| EquiMod | 11.8 |
| seq-JEPA (train seq len = 1) | 12.3 |
| seq-JEPA (train seq len = 3) | 15.1 |
seq-JEPA’s inference cost is primarily governed by two factors: sequence length and input resolution. Below, we report detailed FLOP counts (in Gigaflops) per forward pass of a single datapoint across a range of configurations (Note: these reflect inference-time cost):
FLOPs by Configuration
| Resolution / Config | Encoder (pre-aggregator) | Post-aggregator (seq len = 1) | Post-aggregator (seq len = 2) | Post-aggregator (seq len = 3) | Post-aggregator (seq len = 4) | Post-aggregator (seq len = 8) |
|---|---|---|---|---|---|---|
| Each view is 128×128 | 0.60G | 0.63G | 1.24G | 1.85G | 2.46G | 4.9G |
| Partial obs (fovea 64×64, full image 224×224) | 1.82G (224) / 0.15G (64) | 0.18G | 0.34G | 0.51G | 0.67G | 1.32G |
| Partial obs (fovea 84×84, full image 224×224) | 1.82G (224) / 0.30G (84) | 0.33G | 0.64G | 0.95G | 1.26G | 2.51G |
We can observe that:
- For single-view inference at 128×128 resolution, the encoder requires only 0.60G FLOPs, which matches the compute cost of standard SSL baselines such as SimCLR and BYOL.
- As sequence length increases, the additional compute cost grows sublinearly, since additional views must be encoded and aggregated. For example, lengths 2 and 3 (where invariant performance improves substantially) incur ~1.2G and ~1.85G FLOPs, respectively, remaining within practical limits for most modern applications.
- Crucially, this cost can be offset by reducing input resolution. In our saccade-based setup, we can feed low-resolution foveal glimpses (e.g., 64×64 or 84×84 crops sampled via saliency maps), which require only 0.15G–0.30G FLOPs per view. This enables the use of longer sequences at an overall compute cost comparable to full-resolution, single-view pipelines (e.g., compare the FLOPs of aggregating eight 64×64 crops with those of encoding a single 224×224 image in the second row of the table; see the quick check after this list). This is particularly appealing in real-time robotic systems, where sensors frequently provide partial, asynchronous, or spatially limited observations (e.g., narrow field-of-view cameras, event-based sensors, or sequential glances across a scene). In such contexts, seq-JEPA can accumulate and integrate these glimpses over time to form a compact, invariant summary for downstream reasoning, while maintaining low computational cost.
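As a quick check on the last point, using only the per-view numbers already reported in the FLOPs table (illustrative arithmetic, nothing beyond those figures):

```python
# Per-view encoder cost taken from the second row of the FLOPs table (GigaFLOPs).
flops_fovea_64 = 0.15   # one 64x64 foveal glimpse
flops_full_224 = 1.82   # one full 224x224 image

n_glimpses = 8
encode_glimpses = n_glimpses * flops_fovea_64   # 1.20 GFLOPs to encode all glimpses
print(f"{encode_glimpses:.2f}G for 8 glimpses vs. {flops_full_224:.2f}G for one full image")
# With aggregation included, the table reports 1.32G at seq len = 8, still below
# the cost of a single full-resolution forward pass.
```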
We thank the reviewer again for their helpful suggestions. We hope that these clarifications, particularly around access to actions, applicability to natural images, and compute-performance trade-offs, address your concerns. We also invite you to consult our responses to the other reviewers.
References
[A] Garrido, Quentin, Laurent Najman, and Yann LeCun. Self-supervised learning of split invariant equivariant representations. Proceedings of the 40th International Conference on Machine Learning. 2023.
[B] Garrido, Quentin, et al. Learning and leveraging world models in visual representation learning. arXiv preprint arXiv:2403.00504 (2024).
[C] Devillers, Alexandre, and Mathieu Lefort. EquiMod: An Equivariance Module to Improve Visual Instance Discrimination. The Eleventh International Conference on Learning Representations, 2023.
[D] Yerxa, Thomas, et al. Contrastive-equivariant self-supervised learning improves alignment with primate visual area IT. NeurIPS 2024.
[E] Shakerinava, Mehran, Arnab Kumar Mondal, and Siamak Ravanbakhsh. Structuring representations using group invariants. NeurIPS 2022.
I appreciate the authors' detailed response, which addresses the majority of my concerns, especially regarding the assumption of known transformations and the synthetic experiments. I will raise my score to 4.
We thank the reviewer for engaging with our response and for reconsidering their assessment. We will revise the final version to clarify the assumption of known transformations and to better highlight applicability to natural image settings.
The authors present a self-supervised representation learning approach that utilises a sequence of inputs each encoding different transformation state. Representations of each state are then enforced through a sequential transformer architecture that represents sequential action states by their relative changes to transformation parameters. The authors demonstrate superior invariant (top-1 classification) performance on a variety of benchmark datasets, as well as reporting transformation (relative and individual) prediction performance. The authors also present different sampling schemes for the sequential transformations, which present interesting findings, notably in regard to predictive learning across saccades, which arguably will motivate future work.
Strengths and Weaknesses
Strengths:
- Taking advantage of sequential transformation is an interesting direction, and the proposed method is simple yet effective. The proposed methodology aligns with speculative work, and makes good contributions to the direction of this research area.
- The sequence length investigation presents interesting results that are highly impactful in the community, notably when capturing representations that represent rotation and translations. This confirms and supports conjectures of other works that advocate for sequence learning for world-models and representation learning.
- Results demonstrate strong invariant and transformation prediction performance, reducing the trade-off demonstrated by previous works.
- I particularly like the honest and in-depth limitations and broader impact, it is nice to see you address the concerns around the use of known transformation groups.
- The predictive learning across saccades is a neat method to enforce such equivariance, while the performance is not entirely competitive, these findings motivate future work and studies.
Weaknesses:
- While the authors convincingly demonstrate the invariance of the learned representations, they do not fully substantiate the claim of equivariance; experiments should go beyond predicting the transformation parameters. While predicting the rotation is a useful task, it does not on its own guarantee that the representations are truly equivariant. A more direct analysis, which confirms that the entire representation vector transforms predictably with the input transformation, would provide stronger evidence for the model's equivariance.
- Some of the implementation details are not entirely clear; notably, are the transformation sequences a natural progression of actions (for example, step-by-step iterations of a fixed rotation amount), or random transformations structured as a sequence?
- There is no direct analysis of representation equivariance. While the R² rotation prediction acts as a proxy, it is not a measure of equivariance.
- Some benchmark results (notably, SIE) seem surprisingly low compared to their original reporting, which raises concerns about the implementation and comparisons; if the authors address this in the rebuttal, it would add clarity.
- Other works such as [1] utilise a multi-view loss, and while not sequential like the method in question, is a good analogue for comparison, and arguably more comparable.
[1] Wang, J., Chen, Y. and Yu, S.X., 2024, September. Pose-aware self-supervised learning with viewpoint trajectory regularization. In European Conference on Computer Vision (pp. 19-37). Cham: Springer Nature Switzerland.
Questions
- The rotation performance of SIE in table 1 is much lower than reported in their work. Can you explain this drop in performance?
- Can you clarify in the tables which results are obtained with the MLP or Transformer projector networks for the benchmarks?
- Have the authors explored the performance on unseen rotations / transformations?
- What is the additional compute overhead compared to prior works? This would help demonstrate the performance-efficiency trade-off.
Limitations
yes
Justification for Final Rating
I have maintained my already positive score, given the additional evaluations presented during the rebuttal.
Formatting Concerns
No formatting concerns.
We thank the reviewer for their thoughtful and detailed review. Below, we address your questions and concerns.
On Substantiating Equivariance Beyond Transformation Prediction
We agree that demonstrating full functional equivariance—where the representation transforms in a predictable manner under input transformations—is stronger than transformation regression alone. Regression is often used since functional equivariance is hard to quantify in high dimensions. Therefore, our evaluation follows the established practice in equivariant SSL, where transformation prediction is used as a practical proxy for equivariance [B–D].
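As an illustration of this proxy, here is a minimal sketch of a regression probe from frozen representations to rotation parameters; the ridge regressor and random placeholder data are illustrative and not the exact probe configuration used in the paper:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# feats: frozen encoder representations; rots: rotation targets (e.g., quaternions).
# Random placeholders stand in for real data here.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512))
rots = rng.normal(size=(1000, 4))

split = 800
probe = Ridge(alpha=1.0).fit(feats[:split], rots[:split])
pred = probe.predict(feats[split:])
print("rotation-prediction R^2:", r2_score(rots[split:], pred))
```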
To go beyond this, we also provided qualitative evidence in Fig 5 (left and mid panels), which shows that UMAP projections of encoder representations form smooth, structured orbits around object category centers, with a continuous gradient across the rotation angle (see the visible alignment between the red band in class-based coloring and its smooth rotation-based coloring in Fig 5). This behavior is consistent with a structured, transformation-aware latent geometry and offers indirect evidence of equivariance.
That said, we agree that direct testing (e.g., using linear equivariant maps or commutativity checks with group actions) would provide more rigorous validation. We consider this a valuable direction for follow-up work, especially in the context of latent world modeling where group structures may be implicit, partial, or noisy.
Clarifying Transformation Sequence Generation
In our current setup, transformation sequences are externally defined and fixed during training; there is no learned policy over actions. Specifically:
- 3DIEBench: Each object is rendered with a random sequence of 3D rotations, sampled uniformly across axes. These sequences are independent and diverse.
- PLS: Saccade trajectories are generated using saliency maps with inhibition-of-return (IoR), producing structured, non-repetitive sequences over natural STL-10 images. This simulates foveated active vision and enables real-world applicability.
- CIFAR-100 / TinyImageNet: Sequences are composed from standard data augmentations (crop, jitter, blur) with known parameters. "Actions" are defined as the difference between consecutive augmentation parameters.
Thus, our framework handles both randomized and structured transformation regimes.
On the SIE Performance Drop
Thank you for flagging this. The reported SIE performance drop is due to differences in hyperparameter regimes. The original SIE paper trained with: 2000 epochs, 256×256 input resolution, and batch size 1024.
In contrast, as noted in Section 3 of our paper, we standardized training across all baselines (including SIE) to: 1000 epochs, 128×128 resolution, and batch size 512. This was necessary to enable a fair comparison across baselines within our compute budget.
Comparison to [A]
We thank the reviewer for suggesting this comparison. [A] proposes a regularization that constrains adjacent views (e.g., with small pose deltas) to lie on smooth trajectories in latent space. This reflects a “local linearity” prior, which is well-motivated for passive or smooth environments (e.g., robotic arms, camera pans).
In contrast, seq-JEPA is designed for active perception settings—where transitions may be sparse, discontinuous, or highly non-local (e.g., long-range saccades, or abrupt object rotations). It assumes no trajectory smoothness or small-angle assumptions.
We empirically compare against [A] by adding it as a baseline across three datasets. We implemented their best-performing model variant (VICReg + trajectory regularization) and tuned the regularization strength on 3DIEBench across values {0.001, 0.01, 0.1, 1.0}; the best-performing value (in terms of top-1 accuracy and rotation prediction) was then used for the other two datasets. Results are given below:
| Setting\Metric | Top-1 acc | R² (relative rot) | R² (abs rot) |
|---|---|---|---|
| 3DIEBench | 81.26 | 0.27 | 0.43 |
| Setting\Metric | Top-1 acc | R² (crop) | R² (jitter) | R² (blur) |
|---|---|---|---|---|
| CIFAR100 | 61.07 | 0.46 | 0.14 | 0.02 |
| Tiny ImageNet | 34.95 | 0.28 | 0.09 | 0.16 |
We see that trajectory regularization indeed improves over VICReg (compare with tables 1 and 2 in the paper), which shows that imposing geometric priors can improve both invariant and equivariant performance (even when the priors do not fully materialize in the environment, e.g., when we have non-smooth angle changes as in 3DIEBench).
Unseen Transformation Generalization (OOD Rotations)
Since 3DIEBench does not include an OOD test set, we created one ourselves to enable evaluation on unseen transformations. The original dataset samples rotations from $(-\pi/2, \pi/2)$. Our OOD test set instead uses the disjoint range $(-\pi, -\pi/2) \cup (\pi/2, \pi)$, ensuring no angular overlap with the training set.
This makes the task strictly extrapolative, unlike [A], which evaluates models on denser but still interpolative pose samples from within the same angular support.
Rendering followed SIE’s protocol and took ~5 hours on a single A6000 GPU. We will release this benchmark publicly.
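A minimal sketch of how such a disjoint angular split can be drawn (illustrative only; the actual OOD set was rendered with SIE's pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_id_angles(n):
    """In-distribution Euler angles, matching the original 3DIEBench range."""
    return rng.uniform(-np.pi / 2, np.pi / 2, size=(n, 3))

def sample_ood_angles(n):
    """OOD angles drawn from (-pi, -pi/2) U (pi/2, pi): no overlap with training."""
    magnitude = rng.uniform(np.pi / 2, np.pi, size=(n, 3))
    sign = rng.choice([-1.0, 1.0], size=(n, 3))
    return sign * magnitude

ood = sample_ood_angles(1000)
assert np.all(np.abs(ood) > np.pi / 2)  # strictly extrapolative w.r.t. the training range
```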
From the table below, we see that all methods fail at OOD rotation decoding (with R² values even reaching near −1.0). The sharp drop in R² values may be attributed to a domain shift in transformation geometry: while representations are well-aligned with ID rotation trajectories (inside the range $(-\pi/2, \pi/2)$), they are not geometrically structured to extrapolate beyond this range to $(-\pi, -\pi/2) \cup (\pi/2, \pi)$. The near −1.0 R² values likely result from a reversal in regression sign: the learned probe fails to interpret latent geometry correctly in the unseen region and predicts anti-correlated values. This supports our interpretation that the latent space does not generalize to novel angular regions, despite appearing structured within the ID range.
Despite the failure in OOD equivariance generalization, seq-JEPA exhibits a graceful degradation in invariant classification accuracy compared to other baselines. We hypothesize that this robustness stems from its sequence-level aggregation mechanism: even when latent representations become less equivariant under OOD transformations, aggregation across multiple views still filters out transformation-specific variability and recovers shared semantic content. Thus, seq-JEPA maintains object identity better under distribution shift, indicating that aggregation over input views promotes invariance, even when equivariance is imperfect.
| Setting\Metric | Top-1 Acc (drop) | OOD R² (rel rot) | OOD R² (abs rot) |
|---|---|---|---|
| SimCLR | 63.86 (-17.27) | -0.41 | -0.201 |
| BYOL | 55.68 (-27.22) | -0.25 | -0.198 |
| VICReg | 61.28 (-19.20) | -0.31 | -0.206 |
| SEN | 60.03 (-23.40) | -0.41 | -0.201 |
| SIE | 60.19 (-17.30) | -0.63 | -0.200 |
| EquiMoD | 58.54 (-25.75) | -0.45 | -0.206 |
| Cond. BYOL | 57.91 (-24.70) | -0.42 | -0.202 |
| VICRegTraj | 62.94 (-18.32) | -0.36 | -0.214 |
| seq-JEPA (1,1) | 61.53 (-22.55) | -0.69 | -0.199 |
| seq-JEPA (3,3) | 65.03 (-21.11) | -0.71 | -0.211 |
Clarifying MLP vs. Transformer Projectors
Training with a transformer projector (for models originally with an MLP projector) was designed to isolate the contribution of model capacity by controlling for architectural differences in the projection head. Specifically, we tested whether transformer-based projectors alone account for any performance gains. We did not see any benefit from switching to transformer projectors in any of the baselines and the results in Tables 1 and 2 reflect original MLP-based models. We will clarify this in the paper and include the full transformer-projector results in the appendix. Below, you can see these results for the baselines on 3DIEBench (plus sign indicates a transformer projector).
| Setting\Metric | Top-1 Acc | R² (rel rot) | R² (abs rot) |
|---|---|---|---|
| SimCLR+ | 77.92 | 0.32 | 0.48 |
| BYOL+ | 81.05 | 0.09 | 0.16 |
| VICReg+ | 75.10 | 0.15 | 0.28 |
| VICRegTraj+ | 77.82 | 0.24 | 0.36 |
| SEN+ | 81.74 | 0.31 | 0.44 |
| Cond. BYOL+ | 73.62 | 0.27 | 0.36 |
Time and Memory Comparison
We kindly refer the reviewer to our FLOP and GPU-hour comparisons in the response to Reviewer 4. In summary, seq-JEPA maintains reasonable compute overhead (e.g., 12.3–15.1 GPU-hours vs. 11–12.5 GPU-hours for baselines), and can amortize sequence cost by using low-resolution glimpses (e.g., a sequence of small foveated crops from a large image), which are viable in embodied vision (as exemplified in our PLS setup, where low-resolution glimpses support effective sequential aggregation).
References:
[A] Pose-aware Self-supervised Learning with Viewpoint Trajectory Regularization (Wang et al., ECCV 2024)
[B] Self-supervised Learning of Split Invariant Equivariant Representations (Garrido et al., ICML 2023)
[C] Learning and Leveraging World Models in Visual Representation Learning (Garrido et al., 2024)
[D] Contrastive-equivariant Self-supervised Learning Improves Alignment with Primate Visual Area IT (Yerxa et al., NeurIPS 2024)
Thank you for the clarifications.
Regarding the additional evaluations of equivariance, some quantitative retrieval measures could be presented to substantiate the claims such as PRE, MRR and H@K as presented in SIE. While this evaluation has its flaws, it can provide more tangible and interpretable measure of predicable transformations when compared with qualitative measures.
Thank you for the information on the transformation sequences, if accepted please add this more clearly to the manuscript as also noted by other reviewers.
While I completely understand and appreciate the computational limitations in play, it would be good to also present some information on the convergence of the methods analysed. From experience, training SIE for only 1000 epochs does not reach convergence; in fact, performance steadily increases up to 2000 epochs. Therefore, while your comparisons are fair, it must be stated more clearly that the compared methods use hyperparameter configurations that are sub-optimal. However, if your method converges faster, then this should be praised.
The additional results are appreciated.
We thank the reviewer for their constructive feedback and for highlighting two valuable points for further clarification. Below, we address these points.
Equivariance predictor evaluation metrics
We agree that retrieval-based predictor metrics offer a useful complement to rotation regression (R²) in evaluating equivariance. Following the protocol in SIE [A], we computed MRR, Hit@1, and Hit@5 on the 3DIEBench validation set for seq-JEPA and other predictor-based baselines. Results for SIE, EquiMod, and two seq-JEPA variants are shown below:
| Setting | MRR | H@1 | H@5 |
|---|---|---|---|
| SIE | 0.319 | 0.215 | 0.404 |
| EquiMod | 0.136 | 0.037 | 0.186 |
| seq-JEPA (1,1) | 0.340 | 0.241 | 0.442 |
| seq-JEPA (3,3) | 0.388 | 0.273 | 0.468 |
These results confirm that seq-JEPA achieves strong equivariance performance not only in terms of R², but also in terms of top-rank retrieval metrics.
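For reference, a small sketch of how such retrieval metrics can be computed from a query-gallery similarity matrix (assuming one ground-truth gallery item per query; variable names are illustrative):

```python
import numpy as np

def retrieval_metrics(sim, target_idx, ks=(1, 5)):
    """sim: (N, M) query-to-gallery similarities; target_idx: (N,) ground-truth indices."""
    order = np.argsort(-sim, axis=1)                              # most similar first
    ranks = np.argmax(order == target_idx[:, None], axis=1) + 1   # 1-indexed rank of target
    out = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        out[f"H@{k}"] = float(np.mean(ranks <= k))
    return out

# Toy example: 4 queries against a gallery of 10 items.
rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 10))
print(retrieval_metrics(sim, np.array([2, 7, 0, 5])))
```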
On convergence and high-compute regime
To address the second point and examine convergence under the high-compute regime used by SIE, we trained five models in this regime, i.e., using 2000 epochs, 256×256 resolution, and batch size 1024. Each of these experiments was run on 4 A100 GPUs (our main experiments were run on a single A100).
Below are the evaluation results:
| Setting | Top-1 Acc | R² (rel rot) | R² (abs rot) | MRR | H@1 | H@5 |
|---|---|---|---|---|---|---|
| SIE* | 82.652 | 0.721 | 0.764 | 0.411 | 0.287 | 0.490 |
| SimCLR* | 85.961 | 0.473 | 0.609 | – | – | – |
| EquiMod* | 86.833 | 0.492 | 0.625 | 0.154 | 0.048 | 0.201 |
| seq-JEPA* (1,1) | 85.370 | 0.661 | 0.713 | 0.365 | 0.263 | 0.447 |
| seq-JEPA* (3,3) | 87.581 | 0.736 | 0.781 | 0.419 | 0.282 | 0.483 |
(* denotes high-compute regime)
These results confirm that seq-JEPA achieves a strong performance in the high-compute regime without suffering from a trade-off between invariance and equivariance. Importantly, our method achieves near-saturated performance already in the low-compute regime, indicating that it requires fewer epochs for convergence and is less sensitive to input resolution and batch size than competing methods.
We thank the reviewer again for these helpful suggestions. We will include the new results and clarifications in the final version of the paper.
References
[A] Garrido, Quentin, et al. Self-supervised Learning of Split Invariant Equivariant Representations. ICML 2023.
The authors propose a method to create representations that are both equivariant or invariant using a novel self-supervised architecture. They achieve this by augmenting input images with a sequence of transformed views paired with the corresponding transform taken to obtain them. By using both the image and the corresponding action taken in a JEPA-like approach, they manage to train a model that is both equivariant and invariant without the need to use any auxiliary losses. They validate that this is indeed the case using linear probes and they compare to standard baselines in both equivariant and invariant settings.
Strengths and Weaknesses
Strengths:
- Problem is well defined and relevance clearly established.
- Architecture idea is intuitive and easy to understand, with a clear explanation of the different components.
- Experiments are well described, and the models compared against are relevant to the hypothesis.
- Evaluation protocols are also clear.
Weakness:
- Reporting just the evaluation scores (e.g., as in Table 1) clearly establishes that the model retains much more information about the transformations (i.e., becomes equivariant), but it does not establish exactly why this is. Presumably this is because of the self-supervised training, but some more data would be nice.
Questions
As mentioned above, it is not clear to me why the model keeps the action information when aggregating. I assume this is because it makes predicting the next representation easier during training. Can the authors confirm this, or provide whatever reason they hypothesized is behind it?
Limitations
The authors appear to have addressed or acknowledged every limitation I can think of.
Formatting Concerns
None.
Response to Reviewer 2:
We thank the reviewer for their thoughtful and encouraging feedback. We are glad you found the problem setup well-defined, the architecture intuitive, and the evaluations thorough. Below, we address your main question and provide our intuition for how equivariant and invariant representations emerge in seq-JEPA—namely, why the encoder retains action information (equivariance) and how the aggregator yields invariant representations.
Why does seq-JEPA retain action information and develop equivariance?
We begin with some background. In contrastive SSL with the InfoNCE objective, it has been shown that, under certain assumptions, the learned representations of a pair of observations can recover the underlying data-generating factors [A]. However, this recovery only holds for factors shared across the pair, meaning factors that are invariant (e.g., object category). Transformation- or action-specific latents (e.g., pose or viewpoint) are, by design, not captured.
While non-contrastive predictive methods (e.g., BYOL and JEPA variants, including seq-JEPA) do not offer formal identifiability guarantees, they implicitly solve a latent-space world modeling task, aiming to model transitions of the form $p(z_{t+1} \mid z_t, a_t)$. If the model does not have access to the action $a_t$, it must implicitly marginalize over it, leading to greater ambiguity and a broader solution space. In contrast, when $a_t$ (or a reliable estimate; see our response to Reviewer 4) is available, the encoder is incentivized to preserve action-relevant information. This makes the conditional prediction more tractable and structured (action-conditioning makes it easier to predict the next state).
Why does aggregation induce invariance?
Given the above, seq-JEPA exhibits two other emergent behaviors:
First, the aggregator naturally promotes invariance. It processes multiple transformed views (e.g., rotated object images) whose representations orbit around the category center in latent space (Figure 5, left panel). By attending to and averaging these, the aggregator filters out transformation-dependent components while reinforcing shared content—i.e., invariant factors. As a result, the aggregated representation gravitates toward the center of the class cluster, reducing intra-class variance and increasing inter-class separability in the aggregated latent space (see Figure 5, right panel).
Second, equivariance improves by observing longer sequences of action-observation pairs during training. Feeding additional action-observation tuples into the transformer-based aggregator creates a richer context. This aggregated context is passed to the predictor, which learns to jointly encode it together with the next action in order to predict the future state. With access to a longer sequence, i.e., more transitions in working memory and a richer context, the predictor becomes better at capturing how specific actions influence latent dynamics. This enables it to more accurately approximate the transition distribution $p(z_{t+1} \mid z_t, a_t)$. Because accurate prediction requires the model to preserve and utilize information about the action, the encoder is implicitly encouraged to learn structured, more equivariant representations. Empirically, we indeed observe that increasing the training sequence length leads to improved equivariance performance (see Figure 7, left panel).
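To make the two mechanisms above concrete, here is a heavily simplified sketch of the pipeline (an illustration under our assumptions, not the exact implementation: the linear stand-in encoder, mean-pooled readout, and layer sizes are placeholders). An encoder yields per-view codes, actions are embedded and concatenated to them, a transformer aggregates the sequence into a context, and a predictor maps the context plus the next action to a predicted next-state representation.

```python
import torch
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    def __init__(self, enc_dim=512, act_dim=4, act_emb=128, n_heads=4, n_layers=2):
        super().__init__()
        # Stand-in for the image encoder (a ConvNet in practice).
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(enc_dim))
        self.act_proj = nn.Linear(act_dim, act_emb)  # learnable action embedding
        layer = nn.TransformerEncoderLayer(d_model=enc_dim + act_emb,
                                           nhead=n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.predictor = nn.Sequential(
            nn.Linear(enc_dim + 2 * act_emb, enc_dim), nn.ReLU(), nn.Linear(enc_dim, enc_dim))

    def forward(self, views, actions, next_action):
        # views: (B, T, C, H, W); actions: (B, T, act_dim); next_action: (B, act_dim)
        B, T = views.shape[:2]
        z = self.encoder(views.flatten(0, 1)).view(B, T, -1)     # per-view (equivariant) codes
        tokens = torch.cat([z, self.act_proj(actions)], dim=-1)  # action-conditioned tokens
        ctx = self.aggregator(tokens).mean(dim=1)                # pooled (invariant) context
        return self.predictor(torch.cat([ctx, self.act_proj(next_action)], dim=-1))

model = SeqJEPASketch()
pred = model(torch.randn(2, 3, 3, 64, 64), torch.randn(2, 3, 4), torch.randn(2, 4))
# Training would compare `pred` against a target encoding of the next view,
# e.g., with an MSE/cosine loss and a stop-gradient or EMA target encoder.
```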
Emergence vs. enforcement
We emphasize that the emergence of both invariant and equivariant representations in seq-JEPA is not enforced explicitly, but rather emerges from its predictive world modeling objective and inductive biases. Our discussion above clarifies how components like action conditioning, sequential aggregation, and predictive modeling each contribute to this separation.
Notably, the model is not constrained to learn this decomposition. In principle, it could converge to a trivial solution by marginalizing over actions and learning a fully invariant world model. However, this does not occur in practice. Instead, the design of seq-JEPA naturally encourages the encoder to retain transformation-sensitive information (to predict transitions), while the aggregator averages over such views (to form invariant summaries). We view this emergent disentanglement as a key strength of the method and a promising direction for further investigation.
Reference:
[A] Zimmermann, Roland S., et al. Contrastive learning inverts the data generating process. ICML 2021.
I thank the authors for their detailed response. I was convinced of the value of their work before and thus maintain my score. I encourage them to incorporate some of their own comments above into the text as it can help future readers gain better understanding of how seq-JEPA works.
The submission introduces a novel self-supervised learning approach, referred to as seq-JEPA, that is able to learn a combination of invariant and equivariant representations of vision data. This is accomplished via architectural constraints, rather than through the use of a loss function that promotes these properties. The experimental evaluation demonstrates that the representations learned via this approach are useful for both determining the relative pose between images of the same object (presumably via equivariant features) while also being effective for classification (presumably via invariant features). There is also some investigation into the role of sequence lengths used in the training and inference process as well as the role of action conditioning.
Strengths and Weaknesses
The paper is quite well-written. I found it quite easy to follow what was going on, for the most part, but I do have one clarification question below.
The proposed approach is quite simple; different transformations of the same image are embedded into a feature space, paired with a transformation description, serialised into a sequence, and fed into a transformer. The goal is to predict the transformation description of the final image in the sequence. This avoids much of the complexity associated with some SSL pipelines.
The tasks used to evaluate the proposed approach are reasonably sensible; there is some evidence from prior work to suggest that invariance and equivariance are important for object recognition and camera pose estimation, respectively. I am less familiar with the experiment related to human vision, but from a non-specialist point of view this seems fine. The ablations answer the main questions I had about designs of the method, so these are also appreciated.
I would have liked to see more discussion and quantitative comparison with other work that has aimed to develop representation learning approaches for similar settings. E.g., [A] develop a method for training a single feature extractor that is able to encode both invariance and sensitivities to various transformations. It would be good to understand to what extent the current work is trying to solve a similar problem, and how the solutions match up.
[A] Chavhan et al. Quality Diversity for Visual Pre-Training. In ICCV, 2023.
Questions
Can you provide more detail about how the actions are represented? This was not very clear in the manuscript.
Limitations
Limitations are discussed in the paper.
Justification for Final Rating
My overall assessment of the paper (somewhat positive) is unchanged by the authors' response.
Formatting Concerns
N/A
Response to Reviewer 1:
We thank the reviewer for their constructive comments and for highlighting the clarity and simplicity of our approach. We are pleased that you found the method easy to follow and appreciated the ablations and task selection. Below, we address your main concerns and clarify your question about the representation of actions.
Connection to [A]:
We thank the reviewer for pointing out the relevant work of Chavhan et al. [A], which indeed bears a conceptual connection to ours. In [A], the authors use an ensemble of heads trained on top of a base encoder (MoCo) to span a diverse spectrum of transformation sensitivities across the latent spaces of each head (encouraged via a diversity loss). Downstream probes then learn to linearly combine these feature heads depending on the desired invariance-equivariance trade-off.
While the motivation is related, seq‑JEPA offers a fundamentally different and more architectural approach. In our framework, equivariant and invariant representations emerge within a single model, with no need for additional heads or specialized loss functions. Specifically, the encoder produces equivariant representations, while the sequence aggregator (transformer) forms an invariant representation by aggregating views. This dual structure arises purely from our architectural inductive bias and predictive learning objective, which enables seq-JEPA to navigate the invariance-equivariance trade-off through its core architectural design, without requiring explicit regularization or auxiliary heads.
We will add a discussion of this connection in the final version to better position our work in the landscape of hybrid invariant-equivariant representation learning.
Clarification on Action Representation:
As noted in Section 2.2 of the paper, the relative transformations (actions) $a_i$ are defined as the transformation that maps $x_i$ to $x_{i+1}$, i.e., $a_i := t_{i \to i+1}$.
The specific instantiation of the action vector depends on the transformation setting. We have provided these details at the end of Appendix A, but we will clarify this more explicitly in the main text in the final version of the paper:
- 3DIEBench (Rotations): We use the 4D quaternion vector representing the relative rotation between two views, consistent with the original benchmark [B].
- Predictive Learning Across Saccades: The action is a 2D vector indicating the normalized relative (x, y) position between the centers of two patches, simulating a saccadic eye movement.
- Hand-Crafted Augmentations (Crop, Jitter, Blur): Each augmentation is parameterized as follows, and the corresponding action vector $a_i$ is computed as the difference in parameters across views:
  - Crop: [x, y, height, width] (4D)
  - Color jitter: [brightness, contrast, saturation, hue] (4D)
  - Blur: [kernel standard deviation] (1D)
In all settings, the action vector is passed through a learnable linear projector (default: 128D), then concatenated to the visual representation before being passed to the transformer aggregator.
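A compact illustration of the above (a sketch following the description here; variable names and the batch size are illustrative):

```python
import torch
import torch.nn as nn

# Raw action dimensionalities per setting, as described above.
ACTION_DIMS = {"rotation_quaternion": 4, "saccade_xy": 2, "crop": 4, "color_jitter": 4, "blur": 1}

enc_dim, act_emb = 512, 128
# Learnable linear projector for the action; the default embedding size is 128-D.
proj = nn.Linear(ACTION_DIMS["rotation_quaternion"], act_emb)

z = torch.randn(8, enc_dim)                             # encoder output for one view (batch of 8)
a = torch.randn(8, ACTION_DIMS["rotation_quaternion"])  # relative rotation quaternion
token = torch.cat([z, proj(a)], dim=-1)                 # (8, 640) token passed to the aggregator
```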
We again thank the reviewer for their thoughtful feedback and hope the above clarifications, along with the related responses to other reviewers, help further contextualize our design choices and contributions.
References:
[A] Chavhan, Ruchika, et al. Quality Diversity for Visual Pre-Training. ICCV 2023.
[B] Garrido, Quentin, Laurent Najman, and Yann Lecun. Self-supervised learning of split invariant equivariant representations. arXiv preprint arXiv:2302.10283 (2023).
Thank you for the clarification on action representations, that helps. This information should be moved to the camera ready copy if the paper is accepted.
Regarding the discussion of prior work: some details about the differences/similarities in the mechanisms used to achieve both invariance and equivariance are good. However, I was looking for a more high-level discussion about the similarity of the problems being solved by related methods. I.e., can the proposed approach be used as a drop-in replacement for previous work that has sought to develop invariant-equivariant representations? If not, what is different about the motivation of the proposed approach?
We thank the reviewer for their follow-up and are glad the clarification on action representations was helpful. We will make sure to reflect this clarification in the final version of the paper.
We also appreciate the deeper question regarding the broader problem setting and interchangeability with related methods. While methods such as [A] and [B] share the high-level goal of learning both invariant and equivariant features, they differ from seq-JEPA in motivation, mechanism, and operational context:
In [A] and [B], the goal is to explicitly diversify the feature space by engineering representations with varying sensitivity to transformations. For example, [A] trains an ensemble of encoder heads, each encouraged via auxiliary losses to capture different points along the invariance-equivariance spectrum. [B] uses similar mechanisms with projection heads and multiple contrastive losses. These methods rely on explicit enforcement, using loss functions and architectural components to span the desired spectrum. The learned representations are then selected or combined at test time to match downstream task requirements.
In contrast, seq-JEPA is motivated by predictive world modeling in active perception. It models transitions of the form $p(z_{t+1} \mid z_t, a_t)$, where the agent observes transformed views. In this setting, equivariance and invariance to transformations are not explicitly enforced goals, but instead emerge naturally as a consequence of the model's predictive objective. Specifically, equivariance emerges from the encoder's need to retain action-relevant structure for accurate prediction, while invariance emerges through aggregation across diverse transformed views (see our rebuttal to Reviewer 2, where we detail how these properties arise from the predictive objective and sequence aggregation). Thus, the mechanism is emergence, not enforcement.
To directly address the reviewer’s question: seq-JEPA is not intended as a drop-in replacement for approaches like [A] or [B]. These methods assume a passive, feature-centric setup where transformation sensitivity is tuned for downstream transfer. In contrast, seq-JEPA is designed for predictive world modeling, where modeling transformations is part of the learning signal. The inductive biases and learning dynamics are tailored for agents that observe and act in their environment. That said, we view these approaches as complementary, and believe seq-JEPA opens the door to unifying predictive modeling with the kind of spectrum-based feature control as in [A] and [B].
We will incorporate these clarifications in the final version of the paper.
Again, we appreciate the reviewer’s engagement and hope our clarifications have helped better contextualize our contributions and situate our work more precisely within the broader literature.
References
[A] Quality Diversity for Visual Pre-Training, ICCV 2023
[B] Xiao, Tete, et al. "What Should Not Be Contrastive in Contrastive Learning." ICLR 2021
Summary of paper:
This paper introduces seq-JEPA, a self-supervised framework that learns representations which are simultaneously invariant and equivariant to transformations. By embedding transformation sequences and predicting their outcomes, the method avoids auxiliary losses and achieves strong performance across invariant, equivariant, and transformation prediction tasks.
Summary of Reviews:
The reviewers agree that the paper is well-written, the problem well-motivated, and the architecture intuitive. Strengths highlighted include the clarity of experimental protocols, the novelty of jointly modeling invariance and equivariance, and the broad applicability across benchmarks. Concerns focused on the level of comparison with prior work, the role of ground-truth transformations, and the computational overhead. Reviewers gave borderline-to-positive ratings, but engaged constructively with the rebuttal.
Decision Rationale:
The authors’ response directly addressed reviewer concerns: they clarified design choices, provided additional comparisons (including time/memory trade-offs), and explained the role of learned representations in capturing invariance/equivariance. The discussion further demonstrated the robustness and novelty of the approach. Taken together, the paper advances representation learning with a clear, well-executed idea, strong results, and thoughtful positioning relative to existing methods. While some limitations remain, they do not outweigh the contributions.