PaperHub
Overall score: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.3
Novelty: 2.3 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.3
NeurIPS 2025

Token Bottleneck: One Token to Remember Dynamics

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We introduce a simple yet effective pipeline called Token Bottleneck that conservatively summarizes the observed scene into a single bottleneck token while enabling the capture of dynamic transitions through that token.
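For a concrete picture of the pretext task, here is a minimal PyTorch-style sketch of how we read the TL;DR and the reviewer summaries below. All names (encoder, decoder, pred_head, mask_token) and the masking details are our own placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchify(imgs, p=16):
    # (B, 3, H, W) -> (B, N, p*p*3) pixel patches, used as the reconstruction target.
    return F.unfold(imgs, kernel_size=p, stride=p).transpose(1, 2)

def tobo_step_sketch(encoder, decoder, pred_head, mask_token,
                     src_frame, tgt_frame, mask_ratio=0.75):
    """One ToBo-style pretraining step as we understand it; every module name here
    is a placeholder, not the authors' code."""
    B = src_frame.shape[0]

    # 1. Conservatively summarize the source frame into a single bottleneck token
    #    (e.g., the class-token output of a ViT encoder).
    bottleneck = encoder(src_frame)[:, :1]                     # (B, 1, D)

    # 2. Keep only a few visible patch tokens of the target frame.
    #    (A real implementation would mask before encoding for efficiency.)
    patch_tokens = encoder(tgt_frame)[:, 1:]                   # (B, N, D), class token dropped
    N = patch_tokens.shape[1]
    ids = torch.randperm(N, device=src_frame.device)
    n_vis = int(N * (1 - mask_ratio))
    visible, masked_ids = patch_tokens[:, ids[:n_vis]], ids[n_vis:]

    # 3. Reconstruct the masked target patches from the bottleneck token plus the
    #    visible patches, so scene dynamics must flow through the bottleneck token.
    mask_toks = mask_token.expand(B, masked_ids.numel(), -1)   # learned [MASK] embedding
    decoded = decoder(bottleneck, visible, mask_toks)          # joint self-attention decoding
    pred = pred_head(decoded[:, -masked_ids.numel():])         # predictions for masked slots

    # 4. Pixel-regression loss on the masked patches (MAE-style objective).
    target = patchify(tgt_frame)[:, masked_ids]
    return F.mse_loss(pred, target)
```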

Abstract

Keywords
Self-supervised Learning · Representation Learning

Reviews and Discussion

Review
Rating: 4

This work depicts that a single token is sufficient to compress the spatial and temporal dynamics of a video. Specifically, this work demonstrates that utilizing the class token of the reference frame in conjunction with a few target patches facilitates the learning of sequential scene representations.

Strengths and Weaknesses

Strengths:

  • The paper is predominantly well-written.
  • Extensive experiments on various downstream tasks such as robot policy learning have been conducted.
  • The main insight presented in the paper, that utilizing a single token from the reference frame instead of all the patches is more beneficial, is very interesting.

Weaknesses:

  • The decoder architecture is not clear. While the visualization in figure 3 indicates attention from the bottleneck token to the unmasked patches, it is mentioned in L-156 that self-attention is employed. A better description would be helpful.
  • While the experiments are predominantly comprehensive, more comparisons with works performing masked image modeling in latent space, such as iBOT, DINOv2, and T-CORE, could have been conducted.

Questions

See weaknesses

Limitations

yes

Final Justification

The authors have addressed many of my concerns and I believe that this paper can be accepted. So I’m maintaining my rating.

Formatting Concerns

none

Author Response

We sincerely thank Reviewer 3Edo for the thoughtful comments and constructive suggestions. Below, we provide detailed responses to each point.

W1. Clarification on the description of the decoder architecture

We apologize for the confusion. During decoding, the visible and masked target patches are concatenated with the bottleneck token and then fed into the self-attention layer of the decoder. We will clarify this description in the revised version.
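To make the described decoding step concrete, here is a minimal PyTorch-style sketch of one decoder block under that description; the module and argument names are ours, and the actual decoder may differ in depth and details.

```python
import torch
import torch.nn as nn

class ToBoDecoderBlockSketch(nn.Module):
    """One self-attention decoder block over [bottleneck | visible | mask] tokens (illustrative only)."""
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, bottleneck_tok, visible_tgt, mask_toks):
        # Concatenate the bottleneck token with the visible and masked target patches,
        # then apply plain self-attention over the joint sequence, so every mask
        # token can attend to both the bottleneck token and the visible patches.
        x = torch.cat([bottleneck_tok, visible_tgt, mask_toks], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```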

W2. Comparison with iBOT, DINOv2, and T-CORE

We provide comparison results for DINOv2 [11], iBOT [15], and T-CORE [16] on five tasks in Franka Kitchen. Since we already provided the performance of DINOv2 on Franka Kitchen in the supplementary material (Table A), we directly report those results in Table R13. Moreover, we further evaluate the performance of iBOT and T-CORE. As shown in Table R13, our ToBo surpasses DINOv2, iBOT, and T-CORE on all five tasks. We will include the performance of DINOv2, iBOT, and T-CORE on other benchmarks in the revised version.

[Table R13. Comparison of ToBo with DINOv2 and T-CORE on Franka Kitchen]

Method       | Knob1 on | Light on | Sdoor open | Ldoor open | Micro open
DINOv2       | 27.3     | 38.7     | 82.0       | 16.0       | 20.0
iBOT         | 37.3     | 55.3     | 75.3       | 12.7       | 47.3
T-CORE       | 7.3      | 17.3     | 69.3       | 11.3       | 8.0
ToBo (ours)  | 57.3     | 82.0     | 95.0       | 51.0       | 55.0

References

[11] Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," arXiv:2304.07193.

[15] Zhou et al., "iBOT: Image BERT Pre-Training with Online Tokenizer," ICLR 2022.

[16] Liu et al., "When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning," CVPR 2025.

Comment

I would like to thank the authors for their rebuttal. All of my concerns have been well addressed. However, I do think that other reviewers have raised valid points and would like to wait for their opinions on the rebuttals before making a decision on my final rating.

Comment

Dear reviewer 3Edo,

We appreciate your timely response to the authors. We nevertheless encourage you to update your score based on your own assessment, since the other reviewers' concerns will already be factored into their own scores. The authors have also posted rebuttals to the other reviewers, should you wish to take those into account.

-- Your AC

Comment

I would like to maintain my previous positive rating.

Comment

Thank you again for your valuable comments! We're glad that your concerns have been addressed. Please feel free to let us know if there are any remaining points we can further clarify!

Review
Rating: 4

The authors propose a self-supervised vision pretraining method for sequential scene-based tasks by bottlenecking the representation using a single token. The authors demonstrate the effectiveness of this approach on simulated and real-world robotics tasks, with significant improvements over other methods like SimCLR and MAE (and its variants). The authors also demonstrate superior performance of their proposed method on label (mask, pose) propagation tasks over these baselines.

Strengths and Weaknesses

Strengths

  1. The authors propose a novel self-supervised pretraining method to capture both temporal dynamics and summarize scene information in the same embedding.
  2. The authors demonstrate the superior performance of their proposed method in robotics tasks and label propagation when compared to other SSL methods like MAE variants and SimCLR.

Weaknesses

  1. The authors do not compare with newer self-supervised vision pretraining methods like DINOv2 (except Franka Kitchen) and I/V-JEPA on all the tasks. This is required to contextualise the performance of the proposed method in the current SSL landscape.
  2. The authors need to demonstrate the scaling property of the method, when trained on more data and not just a larger model.
  3. The improvements in label propagation are much smaller than the improvements in robotics policy learning tasks. This needs to be explained/investigated. Is this an artefact of the agent's policy architecture being more suited to the proposed SSL method as opposed to the other SSL methods? Is the agent policy architecture the same for all the robotics tasks?
  4. While the authors focus on bottlenecking the representation using a single embedding, it would be interesting to see the effect of increasing the number of embeddings in the bottleneck effectively relaxing the constraint gradually.

Questions

see weaknesses

Limitations

yes

Final Justification

The proposed SSL method beats SOTA methods like I/V-JEPA and DINOv2 on robotics tasks and can scale with increased training data.

Formatting Concerns

NA

Author Response

We sincerely thank Reviewer dztw for the thoughtful comments and constructive suggestions. Below, we provide detailed responses to each point.

W1. Comparison with DINOv2 (beyond Franka Kitchen), I-JEPA, and V-JEPA

Following the comments, we provide comparison results for DINOv2 [11], I-JEPA [12], and V-JEPA [13]. We report the performance of I-JEPA and V-JEPA on five tasks of Franka Kitchen in Table R10. Moreover, since we already provided the performance of DINOv2 on Franka Kitchen in the supplementary material (Table A), we additionally compare ToBo with DINOv2 on the 'relocate' and 'pen' tasks in Adroit in Table R11. As shown in Table R10, our ToBo surpasses DINOv2, I-JEPA, and V-JEPA on all tasks in Franka Kitchen. Moreover, ToBo also outperforms DINOv2 on all tasks in Adroit. These results further support the effectiveness of ToBo. We will include the performance of DINOv2, I-JEPA, and V-JEPA on other benchmarks in the revised version.

[Table R10. Comparison of ToBo with DINOv2, I-JEPA, and V-JEPA on Franka Kitchen]

Method       | Knob1 on | Light on | Sdoor open | Ldoor open | Micro open
DINOv2       | 27.3     | 38.7     | 82.0       | 16.0       | 20.0
I-JEPA       | 26.0     | 40.0     | 78.7       | 21.3       | 17.3
V-JEPA       | 21.3     | 41.3     | 74.7       | 28.0       | 12.7
ToBo (ours)  | 46.7     | 78.7     | 95.3       | 47.3       | 37.3

[Table R11. Comparison of ToBo and DINOv2 on Adroit]

Method       | Pen  | Relocate
DINOv2       | 69.6 | 24.0
ToBo (ours)  | 81.6 | 39.2

W2. Scalability of our method when trained on more samples

Following the advice, we verify the scalability of ToBo when trained on more samples. To this end, we compare ToBo models pre-trained for 100 epochs on Kinetics-400 and on Kinetics-600, a larger dataset that includes Kinetics-400. Table R12 shows that the ToBo model pre-trained on Kinetics-600 surpasses the model pre-trained on Kinetics-400 in all tasks except "Sdoor Open", where both models perform comparably. These results validate the scalability of ToBo with more samples.

[Table R12. Scalability of our method when trained on more samples]

Pre-training data    | Knob1 on | Light on | Sdoor open | Ldoor open | Micro open
ToBo (Kinetics-400)  | 46.7     | 78.7     | 95.3       | 47.3       | 37.3
ToBo (Kinetics-600)  | 51.0     | 80.3     | 94.7       | 48.7       | 45.3

W3. Clarification on the smaller improvements in label propagation

The improvements in video label propagation are smaller because it requires dense, spatially distributed predictions, which do not directly align with our SSL objective. In contrast, the strong gains in robotics stem from the nature of our SSL objective, which enforces conservative summarization of observed scenes into a bottleneck token, a representation better suited for action prediction under partial observability.

The following are responses to the sub-questions:

Response to "Is the agent's policy architecture being more suited to the proposed SSL method?" The observed improvements in robotics tasks are not due to any special advantage in policy architecture. Our method directly adopts the policy network design and the evaluation protocol from prior works [3], without any modification specific to our SSL objective. Importantly, all compared SSL methods share the same policy architecture, ensuring a fair comparison. Therefore, the gains are not attributable to an architectural bias toward our method.

Response to "Is the agent policy architecture the same for all the robotics tasks?" Yes, the same policy network architecture is used across all robotics tasks. This uniform setup eliminates architectural variation as a source of performance gaps. Although the training mechanisms of models vary across SSL methods, all models share the same structure, which enables the use of a single policy design across the pre-trained models.

References

[1] He et al., "Masked Autoencoders are Scalable Vision Learners," CVPR 2022.

[2] Gupta et al., "Siamese Masked Autoencoders," NeurIPS 2023.

[3] Jang et al., "Visual Representation Learning with Stochastic Frame Prediction," ICML 2024.

[6] Kay et al., "The Kinetics Human Action Video Dataset," arXiv:1705.06950, 2017.

[11] Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," arXiv:2304.07193.

[12] Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture," ICCV 2023.

[13] Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video," arXiv:2404.08471.

Comment

I would like to thank the authors for their response. My concerns have been addressed.

Comment

Thank you for responding to the authors. Would you mind completing the "Mandatory Acknowledgement" for the review? Did the authors' response change your score?

-- Your AC

Comment

I would like to maintain my positive rating.

Comment

Thank you for the thoughtful feedback! We're glad that our response addressed your concerns. Please feel free to let us know if there are any remaining points we can further clarify!

Review
Rating: 4

The paper proposes a method for learning visual representations. The method is based on MAE and is called Token Bottleneck. The basic idea is to learn representations from a pair of images of the same scene (two different views / video frames). The method trains a model to compress the patches of the first image into a single bottleneck token and, from that token plus visible patches of the second image, predict the masked patches of the second image. The results are evaluated on several simulated robotic tasks and DAVIS label propagation.

Strengths and Weaknesses

Strengths:

  • I reviewed an earlier version of this paper for ICCV 2025. Compared to that version, the present submission is considerably improved. The writing, method name, and description are much clearer. Some of the missing baselines are included too. I appreciate the updates and thank the authors for incorporating the feedback from the previous review.
  • The method is simple, easy to understand, and shows promising signal in tested settings.

Weaknesses:

    1. Key design choices are not ablated, which makes it difficult to understand different aspects of the proposed method. Some examples:
    • a) How many patches to squeeze the information into? Single token vs. multiple? How does the tradeoff look?
    • b) How does the temporal difference impact the performance?
    • c) What would happen if there was no temporal difference between the frames? Would the task be trivial or still useful?
    • d) ...
    2. It would be good to include comparisons between different methods (e.g., vanilla MAE, SimSiam, etc.) in terms of FLOPs, both training-time FLOPs and inference FLOPs.
    3. Comparisons on standard benchmarks like ImageNet and Kinetics are missing. I understand that the main focus is on simulated robotic evaluation and tracking, but it would still be helpful to see how the learnt representations compare to standard approaches like MAE on standard benchmarks.
    4. It would be good to see straightforward baselines like running vanilla MAE across both views/images.

Questions

Please see the weaknesses above.

Limitations

Yes.

Final Justification

The author rebuttal addressed my questions and I updated the score accordingly.

Formatting Concerns

No formatting concerns.

Author Response

We sincerely thank Reviewer W34j for the thoughtful comments and constructive suggestions. Below, we provide detailed responses to each point.

W1. Ablation studies on key design choices

We agree that the factors you raise are important for understanding the behavior of ToBo. Thus, we conducted ablation studies on the suggested aspects. All ablation studies are conducted using models pre-trained for 100 epochs.

  • Ablation study on the number of bottleneck tokens

    • We conducted an ablation study on the number of bottleneck tokens, varying it among {1, 2, 4, 8}. As shown in Table R1, we observe that using a single token generally yields the best performance across tasks. These results demonstrate that conservative summarization without separate storage is beneficial for understanding the current observation, thereby improving action prediction in robotics.
  • Ablation study on the impact of temporal difference

    • We conducted additional experiments in which the maximum frame interval is varied among {48, 96, 144} (a small sampling sketch follows this list). As shown in Table R2, the models learn to capture dynamic scene evolution best when trained with moderate temporal differences: long enough to include meaningful changes, yet not so long that temporal coherence is broken.
  • What if there is no temporal difference between the frames?

    • Following the comment, we applied our method using the same frame for both the source and target scenes. As shown in Table R3, we found that our method still works even without a temporal difference, surpassing other baselines (e.g., MAE [1], SiamMAE [2], and RSP [3]) by significant margins across the tasks. However, its overall performance degrades compared to the original ToBo, since it loses the opportunity to learn how to capture dynamic evolution from consecutive scenes. This highlights the importance of temporal contrast for effective pre-training of ToBo.
  • What if multiple source frames are given?

    • We further compare ToBo with a model pre-trained using multiple source frames. Specifically, we randomly sample 4 frames as the source frames and pre-train the model for 100 epochs under the same training recipe as ToBo. As shown in Table R4, the multi-frame model surpasses other baselines (e.g., MAE [1], SiamMAE [2], and RSP [3]) in most of the tasks. However, despite requiring a higher pre-training cost, it underperforms ToBo across all robotics tasks. These results suggest that while it is possible to extend ToBo to multi-frame settings, such a naive extension may encounter new challenges, leading to suboptimal performance.
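As referenced in the temporal-difference item above, here is a small sketch of the frame-pair sampling we have in mind; the exact sampling rule used by the authors is an assumption on our part.

```python
import random

def sample_frame_pair(video_len, max_gap=96):
    """Pick a source/target frame pair at most `max_gap` frames apart.
    max_gap corresponds to the maximum temporal difference ablated in Table R2
    ({48, 96, 144}); max_gap = 0 recovers the no-temporal-difference setting of Table R3."""
    src = random.randrange(video_len)
    gap = random.randint(0, max_gap)
    tgt = min(src + gap, video_len - 1)  # clamp to the clip length
    return src, tgt
```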

[Table R1. Ablation study on the number of bottleneck tokens]

Number of bottleneck tokens | Knob1 on | Light on | Sdoor open | Ldoor open | Micro open
1                           | 46.7     | 78.7     | 95.3       | 47.3       | 37.3
2                           | 31.0     | 54.0     | 74.0       | 26.0       | 24.0
4                           | 28.0     | 24.3     | 78.0       | 28.0       | 22.0
8                           | 10.0     | 20.0     | 56.0       | 26.0       | 9.3

[Table R2. Ablation study on the impact of temporal difference]

Maximum temporal difference | Knob1 on | Light on | Sdoor open | Ldoor open | Micro open
48                          | 40.7     | 78.7     | 96.0       | 44.0       | 35.3
96                          | 46.7     | 78.7     | 95.3       | 47.3       | 37.3
144                         | 36.0     | 69.3     | 97.3       | 46.7       | 39.3

[Table R3. Analysis on the importance of temporal difference]

Knob1 onLight onSdoor openLdoor openMicro open
MAE18.721.370.017.315.3
SiamMAE18.034.080.718.719.3
RSP24.751.787.323.326.7
ToBo (no temporal difference)41.072.089.332.732.0
ToBo (with temporal difference)46.778.795.347.337.3

[Table R4. Analysis on the multiple source frame]

Knob1 onLight onSdoor openLdoor openMicro open
MAE18.721.370.017.315.3
SiamMAE18.034.080.718.719.3
RSP24.751.787.323.326.7
ToBo (multi-frame)28.760.792.720.732.0
ToBo46.778.795.347.337.3

W2. Comparison of training and inference FLOPs

We conducted FLOPs evaluation for both training and inference to quantitatively compare the computational cost of each model, as summarized in Table R5.

  • During inference, all models use the same backbone architecture and input resolution without any input masking, resulting in identical inference FLOPs at the same model scale (e.g., 4.6 GFLOPs for ViT-Small).
  • During training, ToBo, MAE [1], and SiamMAE [2] show similar computational costs, while RSP [3] requires substantially more computation (32.5 GFLOPs) due to its complex decoding mechanism. Considering computational cost together with downstream performance (e.g., on Franka Kitchen [4]), these results further support the effectiveness of ToBo, which strikes a strong balance between efficiency and performance. A rough sketch of how the inference FLOPs can be measured follows this list.
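A rough way to reproduce the inference-FLOPs column, assuming a standard ViT-Small backbone from timm and the fvcore FLOP counter; the authors' exact measurement setup is not specified.

```python
import torch
import timm
from fvcore.nn import FlopCountAnalysis

# All compared methods share the same backbone and a 224x224 input at inference,
# so the inference cost is identical at a given model scale.
model = timm.create_model("vit_small_patch16_224", pretrained=False)
flops = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total()
print(f"{flops / 1e9:.1f} GFLOPs")  # roughly 4.6 GFLOPs for ViT-Small, matching Table R5
```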

[Table R5. Comparison of training and inference FLOPs and downstream performance in Franka Kitchen ]

Method  | Training FLOPs (GFLOPs) | Inference FLOPs (GFLOPs) | Franka Kitchen (%)
MAE     | 13.0                    | 4.6                      | 26.1
SiamMAE | 13.1                    | 4.6                      | 30.4
RSP     | 32.5                    | 4.6                      | 43.4
ToBo    | 15.9                    | 4.6                      | 68.1

W3. Comparison of ToBo and the baselines on Kinetics-400

Following the comment, we compare the action classification accuracy of ToBo and the baselines (i.e., MAE [1], SiamMAE [2], and RSP [3]) on the Kinetics-400 dataset [6]. To this end, we fine-tune all models under a reduced setting of 40 epochs due to the limited rebuttal period. As shown in Table R6, while the performance gaps are relatively smaller than those observed on our main downstream tasks, ToBo achieves the highest accuracy among all models, demonstrating its effectiveness on this standard video classification benchmark.

[Table R6. Comparison of ToBo and other baselines in action recognition in Kinetics-400]

Method  | Kinetics-400 (%)
MAE     | 39.9
SiamMAE | 38.1
RSP     | 38.2
ToBo    | 40.8

W4. Comparison with straightforward baselines

We apologize for the confusion. We would like to clarify that MAE [1] in the tables of the main paper is indeed a straightforward baseline that performs masked image modeling independently on both input views. We will improve the description of the baselines.

References

[1] He et al., "Masked Autoencoders are Scalable Vision Learners," CVPR 2022.

[2] Gupta et al., "Siamese Masked Autoencoders," NeurIPS 2023.

[3] Jang et al., "Visual Representation Learning with Stochastic Frame Prediction," ICML 2024.

[4] Gupta et al., "Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning," CoRL 2019.

[5] Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," IJCV 2015.

[6] Kay et al., "The Kinetics Human Action Video Dataset," arXiv:1705.06950, 2017.

Comment

Dear Reviewer W34j,

The authors have provided extensive ablations on design choices, FLOPs, and other datasets. As the author-reviewer discussion is coming to an end, we appreciate your participation. Did the authors answer your concerns? Does it change your score?

-- Your AC

Comment

I thank the authors for the response. The rebuttal addressed my questions.

Comment

Thank you again for your constructive comments! We're glad that your questions have been addressed. Please feel free to let us know if there are any remaining points we can further clarify!

Review
Rating: 4

This paper investigates how to use the dynamics information in video to pre-train an image encoder, extending MAE. The proposed ToBo (Token Bottleneck) has a unique pretext task of compressing an image into a single token, then decoding a later frame based on the bottleneck token and sparse target-frame patches. The method is mainly evaluated on video label propagation and robotics tasks, yielding especially large improvements in robotic manipulation.

Strengths and Weaknesses

Strengths:

  1. The proposed approach is simple but effective. The pretext task intuitively makes sense and is grounded in previous literature.
  2. The improvements, especially on robotics tasks, are impressive.
  3. Robotics experiments include error bars, which is informative.

Weaknesses:

  1. The variance of the proposed approach seems especially large compared to the baselines, particularly in Table 3. Could the authors provide some insight into why that is the case?
  2. More careful discussion of the limitation should be included in the final paper.
  3. Since we are learning representation from video, is it possible to take the VideoMAE approach to pretrain a video model, then somehow adapt it to the robotics use case?

Questions

Please see the weakness section. To summarize, I have questions regarding the variance of performances, more careful discussion of limitations, and direct video pretraining. I am open to increasing the score if these items are clarified.

Limitations

The limitation discussion is too short, probably due to paper length limitation. But it should be revisited more carefully.

Final Justification

I believe this paper makes good progress on representation learning and it is meaningful to the community. My final rating is borderline accept.

Formatting Concerns

None.

Author Response

We sincerely thank Reviewer GhiU for the thoughtful comments and constructive suggestions. Below, we provide detailed responses to each point.

W1. Clarification on concerns regarding variance

  • While addressing your comment, we found that the variance of ToBo in the DMC benchmark [7] was incorrectly reported as 8.6, when it should have been 0.86. We will correct this error (8.6 --> 0.9) in Table 2 of the main paper, as reflected in Table R7.
  • ToBo may exhibit relatively high variance on RLBench [10]. However, we believe this reflects a broader trend in which model variance differs significantly depending on the evaluation benchmark. In fact, on other benchmarks such as CortexBench [8] (Table R7) and Franka Kitchen [4] (Table R8), ToBo shows variance that is lower than or comparable to other methods, except on "Sdoor Open". Moreover, even on the "Sdoor Open" task, ToBo records a variance of 7.1, which is lower than that of SiamMAE (7.9) and CropMAE (8.1). These results suggest that variance is task-dependent, and ToBo maintains stable performance across several benchmarks despite relatively higher sensitivity in certain RLBench scenarios.

[Table R7. Experimental results on vision-based robot policy learning on CortexBench.]

Task      | SimCLR   | MoCo v3  | DINO     | MAE      | SiamMAE  | RSP      | CropMAE  | ToBo
Adroit    | 40.4±3.3 | 39.6±4.3 | 45.6±6.2 | 39.6±4.3 | 44.0±6.6 | 45.6±4.6 | 50.0±5.1 | 60.4±2.2
MetaWorld | 78.4±5.2 | 65.4±8.0 | 82.4±5.8 | 65.4±8.0 | 81.1±6.3 | 84.5±6.6 | 82.4±5.8 | 87.8±4.6
DMC       | 39.7±2.9 | 43.7±3.2 | 50.9±1.5 | 43.7±3.2 | 56.0±2.9 | 61.6±3.4 | 46.4±1.1 | 73.5±0.9
TriFinger | 63.3±3.3 | 53.3±1.6 | 64.2±3.5 | 53.3±1.6 | 52.1±7.6 | 66.2±0.8 | 46.3±1.7 | 66.5±1.0

[Table R8. Experimental results on vision-based robot policy learning on Franka Kitchen.]

Task       | SimCLR   | MoCo v3  | DINO     | MAE      | SiamMAE  | RSP      | CropMAE   | ToBo
Knob1 on   | 25.3±2.1 | 11.5±3.9 | 27.0±3.2 | 12.0±3.3 | 16.8±4.4 | 31.0±2.4 | 31.5±5.3  | 57.3±2.3
Light on   | 55.8±6.4 | 24.3±5.0 | 44.3±6.5 | 24.3±4.2 | 36.5±7.0 | 44.5±5.6 | 54.0±11.2 | 82.0±1.6
Sdoor open | 72.3±2.8 | 66.5±3.2 | 77.0±5.0 | 71.5±4.3 | 68.0±7.9 | 82.5±2.7 | 77.0±8.1  | 95.0±7.1
Ldoor open | 17.0±2.9 | 10.3±2.1 | 16.5±2.5 | 12.8±3.9 | 17.3±3.7 | 28.8±4.8 | 25.5±5.7  | 51.0±1.4
Micro open | 23.3±2.8 | 14.3±2.5 | 28.5±4.8 | 10.0±2.8 | 13.5±4.8 | 30.3±5.6 | 32.5±4.1  | 55.0±1.4

W2. Expanded discussion of limitations

Thank you for the advice. We have improved the discussion in the limitations section: due to resource constraints, we did not examine the scalability of our method beyond the huge model scale, nor did we explore input resolutions beyond the commonly used 224x224. Additionally, our study focused on the simplest setting, which uses two dynamic scenes to learn temporal dynamics. Extending our pipeline to a multi-frame setting would be an interesting direction for future work, addressing the new challenges that emerge in that setting.

W3. Extension of ToBo to a multi-frame approach

Following the comment, we compare ToBo with a model pre-trained using multiple source frames. Specifically, we randomly sample 4 frames as the source frames and pre-train the model for 100 epochs under the same training recipe as ToBo. As shown in Table R9, the multi-frame model surpasses other baselines (e.g., MAE [1], SiamMAE [2], and RSP [3]) in most of the tasks. However, despite requiring a higher pre-training cost, it underperforms ToBo across all robotics tasks. These results suggest that while it is possible to extend ToBo to multi-frame settings, such a naive extension may encounter new challenges, leading to suboptimal performance.

[Table R9. Comparison with the naive extension of ToBo to multi-frame-based setting]

Knob1 onLight onSdoor openLdoor openMicro open
MAE18.721.370.017.315.3
SiamMAE18.034.080.718.719.3
RSP24.751.787.323.326.7
ToBo (multi-frame)28.760.792.720.732.0
ToBo46.778.795.347.337.3

References

[1] He et al., "Masked Autoencoders are Scalable Vision Learners," CVPR 2022.

[2] Gupta et al., "Siamese Masked Autoencoders," NeurIPS 2023.

[3] Jang et al., "Visual Representation Learning with Stochastic Frame Prediction," ICML 2024.

[4] Gupta et al., "Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning," CoRL 2019.

[7] Tassa et al., "dm_control: Software and Tasks for Continuous Control," arXiv:2006.12983, 2020.

[8] Majumdar et al., "Where Are We in the Search for an Artificial Visual Cortex for Embodied Intelligence?," NeurIPS 2023.

[9] Eymaël et al., "Efficient Image Pre-training with Siamese Cropped Masked Autoencoders," ECCV 2025.

[10] James et al., "RLBench: The Robot Learning Benchmark & Learning Environment," IEEE Robotics and Automation Letters, 2020.

Comment

Thank you again for the insightful comments! We're glad that your questions have been resolved. Please feel free to let us know if there are any remaining points we can further clarify!

Comment

Thanks for the authors' response! My questions are resolved. I maintain the previous positive rating.

Final Decision

The paper introduces a Token Bottleneck approach that learns a self-supervised representation of a sequential scene by squeezing it into a bottleneck token. The model is trained to reconstruct a scene solely from the bottleneck token, together with a few target "hints" (patches of the target scene). The usefulness of the approach is demonstrated on real-world robot manipulation tasks.

The reviewers praise that the approach is simple but effective, particularly performing well on the robotics tasks. During the rebuttal, the authors also extended the ToBo method to a multi-frame approach, which added a valuable comparison. The authors also provided extensive comparisons and ablations to other existing self-supervised methods and demonstrated that token bottleneck remains superior.

The recommendation is Accept.