Rethinking Latent Redundancy in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation
Integration of information theory with behavior cloning in robot manipulation
Abstract
Reviews and Discussion
This paper presents a method for incorporating the principle of the information bottleneck (IB) into behavior cloning. They accomplish this by adding a mutual information objective to the loss function alongside the standard action regression term. This extra objective works to reduce overfitting in low-data regimes.
Questions for the Authors
In Figure 2(c), why does only the [CLS] embedding go into the policy head?
I think the jump from I(Z;A) to the regression loss in equation 6 could be better explained. I realize they maximize the same thing and that such a substitution is common in IB, but it would still be helpful to acknowledge it for those unfamiliar.
Claims and Evidence
The paper claims that reducing latent redundancy via the IB principle significantly enhances generalization in BC for robot manipulation. This claim is supported by solid theoretical analysis, including generalization bounds, and by experiments across multiple benchmarks. The reported improvements in success rates and the accompanying ablation studies back this up. Additional experiments on out-of-distribution generalization would be informative.
Methods and Evaluation Criteria
The proposed approach of integrating an IB regularization term into the standard BC loss using Mutual Information Neural Estimation (MINE) is simple, yet novel for this particular application.
The choice of benchmarks (CortexBench for single-task and LIBERO for multi-task, language-conditioned settings) is well justified. Although LIBERO does not actually test language generalization (the language input is merely a task indicator), this benchmark is still useful for showing that the IB objective doesn't prevent the network from scaling to multi-task settings.
Theoretical Claims
I did not check the correctness of the proofs in the appendix.
Experimental Design and Analysis
The paper compares several BC models (R3M, Voltron, VC-1, MPI, and more) with and without the IB enhancement, and it evaluates both spatial and temporal fusion methods. The set of baselines and tasks was comprehensive, and the results show a clear benefit to using IB across all model types. Additionally, detailed analyses on the effects of different β values and the performance under few-shot conditions further strengthen the study.
The real-world experiments are limited to a few rollouts on very simple tasks, with no statistical analysis.
Supplementary Material
I reviewed the real-world experiments and additional experimental results.
Relation to Prior Work
The Information Bottleneck (IB) principle is a well-established concept in neural networks, supported by extensive theoretical foundations. This paper adapts the IB principle to behavior cloning, a field where data scarcity is common, making it a particularly promising candidate for regularization using an IB loss term.
Missing Essential References
N/A
Other Strengths and Weaknesses
Strengths:
- Combines strong theoretical analysis from the IB literature with extensive empirical validation.
- Demonstrates how IB can help with in-distribution generalization in data-scarce regimes like behavior cloning for robot manipulation.
- Uses a diverse set of benchmarks and ablation studies to support the claims.
Weaknesses:
- Limited novelty beyond adding a loss term to BC, and the benefits are marginal in most cases.
- No testing of out-of-distribution generalization.
- Limited/unconvincing real-world evaluations.
Other Comments or Suggestions
N/A
Please open the link to the anonymous material for easy reference.
Comment 1 on Novelty: Limited novelty beyond adding a loss term to BC.
Our contributions go beyond simply adding a loss term to BC. Specifically, we highlight the following:
- Method Novelty
(1) As shown in Figure 1 (a) of the anonymous material, traditional IB methods [1-3] typically apply the bottleneck to a single modality (e.g., images) and its latent features, with the information flow being O→Z→A. (2) In contrast, BC for robot manipulation is more complex than earlier control or single-modal tasks, requiring models to handle diverse, multimodal data and inevitably necessitating feature fusion. Strictly following the previous IB paradigm (as shown in Figure 2 of the anonymous material) would require constraining each input modality and its features individually, resulting in a pipeline that is highly complex, lacks elegance and scalability, and cannot capture cross-modal associations. (3) Thus, we propose redefining the "input" at the representation level and treating the features before the policy head as latent features, with the information flow being O→X→Z→A (a minimal code sketch is provided after this list). The advantages of this approach are: 1) It allows us to apply redundancy reduction in a unified way across all modalities. 2) In practice, encoders are often frozen, making this approach more effective in such cases. 3) It scales better to a wider range of robotic algorithms.
- Theoretical Novelty
As shown in Figure 1 of the anonymous material, we adapt Theorems 4.1 and 4.2 to support the robot task with the information flow O→Z→A. To demonstrate that our paradigm remains compatible with these theorems, we present Theorem 4.3: even if we optimize I(X;Z) by applying the bottleneck at an intermediate feature level X, as long as X preserves the essential structure of the original input O, we are effectively controlling I(O;Z), with the difference bounded by a small constant δ.
- To support our claims, we conduct extensive experiments across various models and settings.
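To make the O→X→Z→A paradigm above concrete, a minimal illustrative sketch of the training loss is given below. The module names (fusion, policy_head, mi_estimate) and the value of beta are placeholders rather than our actual code, and mi_estimate stands in for the MINE-based estimator of I(X;Z).

```python
import torch
import torch.nn as nn

def ib_bc_loss(obs_tokens, lang_tokens, actions,
               fusion, policy_head, mi_estimate, beta=1e-4):
    """Sketch of the IB-regularized BC loss applied at the fused-feature level."""
    # X: multimodal features from the (possibly frozen) encoders, concatenated
    x = torch.cat([obs_tokens, lang_tokens], dim=1)        # (B, T, D)
    # Z: latent features right before the policy head
    z = fusion(x)                                           # (B, D)
    # Standard BC regression term (a proxy for maximizing I(Z; A))
    bc_loss = nn.functional.mse_loss(policy_head(z), actions)
    # Redundancy penalty: estimated I(X; Z), applied once to the fused features
    # rather than separately to each input modality
    mi_penalty = mi_estimate(x.flatten(1), z)
    return bc_loss + beta * mi_penalty
```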
Comment 2 on Experiment Performance: The improvements are marginal in most cases.
The marginal improvements are not due to limitations of the IB theory, but rather due to:
- Model capacity and task complexity: Our models, such as BC-ViLT+IB, have limited capacity (e.g., 10M parameters). They perform well on simpler tasks like LIBERO-Goal (80% success, 8% improvement), but show limited gains (2%) on more complex tasks like LIBERO-Long, due to the performance ceiling of the baselines. We illustrate this issue with Diffusion Policy experiments on LIBERO-Long, where the MLP head (1.14M parameters) of BC-Transformer is replaced with a DP head (90M parameters).
| Method | LIBERO-Long |
|---|---|
| DP | 78.0 |
| DP+IB | 84.0 |
- Dataset and task complexity: In benchmarks like CortexBench, particularly Trifinger (Figure 7 in the appendix), the visual input contains only the robot arm and objects, with minimal redundant information, so applying IB yields minimal gains. In contrast, LIBERO environments feature more distractors, complex backgrounds, and diverse tasks, where IB leads to more noticeable improvements.
Comment 3 on Experiments: No testing of out-of-distribution generalization.
- Object position. We modify the positions of manipulated objects in MetaWorld from CortexBench and perform 25 rollouts during testing.
| Method | Assembly | Bin-Picking | Button-Press | Drawer-Open | Hammer | Avg |
|---|---|---|---|---|---|---|
| VC1 | 65.33±6.11 | 62.67±8.33 | 66.67±2.31 | 100.00±0.00 | 94.67±6.11 | 77.87 |
| VC1+IB | 69.33±12.86 | 76.00±4.00 | 70.67±8.33 | 100.00±0.00 | 96.00±4.00 | 82.40 |
- Generalization on unseen tasks with more distractors. Refer to Comment 4.
Comment 4 on Experiments: Limited/unconvincing real-world evaluations.
Due to the word limit, please refer to our response to Comment 8 from reviewer xKwq. The videos are in the anonymous material.
Comment 5 on Method: In Figure 2(c), why does only the [CLS] embedding go into the policy head?
We follow the common practice in transformer-based architectures, where the [CLS] token serves as a global summary of the input sequence, as seen in prior works [4-6]. The [CLS] embedding aggregates information through attention layers, making it ideal for downstream tasks. It is then passed to the policy head to represent the fused multimodal information. This design is computationally efficient and empirically effective, as shown in prior vision-and-language modeling works.
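As a minimal illustration of this design (layer sizes and names here are illustrative, not taken from our implementation), the policy head simply reads off the [CLS] position of the fused token sequence:

```python
import torch
import torch.nn as nn

class CLSPolicyHead(nn.Module):
    """Illustrative [CLS]-pooling policy head.

    Assumes the [CLS] token was prepended at position 0 before the
    transformer fusion layers, as in standard ViLT-style architectures.
    """
    def __init__(self, dim: int, action_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, action_dim))

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        cls = fused_tokens[:, 0]   # (B, D): global summary of all modalities
        return self.mlp(cls)       # (B, action_dim): predicted action
```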
Comment 6 on Writing: Explaining the transition from I(Z;A) to the regression loss in Equation (6) more clearly would improve the presentation.
We will revise the manuscript to explain this connection more clearly.
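Concretely, the argument is the standard variational bound used throughout the IB literature; we sketch it here, where $q(a\mid z)$ denotes the policy head's predictive distribution and $\pi_\theta(z)$ its mean (the exact form will be spelled out in the revision):

$$
I(Z;A) \;=\; H(A) - H(A\mid Z) \;\ge\; H(A) + \mathbb{E}_{p(z,a)}\big[\log q(a\mid z)\big],
$$

where the gap is the expected KL divergence $\mathrm{KL}\big(p(a\mid z)\,\Vert\, q(a\mid z)\big)\ge 0$. Since $H(A)$ does not depend on the model, maximizing $I(Z;A)$ reduces to maximizing $\mathbb{E}[\log q(a\mid z)]$; choosing a Gaussian decoder $q(a\mid z)=\mathcal{N}\!\big(a;\pi_\theta(z),\sigma^2 I\big)$ turns this, up to additive and multiplicative constants, into minimizing the squared error $\lVert a-\pi_\theta(z)\rVert^2$, i.e., the regression loss in Equation (6).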
Thank you for your time and effort in reviewing our work. We will revise the paper based on the rebuttal to better present our work.
References
Please refer to Reviewer a6Qr's reference list.
This paper proposes an information bottleneck approach to learn policies via BC for robot manipulation. To achieve this, the paper uses the Mutual Information Neural Estimation (MINE) objective to add an additional mutual-information-based regularization loss during training. Theoretical results are presented, but these are mostly derived from prior works. Experimental results are presented in two domains and show that adding the regularization loss slightly improves performance on simulation-based benchmark tasks.
Questions for the Authors
Claims and Evidence
I do not think all claims are properly evaluated. First, as can be seen in the title, the paper talks about "Rethinking Latent Representations," but I am a bit lost on how we are doing any rethinking. The paper for the most part simply applies the information bottleneck objective during BC. There are some slight performance improvements, and the paper claims that this shows that compressing the latent space is essential. However, it is hard for me to agree with this claim. Slight improvements in success rate do not really show that the representations are more compact and more robust. Most of the tasks used in the paper have very little environment variation, hence a lower-dimensional representation (or additional regularization) would probably be sufficient to complete the task and boost the success rate, but this does not really tell us much about the redundancy in the inputs or representations. This can also be seen in Table 1: if we remove the MetaWorld tasks, the improvements on all the other tasks are very small, often < 1%. The variance of the base results is already pretty high. So I am quite unsure about the very strong claims being made in the paper.
I am also a bit unsure about the multi-task results. The LIBERO benchmark, while useful, is not really a challenging multi-task benchmark, since it has little task-instruction diversity. But robust representations and the information bottleneck are both more interesting in multi-task settings, since the fusion of representations (across vision and language) would result in more complex and interesting representations.
Methods and Evaluation Criteria
The method and evaluations make sense. However, the evaluations are rather limited, and it's hard to conclude some of the things that the paper concludes (please see above).
Theoretical Claims
The theoretical claims are almost all derived from previous works. There isn't much novelty in them besides applying them to robotics, and since they have already been applied to supervised learning problems, it isn't clear what is new here.
Experimental Design and Analysis
I looked through and tried to carefully analyze both the single-task (CortexBench) and multi-task (LIBERO) results.
Supplementary Material
No.
Relation to Prior Work
The paper applies the well-known information bottleneck principle to behavior cloning for robot manipulation. In terms of novelty and contribution, this is a straightforward extension of many prior works, just in a different domain (robot manipulation). There are many other papers that have focused on using the information bottleneck for many different design choices in control tasks. These papers often focus on dm_control tasks, which are slightly different from robot manipulation benchmarks, but the principles are the same and they often discover similar things (e.g., applying IB in some interesting way can improve robustness/transfer/performance, etc.). However, the lack of novelty could be alleviated with more extensive experiments and careful analysis. Unfortunately, the paper does not really do this; it simply uses the success-rate criterion to show that IB uniformly improves representations (please see above).
[1] MeMo: Meaningful, Modular Controllers via Information Bottlenecks. [2] InfoBot: Transfer and Exploration via the Information Bottleneck. [3] Learning Task-Driven Control Policies via Information Bottlenecks.
Missing Essential References
No.
Other Strengths and Weaknesses
The paper is well written and easy to follow.
Other Comments or Suggestions
Please open the link to the anonymous material for easy reference.
Comment 1 on Novelty of Method: The paper simply applies IB to BC, and other works have already done so [1-3].
- Differences with [1-3]
(1) Papers [1] and [2] differ significantly from our idea of reducing mutual information to eliminate redundancy. Paper [1] loosely references the IB concept but lacks theoretical grounding, simply adding noise to control signals to shift reliance toward local perception. Paper [2] uses IB to identify critical states based on goal dependence, aiming to improve exploration rather than reduce redundancy.
(2) Paper [3] uses the IB to obtain task-relevant information for control, which is similar to our idea. We explain the differences in Point 2.
- Our contribution
(1) As shown in Figure 1 (a) of the anonymous material, traditional IB methods [3-6] typically apply the bottleneck to a single modality (e.g., images) and its latent features, with the information flow being O→Z→A. (2) In contrast, BC for robot manipulation is more complex than earlier control or single-modal tasks, requiring models to handle diverse, multimodal data and inevitably necessitating feature fusion. Strictly following the previous IB paradigm (as shown in Figure 2 of the anonymous material) would require constraining each input modality and its features individually, resulting in a pipeline that is highly complex, lacks elegance and scalability, and cannot capture cross-modal associations. (3) Thus, we propose redefining the "input" at the representation level and treating the features before the policy head as latent features, with the information flow being O→X→Z→A. The advantages of this approach are: 1) It allows us to apply redundancy reduction in a unified way across all modalities. 2) In practice, encoders are often frozen, making this approach more effective in such cases. 3) It scales better to a wider range of robotic algorithms.
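For reference, a minimal sketch of a MINE-style (Donsker-Varadhan) estimator of I(X;Z) is shown below; the network width and variable names are illustrative, and the bias-correction details in our actual implementation may differ. In the usual MINE-based setup, this scalar is maximized with respect to the statistics network to tighten the bound, while its β-scaled value is added to the BC loss as a penalty on the fused features.

```python
import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    """Minimal MINE-style statistics network T(x, z) for estimating I(X; Z)."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.T = nn.Sequential(nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        # Joint samples (x_i, z_i) vs. shuffled pairs (x_i, z_perm(i)),
        # which approximate draws from the product of marginals p(x)p(z).
        t_joint = self.T(torch.cat([x, z], dim=-1)).squeeze(-1)
        t_marg = self.T(torch.cat([x, z[torch.randperm(b)]], dim=-1)).squeeze(-1)
        # Donsker-Varadhan lower bound:  E_p(x,z)[T] - log E_p(x)p(z)[exp(T)]
        return t_joint.mean() - (torch.logsumexp(t_marg, dim=0) - math.log(b))
```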
Comment 2 on Novelty of Theory: Simply applying theoretical claims to robotics does not constitute much novelty.
As shown in Figure 1 of the anonymous material, we adapt Theorems 4.1 and 4.2 to support the robot task with the information flow O→Z→A. To demonstrate that our paradigm remains compatible with these theorems, we present Theorem 4.3: even if we optimize I(X;Z) by applying the bottleneck at an intermediate feature level X, as long as X preserves the essential structure of the original input O, we are effectively controlling I(O;Z), with the difference bounded by a small constant δ.
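In symbols, the argument can be condensed as follows (the precise conditions and the constant $\delta$ are stated in Theorem 4.3): because $Z$ is computed from $O$ only through $X$, the Markov chain $O \to X \to Z$ yields

$$
I(O;Z) \;\le\; I(X;Z), \qquad I(X;Z) - I(O;Z) \;\le\; \delta,
$$

where the first inequality is the data-processing inequality and the second follows from the assumption that $X$ preserves the essential structure of $O$; hence minimizing $I(X;Z)$ controls $I(O;Z)$ up to $\delta$.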
Comment 3: "Rethinking Latent Representations" is confusing.
Thank you for the feedback. We agree that the original title "Rethinking Latent Representations" may be confusing, and we will revise it to "Rethinking Latent Redundancy" in the revised version.
Comment 4: Slight improvements in success rate. For example, in Table 1 (if we remove the MetaWorld tasks), the improvements in all the other tasks are very small, often less than 1%.
Due to the word limit, please refer to our response to Comment 2 from reviewer PrSW.
Comment 5: Slight improvements in success rate do not really show that the representations are more compact and more robust. And this paper does not really tell us much about the redundancy.
As discussed in Comment 4, Trifinger contains minimal redundancy, limiting IB’s impact (less than 1% gain). In contrast, LIBERO-Goal includes more clutter and complex backgrounds, where IB yields a larger improvement (~8%). This observation suggests that the representations in certain environments, especially those with more visual distractors and complex backgrounds, tend to contain higher redundancy, allowing IB to be more effective.
As shown in the attention map (Figure 3 in anonymous material), IB helps the model focus on task-relevant regions (e.g., the arm and target object) by suppressing attention to redundant background features—an effect less evident without redundancy reduction, highlighting IB’s distinct role beyond standard regularization.
Comment 6: LIBERO benchmark is not really a challenging multi-task benchmark since it has little task instruction diversity.
While LIBERO may have limited instruction diversity, it is a widely used benchmark in goal-conditioned manipulation and is sufficiently challenging for our study. The perceptual redundancy present in LIBERO allows IB to demonstrate clear benefits, as reflected in the significant performance gains observed in our experiments.
Furthermore, we conduct more challenging real-world experiments. Due to the word limit, please refer to our response to Comment 8 from reviewer xKwq. The videos are in the anonymous material.
Thank you for your time and effort in reviewing our work. We will revise the paper based on the rebuttal to better present our work.
References
Please refer to Reviewer a6Qr's reference list.
This manuscript explores latent representations in behavior cloning for robot manipulation by applying the Information Bottleneck principle to address redundancy in learned representations. It introduces an innovative perspective by applying mutual information to quantify and mitigate redundancy, supported by theoretical analysis and extensive empirical evaluation on CortexBench and LIBERO benchmarks.
Questions for the Authors
See above.
Basically, I think the authors could include more established baselines and treat the IB objective as a plug-in module, to see whether the proposed method can improve existing methods, rather than designing all the baselines themselves.
Claims and Evidence
The authors claim that incorporating the Information Bottleneck substantially improves Behavior Cloning (BC) performance across various settings and benchmarks. While some performance improvements are shown, this claim is overstated. In particular, the improvements on the LIBERO multi-task benchmarks appear inconsistent, and the results on the LIBERO-Long tasks are relatively marginal (e.g., only around a 2% improvement), casting doubt on the effectiveness and generality of the IB approach in complex multi-task settings.
The authors claim that the proposed IB approach achieves robustness and stable improvements across different scenarios and methods. This claim is problematic. The LIBERO benchmark results clearly reveal that performance gains are uneven, sometimes marginal.
Methods and Evaluation Criteria
The proposed methods and evaluation benchmarks are appropriate for addressing redundancy in latent representations. The adoption of the Information Bottleneck is justified, and the selected benchmarks align effectively. However, the manuscript lacks sufficient empirical justification for the claim that redundancy is the primary limitation affecting current behavior cloning approaches, and it doesn't thoroughly discuss why IB underperforms in complex, long-horizon tasks. The authors should strengthen the manuscript by clearly demonstrating the specific negative impacts of redundancy and providing a systematic sensitivity analysis of key hyperparameters.
Theoretical Claims
No explicit mathematical errors or mistakes in the proofs were found.
Experimental Design and Analysis
I don't know if I missed something, but I really wonder why the performance on LIBERO is that low; the numbers provided in the paper feel like they are from the last century. Current BC methods like OpenVLA or Diffusion Policy (trained from scratch) can achieve promising performance, and I think the authors should consider how to apply the IB theory to them. If the training split used here is different from theirs, the authors could consider using theirs instead.
Supplementary Material
Yes.
Relation to Prior Work
The manuscript places itself within the broader context of representation learning in BC for robotic manipulation, specifically emphasizing Information Bottleneck (IB) theory. The authors effectively situate their work within existing research.
Missing Essential References
The manuscript cites VC-1, R3M, and Voltron, yet fails to deeply contextualize IB in relation to recent advances in large-scale pretrained vision-language-action models like RT-2 and OpenX-Embodiment, which are increasingly central to robot imitation learning and generalization. The manuscript would benefit significantly by clearly differentiating IB's effectiveness from these existing approaches.
Brohan, A., et al. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control.
Other Strengths and Weaknesses
Please see above.
Other Comments or Suggestions
The Supplementary Material shows some real-world examples, which are nice to check. However, the cases are not complex enough to demonstrate the method. I recommend that the authors show more diverse examples.
Thanks for your detailed review of this paper! We hereby address your concerns as follows.
Comment 1 on Writing: The claimed improvements from incorporating IB are overstated.
Thanks for the feedback. We will emphasize the consistency of improvements rather than their magnitude in the revised manuscript.
Comment 2 on Experiment Performance: Why is the performance on LIBERO so low, and why does IB underperform in complex, long-horizon tasks?
The relatively low performance on LIBERO, especially on LIBERO-Long, is attributed to the following factors:
- Differences in data processing:
In OpenVLA, the image observations are saved at a resolution of 256×256 (instead of 128×128), with additional filtering (e.g., removing "no-op" (zero) actions and unsuccessful demos). We use the raw LIBERO data with lower-resolution images and without filtering.
- Differences in model capacity:
OpenVLA has around 7B parameters, and diffusion policies may have around 100M parameters. Our models such as BC-ViLT+IB contain only around 10M parameters.
As a result, our models are less effective in complex, long-horizon tasks like LIBERO-Long.
Comment 3 on Method: The performance gains on LIBERO are uneven, raising concerns about the effectiveness of the approach.
This limitation arises not from the IB theory itself, but from the mismatch between model capacity and task complexity. Our lightweight baseline models perform well on simpler tasks like LIBERO-Goal, but their limited capacity constrains performance on more complex tasks like LIBERO-Long. As a result, the potential gain from incorporating IB is limited. We illustrate the issue with experiments on LIBERO-Long in Comment 5. (Due to the word limit, please refer to our response to Comment 2 from reviewer PrSW for further discussion.)
Comment 4 on Essential References: This paper does not adequately contextualize IB in relation to recent advances in large-scale pretrained vision-language-action models like RT-2[1] and OpenX-Embodiment.
First, we clarify that VLAs like RT-2[1] and OpenVLA[2] are mentioned in the introduction to highlight our motivation, and we also discuss them in the limitations section. Second, our work focuses on improving generalization within a given model architecture by optimizing the training paradigm through information compression, rather than designing a new large model. The IB method can also be incorporated into VLA-based models.
Comment 5 on Baselines: Include more established baselines rather than design baselines all by themselves.
We include two more baselines: OpenVLA [3] (4 A100 GPUs, 80K steps, batch size 32) and Diffusion Policy (the MLP head (1.14M parameters) of BC-Transformer is replaced with a DP head (90M parameters)). Each task is rolled out 20 times during testing.
| Method | LIBERO-Long |
|---|---|
| OpenVLA | 51.5 |
| OpenVLA+IB | 54.0 |
| DP | 78.0 |
| DP+IB | 84.0 |
Rationale for baselines: Most of our baselines are consistent with previous works such as [4,5]. Our custom baseline, BC-MLP, complements the experimental setup by introducing a spatial fusion approach, as the first three baselines rely on temporal fusion.
Comment 6 on Experiments: The manuscript lacks sufficient empirical evidence to support the claim that redundancy is the primary limitation.
The attention map in Figure 3 of the anonymous material shows that, without redundancy reduction, the model focuses on areas beyond the robotic arm and target object.
Comment 7 on Experiments: Providing a systematic sensitivity analysis of key hyperparameters.
The complete experimental hyperparameters are in Appendix Tables 3 to 5. All IB-related hyperparameters, except for the Lagrange multiplier β, remain consistent across experiments. The range of β is reported in Finding 8, with exact values provided in Appendix Tables 6 to 9.
Comment 8 on Experiments: More diverse real-world examples.
We conduct more challenging real-world experiments focusing on:
- Increased distractor objects and random positions.
- Training and evaluating a multi-task policy.
- Investigating generalization ability (*denotes unseen, which refers to instances not observed during the SFT stage).
We use 8 A100 GPUs to train both CogAct[3] and CogAct+IB for 8K steps on our newly collected dataset. The batch size is 128. Each task is tested 10 times. Videos are in anonymous material.
| Method | Put corn into blue bowl | Put corn into red bowl | Put corn into green bowl* | Put wood block into blue bowl | Put wood block into red bowl | Put wood block into green bowl* | Avg |
|---|---|---|---|---|---|---|---|
| CogAct | 80 | 60 | 20 | 100 | 60 | 40 | 60 |
| CogAct+IB | 100 | 80 | 60 | 80 | 60 | 40 | 70 |
Thank you again for your time and effort in reviewing our work. We will revise the paper based on the rebuttal to better present our work.
References
Please refer to Reviewer a6Qr's reference list.
03.04.2025 Update:
I clearly see the improvement and really appreciate the effort made to address the issues. I encourage the authors to include these results and restructure the paper to present the proposed module as a general plug-and-play module. I would like to raise my rating.
Thanks for your detailed rebuttal. I love seeing the additional experiments that you provided in such a short period!
Could you provide more details on how you modified OpenVLA and Diffusion Policy? Did you train them from scratch? DP also shows a higher success rate than I expected. I happen to have the environment to run trials. I would appreciate it if you could provide the code and the checkpoints in an anonymous link so that I can learn more. Since time is limited, you do not have to clean the code.
Thanks for your rebuttal time again!
04.04.2025 update: Thank you for your reply! We will carefully revise the main paper based on the content of the rebuttal.
Thanks for your reply. We address your concerns as follows.
Comment 1 on Baseline: Could you provide more details on how you modify OpenVLA and Diffusion Policy?
Yes, we can provide the code and checkpoints for reproduction. Each of the authors made significant contributions to the rebuttal, dedicating considerable time and effort to supplementing the experiments.
Uploading the checkpoints has been quite challenging due to the large file sizes, with OpenVLA (28GB) and DP.tar.gz (~2.5GB). Transferring them to the cloud is both time-consuming and unstable. However, we made a considerable effort and dedicated significant time to complete this for your review.
Experiments on OpenVLA
Our code regarding OpenVLA is modified directly in the source code. OpenVLA is fine-tuned (the base model is openvla-7b) on 4 A100 GPUs with 80K steps. The batch size is 32 (8*4). The learning rate is 5e-4. The setup for OpenVLA+IB is the same as for OpenVLA (except for IB-related hyperparameters). For the IB-related hyperparameters, the Lagrange multiplier β is 1e-4.
Anonymous OpenVLA Code: OpenVLA Code Link
Anonymous OpenVLA Checkpoints (click files and versions):
OpenVLA: OpenVLA Checkpoints Link
OpenVLA+IB: OpenVLA+IB Checkpoints Link
If you want to know more about OpenVLA, please refer to the following:
- OpenVLA train: (1) ./vla-scripts/finetune_libero.py (2) ./vla-scripts/finetune_libero_ib.py
- OpenVLA run:
  - train: bash fine-tune_libero.sh or bash fine-tune_libero_ib.sh
  - test: bash test_libero_1.sh
Experiments on DP
Our DP code is modified in our own codebase, as migrating DP to the OpenVLA codebase would have been too time-consuming. We referred to the diffusion policy head from CogAct and migrated it to our codebase. We train DP from scratch on a single A100 GPU with a batch size of 64, a learning rate of 1e-4, and 50 epochs. The setup for DP+IB is the same as for DP (except for IB-related hyperparameters). For the IB-related hyperparameters, the Lagrange multiplier β is 1e-5.
Anonymous DP Code: DP Code Link
Anonymous DP Checkpoints: DP Checkpoints Link
If you want to know more about DP, please refer to the following:
- DP config: (1) ./libero_exp/configs/bc_policy/dp.yaml (2) ./libero_exp/configs/bc_ib_policy/dp.yaml (3) ./libero_exp/configs/base/policy/dp.yaml
- DP model architecture: (1) ./libero_exp/models/bc_dp_policy.py
- DP training: (1) ./libero_exp/algos/bc_policy.py (2) ./libero_exp/algos/bc_ib_policy.py
- DP run:
  - train: bash libero_exp/scripts/main_libero_all.sh 'libero_10' 'bc_policy' 'dp' 0.9 or bash libero_exp/scripts/main_libero_all_2.sh 'libero_10' 'bc_ib_policy' 'dp' 0.9
  - test: bash libero_exp/scripts/eval_libero_all.sh False
If we have addressed your concerns, would you kindly consider giving us a higher rating? We would greatly appreciate it!
The manuscript investigates redundancy in representations for behavior cloning in robot manipulation and introduces the Information Bottleneck principle to mitigate this issue. By incorporating IB, the authors aim to filter out redundant information in latent representations while preserving task-relevant features.
Questions for the Authors
- There are a total of 9 Findings in the paper. Some of the findings are not particularly striking; it is recommended to refine them further and present the most significant conclusions as key findings. Others may be discussed as general experimental results, for example, Finding 9.
- "Finding 3: Finding 2 implicitly suggests that the latent representation Z derived from input X is redundant. Therefore, compressing information from input is essential, which can further enhance performance." Can it be understood that feature learning itself is crucial for action decoding, and that the redundancy of information between X and Z can be resolved through optimized feature learning?
Claims and Evidence
The manuscript is articulated clearly, with well-defined supporting evidence and clear conclusions.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria make sense for the problem. The theoretical analysis enhances the credibility of the algorithm.
Theoretical Claims
I found no problems with them.
Experimental Design and Analysis
The experimental validation is thorough, and the results effectively substantiate the experimental hypotheses.
Supplementary Material
The supporting materials, which include theoretical proofs of the theorem and supplementary experiments, serve as an excellent complement to the main text.
Relation to Prior Work
No.
Missing Essential References
No.
Other Strengths and Weaknesses
See the comments for the authors.
Other Comments or Suggestions
See the comments for the authors.
Thanks for your affirmative review of this paper! We address your questions as follows.
Comment 1 on Writing and Organization: Some of the less important findings (e.g., Finding 9) are unnecessarily highlighted.
We sincerely appreciate the reviewer’s feedback on refining the presentation of our findings.
Our intention in structuring the findings as they are was to provide a comprehensive and systematic analysis of both the impact and potential applications of the IB principle in BC. Regarding Finding 9: in robot manipulation, it is common practice to fine-tune a generalist model on domain-specific data to better adapt to new tasks. Our intention was to emphasize the practical value of our method in this context, showing that it can be effectively integrated into such fine-tuning processes commonly encountered in real-world scenarios.
That said, we acknowledge the reviewer’s concern and will revise the manuscript accordingly by highlighting only the most essential findings and integrating more general observations into the discussion section to improve focus and readability.
Comment 2 on Theoretical Clarification: Finding 2 implies that the latent representation Z derived from input X is redundant, making information compression essential for performance improvement. Can it be understood that feature learning is key for action decoding, and that redundancy between X and Z can be addressed through optimized feature learning?
- Yes, feature learning itself is crucial for action decoding.
Most existing approaches aim to learn a generalizable representation and typically focus on scaling up data and model capacity. For example, many recent methods leverage larger datasets to train larger models, such as vision-language-action (VLA) models [1]. Fundamentally, these methods aim to learn more generalizable features Z by ensuring that the input space X covers a sufficiently broad distribution in the mapping X→Z.
Others incorporate additional sources of information to guide the learning process—for instance, CoTPC [2] introduces richer textual context, and ATM [3] integrates visual information from robot trajectories to guide robotic actions. These methods essentially introduce additional information to guide or constrain the learning of Z.
In contrast, our approach takes a different perspective. Instead of adding more information, we focus on reducing redundancy in the input to learn more compact and task-relevant representations.
- Yes, the redundancy between the input X and the latent representation Z can be mitigated through optimized feature learning.
For instance, we use I(Z;X) to measure the amount of information transferred from X to Z, and aim to minimize I(Z;X) while ensuring that I(Z;A) increases, so that the information being discarded is indeed redundant. This approach is essentially based on the IB framework. It encourages the model to retain only task-relevant information while discarding irrelevant or redundant information, thereby reducing overfitting and improving generalization (as illustrated in Figure 3 of the anonymous material).
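Written out, the trade-off described above is the usual IB Lagrangian (notation as above; estimator details are simplified):

$$
\max_{\theta}\; I(Z;A) \;-\; \beta\, I(Z;X),
$$

which in practice is optimized as $\min_{\theta}\, \mathcal{L}_{\mathrm{BC}} + \beta\,\hat{I}(Z;X)$, where the BC regression loss stands in for $I(Z;A)$ via the standard variational lower bound and $\hat{I}(Z;X)$ is the MINE estimate.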
Finally, thank you again for your time and effort in reviewing our work. We will revise the paper based on the rebuttal to better present our work.
References
Reviewer a6Qr
[1] RT-2: Vision-language-action models transfer web knowledge to robotic control. [2] Chain-of-thought predictive control. [3] Any-point trajectory modeling for policy learning.
Reviewer xKwq
[1] RT-2: Vision-language-action models transfer web knowledge to robotic control. [2] OpenVLA: An open-source vision-language-action model. [3] CogAct: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. [4] SPA: 3D spatial-awareness enables effective embodied representation. [5] Learning manipulation by predicting interaction.
Reviewer H7Dt
[1] MeMo: Meaningful, modular controllers via information bottlenecks. [2] InfoBot: Transfer and exploration via the information bottleneck. [3] Learning task-driven control policies via information bottlenecks. [4] Learning representations for neural network-based classification using the information bottleneck principle. [5] Multi-view information-bottleneck representation learning. [6] Protecting your LLMs with information bottleneck.
Reviewer PrSW
[1] Learning representations for neural network-based classification using the information bottleneck principle. [2] Multi-view information-bottleneck representation learning. [3] Protecting your LLMs with information bottleneck. [4] CogAct: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. [5] CALAMARI: Contact-aware and language conditioned spatial action MApping for contact-RIch manipulation. [6] Predictive inverse dynamics models are scalable learners for robotic manipulation.
This paper introduces the Information Bottleneck (IB) principle to reduce redundancy in representations for behavior cloning in robotic manipulation. It received two Accepts, one Weak Accept, and one Weak Reject after the rebuttal. The initial main concerns raised by reviewers centered around the novelty of the approach, the limited performance improvements, and the narrow evaluation scope. While the IB principle itself is not new, its application to robotic manipulation constitutes a novel contribution. The authors provided additional explanations and results in the rebuttal, including stronger performance with more recent robot models and improved real-world experiments. Given the clarified contributions and strengthened empirical results, the AC recommends acceptance.