Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner
A scalable Bayesian Theory-of-Mind planner that breaks multimodal reasoning into stepwise belief updates and uses post-trained small LMs to guide larger ones, yielding a performance boost on ToM benchmarks.
Abstract
Reviews and Discussion
This paper primarily aims to tackle multimodal ToM, though in practice the main focus is still on complex multi-step reasoning tasks, with the multimodal aspect being somewhat secondary in the method.
They propose the "Weak-to-Strong Control" strategy, which modifies the probability distribution at the output layer of a small model to directly adjust the corresponding probability distribution of a large model. This allows them to enhance ToM reasoning capabilities without fine-tuning the large model, thereby reducing computational costs.
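In pseudocode terms, the idea is roughly the following; this is a minimal sketch of this style of output-layer steering (assuming the small and large LMs share a vocabulary; the function name and scaling factor `alpha` are illustrative, not from the paper):

```python
import torch

def weak_to_strong_logits(large_logits: torch.Tensor,
                          small_tuned_logits: torch.Tensor,
                          small_base_logits: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Shift the large LM's next-token logits by the change that ToM
    post-training induced in the small LM (all tensors: [vocab_size])."""
    # The difference isolates what post-training changed in the small model;
    # adding it nudges the large model's distribution in the same direction
    # while leaving its pretrained knowledge otherwise untouched.
    control = small_tuned_logits - small_base_logits
    return large_logits + alpha * control

# Toy usage with random logits standing in for real model outputs.
vocab_size = 32000
large = torch.randn(vocab_size)   # frozen large LM
tuned = torch.randn(vocab_size)   # small LM after ToM post-training
base = torch.randn(vocab_size)    # same small LM before post-training
probs = torch.softmax(weak_to_strong_logits(large, tuned, base), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```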
The experimental results support their claim.
Questions for Authors
As I discussed above.
Claims and Evidence
- The authors claim that their "Weak-to-Strong Control" strategy enhances ToM reasoning while reducing computational costs, but their experiments only compare 405B + 8B control to an untrained 405B model, not to a task-trained 405B. This means they only prove performance improves under cost constraints, but not that their method is the best way to enhance ToM reasoning if cost were not a concern.
- The authors demonstrate that their method achieves better generalization by outperforming direct inference in unseen environments such as outer space, ancient Egypt, fairy tales, the Wild West, and medieval Europe. However, these tasks remain structurally similar, primarily involving object-location-based reasoning. It remains unclear whether their approach would generalize equally well to other types of ToM tasks, such as inferring social relationships or tracking psychological shifts in negotiations, where reasoning dynamics differ significantly.
Methods and Evaluation Criteria
The proposed method is actually more similar to adapter tuning and LoRA-style approaches, which aim to achieve better performance by fine-tuning fewer parameters rather than performing full model fine-tuning. A more reasonable baseline for comparison would be these parameter-efficient fine-tuning methods. The paper itself feels somewhat inconclusive, as I don't see a strong connection between the proposed method and multimodal ToM.
Theoretical Claims
Theorem 1 is mathematically valid and justifies that Weak-to-Strong Control allows the large model to approximate a post-trained model without fine-tuning, ensuring minimal divergence through controlled adjustments.
Experimental Design and Analysis
The experiments show that 405B + 8B control improves performance compared to an untrained 405B model, but without comparing to a fine-tuned 405B or LoRA/adapter tuning baselines, it's unclear if this is the best approach for improving ToM reasoning. Also, an ablation study is needed to confirm whether the improvement comes from Weak-to-Strong Control itself or just adding a small model.
Supplementary Material
No supplementary material provided
Relation to Prior Literature
The paper builds on ToM reasoning and efficient fine-tuning methods by having a small model adjust a large model's outputs instead of fine-tuning it. However, without comparisons to LoRA, adapter tuning, or fine-tuned large models, it is hard to tell whether this is actually the best approach, and the multimodal angle feels more like an add-on than a core focus.
Missing Essential References
No
Other Strengths and Weaknesses
The paper is well-structured.
Other Comments or Suggestions
The generalization tests focus only on object-location-based tasks; it would be more convincing if other scenarios, such as social interactions or psychological reasoning, could be added. This might be hard, though, so it is fine if not.
We sincerely appreciate Reviewer Dtd9’s valuable comments and suggestions.
Q1: Comparison with fine-tuned large LMs (e.g., 70B/405B)
A1:
Table D: https://anonymous.4open.science/r/response_tom-BD87/tableDEFGH.md
Directly fine-tuning a 405B model is practically infeasible for most institutions due to extreme GPU requirements (~50–64 Nvidia H100 GPUs). Therefore, we approximated this scenario using the more manageable Llama-3.1 70B model. In our paper, Tables 2 & 3 demonstrate that the fully post-trained 70B LM performs worse than the proposed method (8B+70B, 4B-depth+70B, and 4B-width+70B). This suggests that our method not only reduces resource usage but also achieves superior generalization, aligning with recent literature showing generalization degradation after fine-tuning large LMs [1-3]. Results (Table D) on Llama-3.3 70B (widely considered equivalent to 3.1-405B by the community) further support this conclusion. (See also our detailed discussion in Q1-A1 to Reviewer DnoB.)
Q2: Comparison with parameter-efficient fine-tuning (PEFT) such as adapters or LoRA.
A2:
Table E: https://anonymous.4open.science/r/response_tom-BD87/tableDEFGH.md
The proposed weak-to-strong control is fully orthogonal and complementary to PEFT techniques, i.e., we can combine our method with any PEFT technique. In fact, our small LMs are trained with LoRA, as described in L191-right. Table E further confirms our method's consistent effectiveness regardless of the PEFT choice for the small LM.
Directly applying PEFT to large pretrained LMs performs worse than our method (Table E). As discussed above, our method avoids fine-tuning large LMs and thus preserves their pretrained mental/world knowledge [1-3], essential for generalization in multimodal ToM tasks.
Q3: If other scenarios such as social interactions or psychological reasoning can be added
A3:
Table A (same as Table F): https://anonymous.4open.science/r/response_tom-BD87/tableAB.md
We sincerely appreciate your recognition of the difficulty of testing nuanced social generalization. Inspired by your feedback, we expanded our evaluation using MuMA-ToM [4], a benchmark explicitly designed for multi-agent social interaction tasks involving belief inference, social goal inference, and belief-of-goal inference (Table A). Our model achieves results competitive with LIMP [4], the expensive GPT-4o-based SoTA method, demonstrating clear generalization across social-relationship tasks. (See also our detailed discussion in Q2-A2 to Reviewer dLAb.)
Q4: Connection / how the method addresses multimodal ToM challenges.
A4:
Multimodal ToM tasks uniquely require integrating implicit world knowledge and dynamic temporal mental-state reasoning, which significantly differ from standard STEM reasoning tasks. Our method addresses these challenges explicitly through two key innovations:
- Leveraging the vast implicit knowledge encoded in scaled-up pretrained LMs (e.g., 405B parameters), crucial for understanding complex social interactions and environmental contexts.
- Structuring dynamic belief inference explicitly using Bayesian Inverse Planning (BIP), a cognitive-science-based framework designed specifically for ToM (sketched schematically below).
This combination strategically bridges the gap between abstract multimodal contexts and nuanced mental-state reasoning, effectively resolving the multimodal complexity identified in our Figure 1.
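Schematically, the stepwise factorization that BIP exploits can be written as follows (generic BIP notation assumed here, with g a goal, b_t a belief, s_t a state, and a_t an action; the paper's exact equations may differ):

```latex
% Illustrative stepwise BIP factorization (not the paper's exact form):
P\big(g, b_{1:T} \mid s_{1:T}, a_{1:T}\big)
\;\propto\;
P(g)\, P(b_1 \mid s_1)
\prod_{t=1}^{T} P\big(a_t \mid s_t, b_t, g\big)
\prod_{t=2}^{T} P\big(b_t \mid b_{t-1}, s_t\big)
```

Because the posterior decomposes into one likelihood factor per observed step, a small post-trained LM only ever needs to score a local belief update, rather than reason over the entire trajectory at once.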
Q5: Ablations clearly isolating Weak-to-Strong Control's contribution.
A5:
Tables G & H: https://anonymous.4open.science/r/response_tom-BD87/tableDEFGH.md
To clarify its impact, we conducted two ablation studies:
- We compare with naïvely post-trained models. Results summarized in Table G (i.e., Tables 2, 3 & D) show consistent performance drops without weak-to-strong control. This demonstrates clearly that weak-to-strong guidance enhances the large LM's ability to generalize by effectively leveraging its pretrained world and mental-state knowledge.
- In Table H, when a small LM and a large LM are combined directly (by simply adding their logits) without our structured weak-to-strong adjustment, performance is lower than with our method. Thus, naïvely adding a small LM is suboptimal; explicit weak-to-strong control is essential for transferring the smaller LM's specialized ToM behavior while leveraging the large LM's pretrained knowledge (the two combination rules are contrasted in the snippet below).
Overall, these ablations validate that the weak-to-strong control mechanism itself, rather than the mere presence of a small LM, is responsible for our method's demonstrated gains in ToM grounding.
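For concreteness, here is a self-contained sketch of the two ablated combination rules (variable names are illustrative, not from our code):

```python
import torch

vocab_size = 32000
large_logits = torch.randn(vocab_size)        # pretrained large LM
small_tuned_logits = torch.randn(vocab_size)  # small LM after ToM post-training
small_base_logits = torch.randn(vocab_size)   # same small LM before post-training

# (a) Naive combination (the Table H baseline): the small LM's absolute
# preferences, including its weaker general knowledge, leak into the mix.
naive = large_logits + small_tuned_logits

# (b) Weak-to-strong control: only the shift induced by post-training is
# transferred, so the large LM's pretrained knowledge is preserved.
controlled = large_logits + (small_tuned_logits - small_base_logits)
```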
References
- [1] Overtrained Language Models Are Harder to Fine-Tune, arXiv:2503.19206
- [2] Understanding Catastrophic Forgetting in Language Models via Implicit Inference, ICLR 2024
- [3] Spurious Forgetting in Continual Learning of Language Models, ICLR 2025
- [4] MuMA-ToM: Multi-modal Multi-Agent Theory of Mind, AAAI-25 (Oral)
Thanks for the detailed response! I realise now that I had misunderstood a few things when I first read the paper. The additional explanations and experiments helped clear up my main concerns. I especially appreciate the added results — social interaction and psychological reasoning are particularly challenging, and I had initially assumed they were being avoided on purpose.
We sincerely thank Reviewer Dtd9 for your thoughtful reading, detailed consideration, and updated feedback. Your insightful comments have helped us clarify and strengthen our work—especially regarding challenging scenarios. We greatly appreciate your time and effort!
This paper addresses the scalability limitation of Theory-of-Mind (ToM) models in multimodal environments. Predicting agents' goals and beliefs in complex multimodal environments involving vision and language requires visual understanding, multi-step planning and reasoning, as well as extensive world knowledge. While LLMs on the order of hundreds of billions of parameters possess such capabilities, fine-tuning them for ToM tasks is expensive. To address this, the paper presents a weak-to-strong guidance framework where small LMs (on the order of 4–8B parameters) are fine-tuned on ToM tasks and guide a large LM to perform such tasks without further fine-tuning. This is achieved by modifying the LLM's predictions by the difference in predictions between the base small model and the same model after it was fine-tuned on ToM tasks. The authors conduct several experiments that show the efficacy of their approach on a plethora of models.
Questions for Authors
Nothing to add.
Claims and Evidence
The paper is very well written and the claims are supported by evidence.
Methods and Evaluation Criteria
The evaluation methods make sense.
Theoretical Claims
The proof of Theorem 1 seems correct to me.
Experimental Design and Analysis
The experimental design is sound.
Supplementary Material
I read the proof of Theorem 1 and it seems correct.
Relation to Prior Literature
The paper addresses an important issue in the area of ToM modeling and provides a sound solution.
Missing Essential References
Prior work is properly discussed.
Other Strengths and Weaknesses
Nothing to add.
Other Comments or Suggestions
Nothing to add.
We sincerely thank Reviewer RMFT for the positive and encouraging assessment, and for clearly recognizing the unique complexity in multimodal ToM we aim to address. The core philosophy of our weak-to-strong guidance BIP framework is precisely to leverage specialized small language models to efficiently guide large pretrained models, thus preserving extensive implicit world knowledge essential for multimodal ToM.
Your confirmation of our theoretical rigor, experimental soundness, and alignment with the broader literature greatly encourages us!
This paper proposes a scalable Bayesian Planner that employs small models for stepwise Bayesian updates, refining the likelihood estimation of larger models. Experimental results demonstrate that this approach outperforms existing methods on multimodal Theory of Mind (ToM) benchmarks and generalizes well to unseen scenarios.
Questions for Authors
I have some clarification questions:
- How does post-training affect generalization in unseen scenarios? In Table 4, the 4B-depth and 8B+70B models outperform the 70B post-trained model. Does this suggest that post-training limits the generalization of the 70B model, or that the smaller post-trained model enhances the generalization of larger models?
- What is the notation g1/g2 in Equation (2)?
- What are the results on text-only benchmarks? Given the framework's design, it should theoretically generalize well to text-based ToM tasks. Providing these results would strengthen the claim of generalizability.
- What is your model's performance on the scenarios depicted in Figure 1?
Claims and Evidence
N/A
Methods and Evaluation Criteria
N/A
Theoretical Claims
N/A
Experimental Design and Analysis
N/A
Supplementary Material
N/A
Relation to Prior Literature
The paper introduces a novel hierarchical modeling approach where specialized smaller models assist larger models, effectively balancing robustness and generalization while reducing the complexity of multistep training. The control mechanism in Equation (7) is particularly well-designed, allowing the language model to focus on differences between the base model and the post-trained model.
The experimental design is strong, including evaluations of models' performance over extended planning steps, tests on both text-only and multimodal benchmarks, and assessments in unseen scenarios.
Missing Essential References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We sincerely thank Reviewer DnoB for the insightful comments and valuable suggestions.
Q1: Does direct post-training limit the generalization of the large LM, or does guidance from smaller post-trained LMs enhance generalization?
A1:
Thank you for raising this insightful question. Our experiments (Tables 2 & 3) suggest that directly post-training the large LM (70B) achieves suboptimal generalization. Specifically, the directly post-trained 70B LM performs worse than the proposed method (8B+70B, 4B-depth+70B, and 4B-width+70B).
This indicates that direct post-training on large LMs may inadvertently diminish their inherent generalization capabilities, likely due to partial overwriting of implicit world knowledge and mental-state reasoning abilities acquired during extensive pretraining. This interpretation aligns closely with recent literature on generalization degradation and catastrophic forgetting in extensively fine-tuned large LMs [1,2,3].
In contrast, our proposed weak-to-strong control addresses this by post-training only small LMs (4B or 8B), which subsequently guide the larger LM exclusively at inference time without modifying its pretrained weights. Thus, these smaller models function as specialized lightweight controllers that effectively enhance ToM reasoning without compromising the broader pretrained capabilities. This strategy enables our framework to effectively balance task-specific adaptation with robust generalization, yielding better performance on previously unseen scenarios. We will include this discussion in our revised paper.
References:
- [1] “Overtrained Language Models Are Harder to Fine-Tune,” arXiv:2503.19206.
- [2] “Understanding Catastrophic Forgetting in Language Models via Implicit Inference,” ICLR 2024.
- [3] “Spurious Forgetting in Continual Learning of Language Models,” ICLR 2025.
Q2: The notation g1/g2 in Eq. (2)
A2:
In Equation (2), g1 and g2 represent two candidate goal hypotheses within our Bayesian inverse planning framework. Each hypothesis comprises a goal g and a corresponding belief b, jointly providing a potential explanation for the observed agent behavior. In our experimental setup (Sections 3 and 4), these hypotheses directly correspond to the provided multiple-choice answer options (e.g., options (a) and (b)). Equation (2) thus computes and compares the posterior likelihoods of these candidate hypotheses given observed states and actions. We will explicitly clarify this linkage near Equation (2) in the revised manuscript.
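In schematic form (an illustrative simplification rather than a verbatim copy of Eq. (2); priors and normalizers may differ in the paper), answer selection reduces to comparing posteriors:

```latex
% Illustrative posterior comparison between hypotheses g_1 and g_2,
% given observed states s_{1:T} and actions a_{1:T}:
\hat{g} \;=\; \arg\max_{g \in \{g_1,\, g_2\}} \; P\big(g \mid s_{1:T}, a_{1:T}\big),
\qquad
P\big(g \mid s_{1:T}, a_{1:T}\big) \;\propto\; P(g)\prod_{t=1}^{T} P\big(a_t \mid s_t, g\big)
```

The selected multiple-choice option is whichever hypothesis attains the larger posterior.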
Q3: Results on text-only benchmarks to strengthen the claim of generalizability?
A3:
Table C: https://anonymous.4open.science/r/response_tom-BD87/tableC.md
Following your valuable suggestion, in Table C, we conducted additional experiments specifically evaluating our method on text-only Theory-of-Mind tasks (the same MMToM-QA benchmark but without visual or multimodal inputs). We are pleased to provide these new results and will include them in our revised paper.
In these text-only evaluations, despite lacking the fine-grained temporal state transitions provided by multimodal observations, our model maintains robust performance. As our smaller LM was post-trained on activity data derived from the multimodal ToM simulator, it effectively leverages familiarity with underlying scenarios to accurately infer high-level ToM states from text alone, thereby strongly supporting our claim of generalization.
Q4: Performance on scenarios in Figure 1
A4:
Following your recommendation, we conducted additional evaluations of our weak-to-strong control framework specifically on scenarios depicted in the updated Figure 1 (link).
Our model (8B+405B) consistently outperforms the 405B with CoT on tasks ranging from 1-step to 7-step planning. For tasks exceeding 8 planning steps, performance between methods converges. This convergence occurs because these highly complex multi-step tasks fall beyond the distribution of the post-training dataset. As a result, successful reasoning at higher complexities increasingly depends upon the intrinsic pretrained grounding capabilities of the large LM. We will include this discussion and the updated Figure 1 in our revised paper.
The paper presents a Bayesian ToM method using stepwise belief updates and weak-to-strong LM transfer, unifying social and world knowledge to achieve 4.6% higher accuracy on multimodal tasks (including unseen settings) than prior approaches, resolving scalability/generalization trade-offs.
Update after rebuttal
The authors have addressed most of my concerns. Although minor issues remain, the paper is satisfactory overall, and I lean toward an accept recommendation.
Questions for Authors
- Could you elaborate on why chain-of-thought models like Deepseek or GPT O3 were not experimented with in your approach, given their recent prominence?
- What are the practical computational limitations of your Bayesian ToM planner in real-world, resource-constrained scenarios, and how might these affect its scalability?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
I verified the proof on decomposing ToM reasoning into stepwise Bayesian updates and the one on transferring ToM reasoning from smaller to larger language models. Overall, the proofs are sound, though some high-dimensional scalability assumptions need further clarification.
Experimental Design and Analysis
We reviewed the experimental design on multimodal benchmarks and the analysis reporting a 4.6% improvement. Overall, the setup is sound, but more details on data partitioning and confounding factors would be beneficial.
Supplementary Material
I reviewed the entire supplementary material.
Relation to Prior Literature
This work bridges Bayesian cognitive modeling (via modular belief updates) and large language models (LLMs) to overcome scalability and generalization issues in traditional Theory-of-Mind methods. Its novel weak-to-strong knowledge transfer integrates social reasoning with real-world knowledge in LLMs, achieving state-of-the-art accuracy in complex, unseen multimodal scenarios.
Missing Essential References
N/A
Other Strengths and Weaknesses
Strengths:
- The Bayesian ToM planner decomposes complex theory-of-mind reasoning into stepwise Bayesian updates. This design enables effective scaling across different model sizes (from 7B to 405B parameters), overcoming the typical scalability issues found in previous methods.
- The approach leverages a “weak-to-strong” control strategy by using smaller language models to refine ToM-specific likelihood estimates. These estimates are then integrated into larger models, which combines specialized reasoning with broader social and world knowledge.
- The method demonstrates a 4.6% accuracy gain over state-of-the-art techniques on multimodal ToM benchmarks, including in unseen scenarios, suggesting tangible benefits in real-world applications.
Weaknesses:
- Although the method is scalable, combining multiple models (smaller ones for likelihood estimation and larger ones for integration) may lead to increased computational complexity. This might limit its practicality for applications with strict real-time or resource constraints.
- Although experimental results are promising, it remains uncertain how well the approach will perform across the diverse and nuanced landscape of real-world social interactions.
- It is unclear why chain-of-thought models such as Deepseek or GPT O3 were not experimented with, given their recent prominence.
Other Comments or Suggestions
N/A
We sincerely thank Reviewer dLAb for their insightful comments and support.
Q1: Practicality under strict real-time/resource constraints?
A1:
Our method uses a small post-trained LM (4B/8B) to dynamically guide the large pretrained LM (70B/405B) at inference. Practically, both models comfortably fit on individual NVIDIA H100 GPUs (80GB, BF16 precision) and run in parallel with minimal synchronization (small likelihood tensors). For example, for the 8B+70B model, its inference for 600 tasks takes ~14–15.5 min (1.4–1.55s per question), nearly identical to a single unguided 70B model, as the extra computational overhead of the 8B model is almost negligible compared to the 70B model.
Furthermore, the large LM only performs likelihood estimation (prefilling ~1024 tokens), typically completed in ≤1 second per GPU [NVIDIA refs 1, 2]; correspondingly, the small LM requires only ~0.5 seconds per question on its own. Such prefilling tasks are highly amenable to acceleration tools (e.g., NVIDIA Dynamo, vLLM), making our Bayesian ToM planner practically suitable even under resource constraints.
Additionally, our method avoids the costly fine-tuning of large LMs. For example, directly fine-tuning a 405B model is practically infeasible for most institutions due to extreme GPU requirements (~50–64 Nvidia H100 GPUs). In contrast, our method can fine-tune an 8B model to guide a 405B model, requiring only a single H100 GPU and achieving superior performance. We will include this discussion in our revised paper.
Q2: Generalization to diverse, nuanced real-world social interactions?
A2:
Table A: https://anonymous.4open.science/r/response_tom-BD87/tableAB.md
We evaluated generalization to diverse, complex unseen scenarios (Tables 4, 9 &10), covering Andersen fairy tales, Ancient Egypt, Outer Space, Wild West, and Medieval Castle. Results consistently show stable, robust generalization.
Additionally, inspired by your feedback, we expanded our evaluation using MuMA-ToM [3], a benchmark explicitly designed for nuanced social interaction, including:
- Belief inference: Understanding environmental dynamics.
- Social Goal inference: Interpreting subtle social objectives.
- Belief-of-Goal inference: Attributing complex mental states.
Results in Table A show that our method performs competitively with the state-of-the-art GPT-4o-based LIMP [3] and outperforms all the other baselines. Note that this is achieved using open-source models, avoiding the expensive GPT-4o API cost required by LIMP [3]. Our weak-to-strong control robustly leverages large pretrained LMs, effectively adapting to real-world social reasoning without compromising generalization. We will include these new results in our revised paper.
Q3: Why were prominent CoT models (Deepseek, GPT O3) not included?
A3:
Table B: https://anonymous.4open.science/r/response_tom-BD87/tableAB.md
Thank you for highlighting CoT models. We initially did not include Deepseek R1 (671B) or GPT O3-mini (released in Jan 2025), as their releases were too close to the submission deadline (also in Jan 2025). Additionally, our multimodal ToM tasks prioritize implicit world knowledge and nuanced mental-state reasoning, which differ fundamentally from CoT models' strength in explicit logical reasoning.
Following your valuable suggestion, we conducted new evaluations in Table B, which clearly show the performance ranking:
Our method > Deepseek R1 > GPT O3-mini.
This demonstrates that the core challenge in multimodal ToM tasks is the depth and breadth of implicit mental-state and world knowledge—areas where large pretrained representations excel over logic-specialized CoT models. Further, our cognitive-inspired BIP framework effectively mitigates the overthinking/hallucination pitfalls observed in specialized logical reasoning models. We will include these new results in our revised paper.
Q4: (minor) The setup is sound, but more details on data partitioning and confounding factors would be beneficial.
A4:
We provide detailed dataset construction, partitioning strategies, and confound mitigation in Sec. 4.1 and App. D. Specifically, following MMToM-QA's setting, the training set is derived from 1,000 procedurally generated videos annotated with structured sequences (states, goals, beliefs, actions). The test set uses 600 questions derived from another 134 videos with entirely disjoint environments and narratives. To enhance clarity, we will add more details to App. D, and explicitly reference App. D and the accompanying repository readme file in the revised main paper.
References:
- [1] NVIDIA MLPerf AI Benchmarks. “Llama 2 70B: MLPerf Benchmark.”
- [2] NVIDIA Technical Blog. “Boost Llama 3.3 70B Throughput 3x with TensorRT.”
- [3] MuMA-ToM: Multi-modal Multi-Agent Theory of Mind, AAAI-25 (Oral).
The reviewers unanimously found the work to be well-written, the claims supported by strong evidence, and the experimental design sound. The reviewers found the core method (particularly the weak-to-strong guidance, which avoids costly fine-tuning of large LLMs while preserving their world knowledge) to be a strong contribution.
Reviewers raised several questions regarding computational cost, generalization beyond the tested scenarios, comparisons to alternative methods (CoT models, PEFT, full fine-tuning), and the precise contribution of the proposed control mechanism. The authors provided an exceptionally thorough and convincing rebuttal, and the reviewers generally found the majority of their concerns resolved.
Given the paper's novelty, technical strength, compelling results (demonstrating state-of-the-art performance and generalization), and strong reviewer consensus, I strongly recommend acceptance.