Hao Fei

Postdoc @ University of Oxford · United Kingdom · OpenReview

Research Interests

Multimodal Learning · Natural Language Processing · Generative AI · Large Language Model · Vision-Language Learning · Multimodal Large Language Model · Affective Computing · Video Generation · World Model · Multimodal Reasoning

LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision

Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

So-Fake: Benchmarking Social Media Image Forgery Detection

Unveiling the Cognitive Compass: Theory-of-Mind–Guided Multimodal Emotion Reasoning

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Interpreting Any Condition to Caption for Controllable Video Generation

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

UniVA: Universal Video Agents towards Next-Generation Video Intelligence

When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements

SMAP: Self-supervised Motion Adaptation for Physically Plausible Humanoid Whole-body Control

On Path to Multimodal Generalist: General-Level and General-Bench

MuSLR: Multimodal Symbolic Logical Reasoning

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs

Probing then Editing Response Personality of Large Language Models

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Grounding is All You Need? Dual Temporal Grounding for Video Dialog