Ming-Hsuan Yang

Senior Staff Research Scientist @ Google DeepMind · United States · OpenReview

Research Interests

object tracking · object segmentation · scene parsing · low-level vision · image synthesis · vision and language · vision and learning · image generation · multimodal large model

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video

Streaming Autoregressive Video Generation via Diagonal Distillation

Pursuing Minimal Sufficiency in Spatial Reasoning

SoCo: Progressive Spectrum Optimization for Large Language Model Compression

Multi-Object System Identification from Videos

OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation

Efficiently Disentangling CLIP for Multi-Object Perception

SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

Efficient Degradation-agnostic Image Restoration via Channel-Wise Functional Decomposition and Manifold Regularization

SUBench: Benchmarking Spatial Understanding in Vision-Language Models

Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding

Scaling Laws for Deepfake Detection

Beyond the Shot: Rethinking Cinematography Understanding with Foundational Skill Evaluation

Structured Attention Matters to Multimodal LLMs in Document Understanding

No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model

RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything

Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video

RAPID Hand: Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Embodied Intelligence

DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

A Simple Approach to Unifying Diffusion-based Conditional Generation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

HQGS: High-Quality Novel View Synthesis with Gaussian Splatting in Degraded Scenes

IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

EA3D: Online Open-World 3D Object Extraction from Streaming Videos

HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

4KAgent: Agentic Any Image to 4K Super-Resolution

Ranking-aware adapter for text-driven image ordering with CLIP

RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection

OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models

Kitten: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

Layout-your-3D: Controllable and Precise 3D Generation with 2D Blueprint

Customized Procedure Planning in Instructional Videos

Three-Dimensional Trajectory Prediction with 3DMoTraj Dataset

Gaga: Group Any Gaussians via 3D-aware Memory Bank

Hierarchical Information Flow for Generalized Efficient Image Restoration

Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models

PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

PrML: Progressive Multi-Task Learning for Monocular 3D Human Pose Estimation

VideoAlchemy: Open-set Personalization in Video Generation

HALO: Human-Aligned End-to-end Image Retargeting with Layered Transformations

RelationBooth: Towards Relation-Aware Customized Object Generation

Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration