Impact Index

72.54/100

Top 2.2%

Site-wide rank #1,448

Papers published: 29

Average rating: 5.4

Average output: 9.7 papers/year

Xu Tan

Principal Researcher @ Microsoft · China · OpenReview

Research Areas

Language · Speech and Audio · Text to Speech · Machine Translation · Speech Recognition · AI Music · Talking Face Synthesis

7.0

Audio-FLAN: An Instruction-Following Dataset for Unified Understanding and Generation of Speech, Music, and Sound

ICLR 2026 · Rejected

3.2

Alignment Does Matter: Enables Pure-Speech-Token Dialogue with Frozen Text LLMs

ICLR 2026 · Rejected

Second author

7.2

ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling

ICML 2025 · Poster

6.8

Chain-of-Model Learning for Language Model

NeurIPS 2025 · Poster

Third author

6.5

MuPT: A Generative Symbolic Music Pretrained Transformer

ICLR 2025 · Poster

6.4

MoonCast: High-Quality Zero-Shot Podcast Generation

NeurIPS 2025 · Poster

Corresponding author

5.0

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

ICLR 2025 · Rejected

4.8

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

ICLR 2025 · Withdrawn

4.0

GETMusic: Generating Music Tracks with a Unified Representation and Diffusion Framework

ICLR 2025 · Withdrawn

Second author

3.7

Semantic-Aware Diffusion Model for Sequential Recommendation

ICLR 2025 · Withdrawn

3.5

Sparse Training: Do All Tokens Matter for Long Sequence Generalization?

ICLR 2025 · Withdrawn

-1

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

ICLR 2025 · Desk Rejected

Collaborators (20)

Xu Tan

AudioX: A Unified Framework for Anything-to-Audio Generation

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Vox-Infinity: Benchmarking the Limits of Long-Context Spoken Language Models

Audio-FLAN: An Instruction-Following Dataset for Unified Understanding and Generation of Speech, Music, and Sound

Alignment Does Matter: Enables Pure-Speech-Token Dialogue with Frozen Text LLMs

ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling

Chain-of-Model Learning for Language Model

MuPT: A Generative Symbolic Music Pretrained Transformer

MoonCast: High-Quality Zero-Shot Podcast Generation

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

GETMusic: Generating Music Tracks with a Unified Representation and Diffusion Framework

Semantic-Aware Diffusion Model for Sequential Recommendation

Sparse Training: Do All Tokens Matter for Long Sequence Generalization?

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis