Impact Index

95.75/100

Top 0.2%

Site-wide rank #151

Papers published: 68

Average rating: 5.4

Average annual output: 22.7 papers/year

Ge Zhang

Researcher @ ByteDance Inc. · China · OpenReview

Research Areas

Natural Language Processing · Information Retrieval · Recommender System

In-Place Test-Time Training

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

ICLR 2026 · Poster

Reformulation for Pretraining Data Augmentation

ICLR 2026 · Poster

YuE: Scaling Open Foundation Models for Long-Form Music Generation

ICLR 2026 · Poster

WideSearch: Benchmarking Agentic Broad Info-Seeking

ICLR 2026 · Poster

Reverse-Engineered Reasoning for Open-Ended Generation

ICLR 2026 · Poster

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

ICLR 2026 · Poster

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

ICLR 2026 · Poster

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

ICLR 2026 · Poster

Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution

ICLR 2026 · Poster

TaskCraft: Automated Generation of Agentic Tasks

ICLR 2026 · Poster

TreePO: Enhancing Policy Efficacy and Inference Efficiency with Tree Modeling

ICLR 2026 · Rejected

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

ICLR 2026 · Poster

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

ICLR 2026 · Desk Rejected

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

ICLR 2026 · Poster

P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

ICLR 2026 · Poster

A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

ICLR 2026 · Poster

VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking

ICLR 2026 · Rejected

DiscoX: Benchmarking Discourse-Level Translation in Expert Domains

ICLR 2026 · Poster

Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

ICLR 2026 · Rejected

Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

ICLR 2026 · Rejected

ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

ICLR 2026 · Poster

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

ICLR 2026 · Poster

AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

ICLR 2026 · Rejected

Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs

ICLR 2026 · Withdrawn

Towards Personalized Deep Research: Benchmarks and Evaluations

ICLR 2026 · Poster

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

ICLR 2026 · Withdrawn

Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

ICLR 2026 · Rejected

SciDA: Scientific Dynamic Assessor of LLMs

ICLR 2026 · Rejected

Audio-FLAN: An Instruction-Following Dataset for Unified Understanding and Generation of Speech, Music, and Sound

ICLR 2026 · Rejected

VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

ICLR 2026 · Withdrawn

CryptoX: Compositional Reasoning Evaluation of Large Language Models

ICLR 2026 · Withdrawn

First return, entropy-eliciting explore

ICLR 2026 · Rejected

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

ICLR 2026 · Withdrawn

VideoScore2: Think before You Score in Generative Video Evaluation

ICLR 2026 · Withdrawn

COIG-Writer: A High-Quality Chinese Creative Writing with Thought Process Dataset

ICLR 2026 · Withdrawn

LPFQA: A Long-Tail Professional Forum-based Benchmark for LLMs' Evaluation

ICLR 2026 · Withdrawn

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

NeurIPS 2025 · Spotlight

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

ICLR 2025 · Poster

FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Exploration

NeurIPS 2025 · Poster

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

ICLR 2025 · Poster

VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text

ICLR 2025 · Poster

McEval: Massively Multilingual Code Evaluation

ICLR 2025 · Poster

MuPT: A Generative Symbolic Music Pretrained Transformer

ICLR 2025 · Poster

General-Reasoner: Advancing LLM Reasoning Across All Domains

NeurIPS 2025 · Poster

LIME: Less Is More for MLLM Evaluation

ICLR 2025 · Rejected

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

ICLR 2025 · Poster

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

ICLR 2025 · Rejected

OmniBench: Towards The Future of Universal Omni-Language Models

ICLR 2025 · Rejected

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

ICLR 2025 · Poster

M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

ICLR 2025 · Rejected

MIO: A Foundation Model on Multimodal Tokens

ICLR 2025 · Rejected

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

ICLR 2025 · Rejected

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

ICLR 2025 · Withdrawn

KARPA: A Training-free Method of Adapting Knowledge Graph as References for Large Language Model's Reasoning Path Aggregation

ICLR 2025 · Withdrawn

ING-VP: MLLMs Cannot Play Easy Vision-based Games Yet

ICLR 2025 · Rejected

Can MLLMs Understand the Deep Implication Behind Chinese Images?

ICLR 2025 · Withdrawn

Collaborators (20)

Wangchunshu Zhou