为您找到 52,807 篇相关研究
全部论文
为您找到 52,807 篇相关研究
23
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly in mathematics and programming tasks.
It is widely believed that, similar to how traditional RL helps agents to explore and learn new strategies, RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the capacity of the corresponding base models.
In this study, we take a critical look at the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across diverse model families, RL algorithms, and math/coding/visual reasoning benchmarks, using pass@k at large k values as the evaluation metric.
While RLVR improves sampling efficiency towards the correct path, we surprisingly find that current training does not elicit fundamentally new reasoning patterns.
We observe that while RLVR-trained models outperform their base models at smaller values of $k$ (e.g., $k$=1), base models achieve higher pass@$k$ score when $k$ is large.
Moreover, we observe that the reasoning capability boundary of LLMs often narrows as RLVR training progresses.
Further coverage and perplexity analysis shows that the reasoning paths generated by RLVR models are already included in the base models' sampling distribution, suggesting that their reasoning abilities originate from and are bounded by the base model.
From this perspective, treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in fully leveraging the potential of the base model.
In contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model’s reasoning capabilities.
Taken together, our findings suggest that current RLVR methods have not fully realized the potential of RL to elicit genuinely novel reasoning abilities in LLMs. This underscores the need for improved RL paradigms—such as continual scaling and multi-turn agent-environment interaction—to unlock this potential.
NeurIPS 2025Oral
Yang Yue et al.35
ICLR 2025Oral
Lvmin Zhang et al.29
ICLR 2025Oral
Xiangyu Qi et al.13
We obtain an analytic, interpretable and predictive theory of creativity in convolutional diffusion models. Indeed, score-matching diffusion models can generate highly original images that lie far from their training data. However, optimal score-matching theory suggests that these models should only be able to produce memorized training examples. To reconcile this theory-experiment gap, we identify two simple inductive biases, locality and equivariance, that: (1) induce a form of combinatorial creativity by preventing optimal score-matching; (2) result in fully analytic, completely mechanistically interpretable, local score (LS) and equivariant local score (ELS) machines that, (3) after calibrating a single time-dependent hyperparameter can quantitatively predict the outputs of trained convolution only diffusion models (like ResNets and UNets) with high accuracy (median $r^2$ of $0.95, 0.94, 0.94, 0.96$ for our top model on CIFAR10, FashionMNIST, MNIST, and CelebA). Our model reveals a {\it locally consistent patch mosaic} mechanism of creativity, in which diffusion models create exponentially many novel images by mixing and matching different local training set patches at different scales and image locations. Our theory also partially predicts the outputs of pre-trained self-attention enabled UNets (median $r^2 \sim 0.77$ on CIFAR10), revealing an intriguing role for attention in carving out semantic coherence from local patch mosaics.
ICML 2025Oral
Mason Kamb et al.12
NeurIPS 2025Oral
Morris Yau et al.13
ICML 2025Oral
Clément Bonet et al.11
ICML 2025Oral
Lorenzo Lucchese et al.9
ICML 2025Oral
Amber Yijia Zheng et al.10
ICML 2025Oral
Aaditya K Singh et al.18
ICLR 2025Oral
Cheng Lu et al.18
NeurIPS 2025Spotlight
Qiujiang Jin et al.17
We consider the following question: given a submodular or supermodular set function $f:2^V \to \mathbb{R}$, how should one minimize or maximize its average value $f(S)/|S|$ over non-empty subsets $S\subseteq V$? This problem generalizes several well-known objectives including Densest Subgraph (DSG), Densest Supermodular Set (DSS), and Submodular Function Minimization (SFM). Motivated by recent applications [39, 31], we formalize two new broad problems: the Unrestricted Sparsest Submodular Set (USSS) and Unrestricted Densest Supermodular Set (UDSS) which allow negative and non-monotone functions.
Using classical results we observe that DSS, SFM, USSS, UDSS, and MNP are all equivalent under strongly polynomial-time reductions. This equivalence enables algorithmic cross-over: methods designed for one problem can be repurposed to solve others efficiently. In particular we use the perspective of the minimum norm point in the base polyhedron of a sub/supermodular function which, via Fujishige's results, yields the dense decomposition as a byproduct. Via this perspective we show that a recent converging heuristic for DSS, \textsc{SuperGreedy++} [15, 29], and Wolfe’s minimum norm point algorithm are both universal solvers for all of these problems.
On the theoretical front, we explain the observation made in recent work [39, 31] that \textsc{SuperGreedy++} appears to work well even in settings beyond DSS. Surprisingly, we also show that this simple algorithm can be used for Submodular Function Minimization, including for example that it can act as an effective minimum $st$ cut algorithm.
On the empirical front, we explore the utility of several different algorithms including Fujishige-Wolfe min-norm point algorithm for recent problems. We conduct over 400 experiments across seven problem types and large-scale synthetic and real-world datasets (up to $\approx 100$ million edges). Our results reveal that methods historically considered inefficient, such as convex-programming methods, flow-based solvers, and Fujishige-Wolfe’s algorithm, outperform state-of-the-art task-specific baselines by orders of magnitude on concrete problems like HNSN [39]. These findings challenge prevailing assumptions and demonstrate that with the right framing, general optimization algorithms can be both scalable and state-of-the-art for supermodular and submodular ratio problems.
NeurIPS 2025Spotlight
Elfarouk Harb et al.26
NeurIPS 2025Spotlight
Weijia Shi et al.19
NeurIPS 2025Spotlight
Jiajun Shi et al.19
Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present $Mesh-RFT$, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6% and improves Topology Score (TS) by 3.8% over pre-trained models, while outperforming global DPO methods with a 17.4% HD reduction and 4.9% TS gain. These results demonstrate Mesh-RFT’s ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation.
NeurIPS 2025Spotlight
Jian Liu et al.19
NeurIPS 2025Spotlight
Ziqiao Peng et al.16
We resolve a 30-year-old open problem concerning the power of unlabeled data in online learning by tightly quantifying the gap between transductive and standard online learning. We prove that for every concept class $\mathcal{H}$ with Littlestone dimension $d$, the transductive mistake bound is at least $\Omega(\sqrt{d})$. This establishes an exponential improvement over previous lower bounds of $\Omega(\log \log d)$, $\Omega(\sqrt{\log d})$, and $\Omega(\log d)$, respectively due to Ben-David, Kushilevitz, and Mansour (1995, 1997) and Hanneke, Moran, and Shafer (2023). We also show that our bound is tight: for every $d$, there exists a class of Littlestone dimension $d$ with transductive mistake bound $O(\sqrt{d})$. Our upper bound also improves the previous best known upper bound of $(2/3) \cdot d$ from Ben-David et al. (1997). These results demonstrate a quadratic gap between transductive and standard online learning, thereby highlighting the benefit of advanced access to the unlabeled instance sequence. This stands in stark contrast to the PAC setting, where transductive and standard learning exhibit similar sample complexities.
NeurIPS 2025Oral
Zachary Chase et al.16
NeurIPS 2025Spotlight
Jang Hyun Cho et al.42
Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.
ICLR 2025Oral
Sihyun Yu et al.23
ICLR 2025Oral
Takeru Miyato et al.共 52,807 篇论文,第 1 / 2641 页