PaperHub
Overall rating: 7.3 / 10
Spotlight · 4 reviewers
Ratings: 4, 5, 4, 5 (min 4, max 5, std 0.5)
Average confidence: 3.0
Novelty: 2.8 · Quality: 3.3 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Reasoning Model, Large Language Model, Reinforcement Learning, Logical Reasoning

Reviews and Discussion

Review
Rating: 4

This paper introduces ENIGMATA, designed to enhance Large Language Models' logical reasoning capabilities, particularly for solving puzzles. The ENIGMATA suite consists of three main components: ENIGMATA-Data, a dataset of 36 diverse puzzle tasks across seven categories, each equipped with an automated generator for scalable, difficulty-controlled examples and a rule-based verifier for automatic evaluation; ENIGMATA-Eval, a benchmark derived from this data; and ENIGMATA-Model, a training recipe combining rejection fine-tuning and multi-task Reinforcement Learning with Verifiable Rewards (RLVR). The authors demonstrate that their trained model, Qwen2.5-32B-ENIGMATA, achieves good performance on ENIGMATA-Eval, ARC-AGI, and out-of-domain puzzle benchmarks, while maintaining strong mathematical reasoning abilities. The work aims to provide a unified, controllable framework for advancing logical reasoning in LLMs.

Strengths and Weaknesses

[Strengths]

ENIGMATA-Data features 36 distinct puzzle tasks across 7 broad categories.

The proposed ENIGMATA-Model recipe, which combines rejection fine-tuning (RFT) to establish reasoning patterns and multi-task Reinforcement Learning (RL) for general reasoning skills, is good.

The trained Qwen2.5-32B-ENIGMATA model demonstrates good performance.

The model seems to exhibit good generalization to out-of-domain puzzle benchmarks (KOR-Bench) and maintain good performance in mathematical reasoning (AIME 24).

[Weaknesses]

The paper admits that 6 of the 36 tasks draw from "fixed pools" and some instances "rely on manually collected and annotated data rather than auto-generation". Would this inconsistency undermine the claim of universal scalability and fully automated data generation/verification across the entire ENIGMATA suite?

Despite implementing multi-stage RL to mitigate task conflicts, the paper notes "more volatile response length under Multi-Stage RL suggests instability in generation". This indicates that the proposed training strategies might not fully resolve the inherent challenges of simultaneously optimizing for diverse reasoning skills.

The analysis explicitly states that "code utilization actually hindered performance on puzzle tasks" and that "current models fail to effectively use code without executing it for complex reasoning tasks". This probably reveals a technical limitation.

Questions

Please refer to the weaknesses.

Limitations

yes

Final Justification

Thanks for the authors' rebuttal, which addresses most of my concerns. I keep my positive score.

Formatting Issues

N.A.

Author Response

Thank you for the encouraging and constructive feedback! We appreciate your recognition of the Enigmata benchmark, our proposed Enigmata-Model training paradigm, and the model's demonstrated generalization ability.

Here are the responses to your concerns

W1: Inconsistent Data Generation — 6 of 36 tasks rely on fixed pools or manual annotation. Would this undermine the claim of universal scalability and automation?

A1: Thank you for this question. We would like to clarify that the vast majority of Enigmata tasks (30 out of 36) are fully auto-generated with configurable parameters, controllable difficulty, and programmatic verification. These auto-generated environments cover all seven domains and form the core of our training and evaluation pipeline.

For the remaining six tasks that rely on fixed pools or partial manual annotation, we chose to include them because they exhibit unique structural challenges that are currently difficult to procedurally generate with high quality. These tasks still provide valuable diversity and difficulty for evaluating general reasoning, even if not fully automated. Rather than undermining scalability, we believe this design reflects a practical trade-off between automation and task complexity. We are actively working toward expanding auto-generation coverage for these remaining categories.
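To make the generator/verifier interface described above concrete, here is a minimal hypothetical sketch in Python; the task, class names, difficulty mapping, and prompt format are illustrative assumptions for exposition, not the actual Enigmata implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    prompt: str       # natural-language puzzle statement shown to the model
    solution: tuple   # ground-truth constraints, used only by the verifier

class SubsetSumPuzzle:
    """Hypothetical auto-generated task: pick k distinct numbers that sum to a target."""

    def generate(self, difficulty: str, seed: int) -> PuzzleInstance:
        # Difficulty is controlled by the size of the search space.
        rng = random.Random(seed)
        k = {"easy": 3, "medium": 5, "hard": 7}[difficulty]
        pool = sorted(rng.sample(range(1, 50), 2 * k))
        target = sum(rng.sample(pool, k))
        prompt = (f"From the numbers {pool}, choose exactly {k} distinct numbers "
                  f"whose sum is {target}. Answer with a comma-separated list.")
        return PuzzleInstance(prompt=prompt, solution=(set(pool), k, target))

    def verify(self, instance: PuzzleInstance, model_output: str) -> float:
        # Rule-based verifier: returns a binary reward usable for RLVR.
        pool, k, target = instance.solution
        try:
            picked = [int(x) for x in model_output.replace(" ", "").split(",")]
        except ValueError:
            return 0.0
        ok = (len(picked) == k and len(set(picked)) == k
              and set(picked) <= pool and sum(picked) == target)
        return 1.0 if ok else 0.0

task = SubsetSumPuzzle()
inst = task.generate(difficulty="medium", seed=0)
print(inst.prompt)
print(task.verify(inst, "1, 2, 3, 4, 5"))
```

A constraint-checking verifier of this form accepts any valid answer rather than matching a single reference solution, which is what makes such rewards usable for RL without human grading.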

W2: Multi-Stage RL Training Instability. The paper notes that response length is more volatile under multi-stage RL, suggesting possible instability when optimizing diverse reasoning tasks.

A2: Thank you for raising this point. We would like to clarify that the observed variability in response length is not a sign of training failure, but rather a reflection of the different inductive behaviors of the two training paradigms analyzed in Section 4.3. We compare two strategies under the same total training budget:

  • Multi-Stage RL: better at acquiring task-specific skills, especially for complex tasks not seen in the SFT phase, such as ARC-AGI and Enigmata.
  • Mix-Training RL: shows superior generalization to OOD tasks, thanks to broader exposure and more diverse reasoning supervision.

W3: Code Utilization Fails to Improve Reasoning. The paper states that code utilization actually hindered performance, and current models fail to use code effectively without execution.

A3: Yes, and thank you for pointing this out. In our experiments, we investigated whether prompting models to write code (without execution feedback) would help in complex symbolic puzzles. We found that this did not improve and sometimes worsened performance, likely because without an actual interpreter or reasoning agent loop, models lacked feedback on correctness. This aligns with our broader hypothesis: current LLMs, when used purely as reasoning models, cannot fully leverage coding prompts without execution. We are actively working on extensions toward agent-based settings, where generated code can be executed, verified, and refined, leading to more meaningful gains. We believe Enigmata can serve as a strong resource for developing such RL-augmented or tool-augmented agents.

Comment

Thanks for the authors' rebuttal, which addresses most of my concerns. I keep my positive score.

Comment

We appreciate your valuable feedback and thoughtful comments. Since it is near the end of the discussion period, could you kindly let us know if we have managed to address your concerns? Should you have any further questions, kindly let us know. We are more than happy to have further discussions with you. Thanks

Review
Rating: 5

The authors present a new dataset of puzzle tasks for the evaluation of Large Reasoning Models (LRM), along with an accompanying benchmark and training procedure for fine-tuning a model on the task. The dataset consists of 36 puzzle tasks across 7 high-level categories, and most tasks also feature a generator with a scalable difficulty parameter. The authors demonstrate that current state-of-the-art LRMs do not solve the ENIGMATA benchmark and also show that their fine-tuned open-weight model performs comparably to the best closed-weight models on the benchmark. The paper concludes with a deeper analysis of the fine-tuned model, including various ablation studies.

Strengths and Weaknesses

This is a strong paper. The authors introduce a broad dataset of puzzle tasks and effectively demonstrate that it represents a reasonable challenge for current models. I’m also impressed with the thoroughness of the training procedure for ENIGMATA-Model and the accompanying experiments, which help immensely in understanding the effects of the various decisions / hyperparameters. My sense is that the benchmark could easily prove useful to the larger research community. However, I must also admit that I’m not the most knowledgeable about the various reasoning benchmarks that already exist -- it may be worth the authors providing additional explanation of why the ENIGMATA benchmark is a useful test suite in the context of other (potentially more challenging) evaluations like ARC-AGI 2.

I would also strongly encourage the authors to include some indication of the statistical significance of their experiments. In the checklist, the authors indicate that this information is present in Section 5.3 -- however, this section only concerns the ablation studies and does not mention anything about variance or statistical significance. The tables and figures also appear to be missing error bars / indications of variance. I don’t feel that this critique sinks the paper, given the scale of the other contributions, but I again would urge the authors to address it.

I also encourage the authors to cite this relevant paper on the automatic generation of programming puzzles:

[1] Pourcel, Julien, et al. "ACES: Generating Diverse Programming Puzzles with Autotelic Language Models and Semantic Descriptors." NeurIPS 2024.

Questions

  • How would you propose addressing the low performance of models in the Graph and Sequential categories? Is this something that can be solved with more data?

  • What is the main benefit of using the ENIGMATA benchmark over other similar reasoning benchmarks?

Limitations

Yes

Final Justification

The authors clarified my outstanding questions and I feel the paper remains a strong candidate for publication. As such, I am not changing my score.

Formatting Issues

I did not notice any major formatting issues

Author Response

Thank you for the positive and encouraging feedback! We are especially grateful for your recognition of our benchmark design, thorough experimentation, and the potential value of Enigmata to the broader research community.

Here are the responses to your concerns

W1: Benchmark Comparison – Why Enigmata over existing benchmarks like ARC-AGI 2? I'm not deeply familiar with all reasoning benchmarks—could the authors clarify why Enigmata is useful in the context of other (potentially more challenging) test suites?

A1: Thank you for the thoughtful question. We agree that ARC-AGI and similar benchmarks are valuable for evaluating reasoning capabilities. However, they are primarily designed for evaluation, with small and static datasets, minimal training examples, and limited task diversity, making them less suitable for developing or analyzing reasoning capabilities through learning.

In contrast, Enigmata supports both training and evaluation. It provides a broad set of automatically generated, knowledge-free puzzle tasks across diverse reasoning types, with verifiable rewards, difficulty scaling, and RL-compatible interfaces. These features enable scalable training, curriculum learning, and structured generalization analysis.

Importantly, our experiments show that models trained on Enigmata improve on ARC-AGI and other challenging tasks, demonstrating its utility as a foundation for building generalizable reasoning abilities. We will clarify this distinction more explicitly in the revised version.

W2: Statistical Significance and Error Bars. The authors mention significance in Section 5.3, but it only covers ablations. Tables and figures seem to lack error bars or variance estimates.

A2: We appreciate this careful observation. In fact, our evaluation protocol already incorporates repeated sampling to ensure statistical reliability. Specifically, we evaluate each AIME problem 32 times and all other benchmarks 4 times, reporting average performance across these runs. This setup is designed to reduce the effect of stochastic decoding (we use temperature = 0.7) and improve the stability of reported metrics. In response to your helpful suggestion, we have now conducted additional statistical significance tests and included error bars or confidence intervals in key tables and figures in the revised version, making variance across runs more explicit. The following table summarizes the results of one-sample t-tests on several datasets, based on our final model checkpoints. Each test compares the model's mean accuracy against a null hypothesis of 0.5 (random chance):

| Dataset | Mean Accuracy | Std Error | 95% CI Lower | 95% CI Upper | t-statistic (vs 0.5) | p-value | Significance (α=0.05) |
|---|---|---|---|---|---|---|---|
| AIME | 0.5375 | 0.0161 | 0.5059 | 0.5691 | 2.3291 | 0.0201 | Significant |
| ARC-AGI | 0.3646 | 0.0157 | 0.3338 | 0.3953 | -8.6397 | 0.0000 | Significant |
| Enigmata-Eval | 0.6562 | 0.0157 | 0.6255 | 0.6870 | 9.9708 | 0.0000 | Significant |
| KORBench | 0.6260 | 0.0290 | 0.5689 | 0.6831 | 4.3478 | 0.0000 | Significant |
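For readers who want to reproduce this kind of per-benchmark analysis, a minimal sketch of the computation is below, assuming the per-run accuracies are available as a list; the example numbers are placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

def summarize_runs(per_run_accuracy, null_mean=0.5, confidence=0.95):
    """One-sample t-test of repeated-evaluation accuracies against a fixed null mean."""
    acc = np.asarray(per_run_accuracy, dtype=float)
    mean = acc.mean()
    sem = stats.sem(acc)                      # standard error of the mean
    df = len(acc) - 1
    ci_low, ci_high = stats.t.interval(confidence, df, loc=mean, scale=sem)
    t_stat, p_value = stats.ttest_1samp(acc, null_mean)
    return {"mean": mean, "sem": sem, "ci": (ci_low, ci_high),
            "t": t_stat, "p": p_value, "significant": p_value < 1 - confidence}

# Illustrative placeholder: accuracies from 4 evaluation runs of one benchmark.
print(summarize_runs([0.64, 0.66, 0.65, 0.67]))
```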

W3: Citation of ACES (Pourcel et al., NeurIPS 2024). I encourage the authors to cite this related work on the automatic generation of programming puzzles.

A3: Thank you for the pointer! We will cite the ACES paper (Pourcel et al., NeurIPS 2024) and briefly compare it with Enigmata.

Here are the responses to your question

Q1: Low Performance in Graph and Sequential Domains. Can this be solved with more data?

A1: Great question. We observe that both Graph and Sequential domains require multi-step planning, structured memory, and sometimes intermediate tool use—challenges that current models (even SOTA) struggle with. Our current hypothesis is two-fold:

  • For Sequential puzzles, we plan to explore multi-turn RL approaches, enabling models to iteratively refine and revise partial plans, rather than relying on single-pass generation. This aligns better with the inherent temporal and decision-state dependencies of these tasks.
  • For Graph tasks, we are investigating the use of graph-centric supervised fine-tuning, which may better capture relational structures and multi-hop dependencies. While scaling up training data may provide incremental gains, we believe that improvements in architectural bias and temporal credit assignment will be essential to make substantial progress in these domains.

Q2: What is the main benefit of Enigmata over other reasoning benchmarks?

A2: Thank you for the question. Enigmata stands out by combining scalability, diversity, and trainability in a unified benchmark. Compared to existing puzzle reasoning datasets, it covers the largest number of task categories (7), offers a substantially larger number of tasks (36), provides automatic instance generation and verification, and uniquely supports both reinforcement learning and curriculum-based training through the RLVR paradigm.

As summarized in Table 2, Enigmata is the only benchmark that simultaneously offers diversity, scalability, verifiable reward signals, and structured training support—making it especially suitable for studying how reasoning capabilities can be acquired and generalized through learning.

Comment

Thank you for the response and for clearing up some of my questions. I retain my score and continue to advocate for acceptance. I encourage the authors to integrate the statistical test results into the paper itself, as well.

Review
Rating: 4

This paper studies the problems of reinforcement learning from verifiable feedback in the context of logical puzzles. The paper introduces a collection of logical puzzles, called ENIGMATA, for training and evaluating LLMs. ENIGMATA contains 36 logical puzzles across 7 categories in total, where each of the puzzles supports generation of problems with controlled difficulty and rule-based evaluators for reliable evaluation.

Using ENIGMATA, the paper trains a Qwen2.5-32B model, which achieves strong performance on logical puzzles and also shows some generalization to math domains. The paper also provides some analysis, such as the impact of training data size and the impact of SFT training.

Strengths and Weaknesses

Strengths

The collected data suite, ENIGMATA, contains over 30 puzzles of diverse types. The tasks have some nice properties: the ability to generate new examples, configurable difficulty, and reliable evaluation. This collection can be useful for the research community in the field of RLVR for reasoning.

The authors also conduct experiments running RL over the logical puzzles, which show good in-domain results. The experiments cover multiple settings, which provide some insights into when SFT is necessary.

The paper is mostly well-written and easy to follow.

Weakness

The experiments only cover one model from one family at one scale. It is unclear how the results would generalize across different model families.

The paper claims AIME as an OOD evaluation, though it also uses some math data for training (ln 161). I feel that, to some extent, this is not completely OOD.

The paper presents a collection of games and an empirical study on using them for training models. I feel the paper could benefit from more analysis on 1) generalization across games and 2) generalization to more realistic settings, e.g., what kinds of games enable generalization to what types of OOD games.

While the paper compiles a set of tasks, most of them can be found in existing resources. The paper does not add new capabilities for testing LLMs.

Questions

See weakness for questions.

Lastly, I have one practical concern (not affecting the score): the practical value of this work may be overshadowed by more comprehensive collections of these kinds of tasks, such as Reasoning Gym (containing over 100 tasks). The paper might consider some analysis to set it apart from similar work.

Limitations

yes

Final Justification

I thank the authors for the updates and have increased my score a bit. At the same time, this work largely reuses existing resources and presents a series of empirical studies. No new techniques are proposed, and the insights are somewhat limited. Therefore, I am giving a score slightly leaning toward the positive side to reflect the engineering effort presented in this work.

Formatting Issues

N/A

Author Response

Thank you for the detailed feedback and for recognizing the value of the Enigmata benchmark suite, our RL experiments, and the overall clarity of the paper.

Here are the responses to your concerns

W1: The experiment only contains one model of one family and one scale. It is unsure how the results could generalize across different model families.

A1: Thank you for raising this. While our main experiments were conducted using Qwen2.5-32B, we have extended our evaluation to include models with different architectures and scales to assess the generality of our findings. In particular, we conducted additional experiments using a proprietary mixture-of-experts (MoE) model (with 20B activated parameters and 200B total parameters, anonymized during review). This model differs substantially from Qwen in both size and structure. As shown in the table below, adding Enigmata to the MoE model's training corpus leads to consistent and significant gains across both in-domain (Enigmata-Eval) and math and STEM reasoning benchmarks such as AIME and GPQA:

| Model | Enigmata-Eval | AIME 2024 | AIME 2025 | GPQA Diamond |
|---|---|---|---|---|
| MoE-Model | 52.9 | 86.7 | 74.0 | 77.3 |
| MoE-Model w/ Enigmata | 64.6 (+11.7) | 87.5 (+0.8) | 75.9 (+1.9) | 78.1 (+0.8) |

W2: The paper claims AIME as OOD evaluation, though some math data is also used in training, making this not completely OOD.

A2: We would like to clarify that AIME is not presented as an out-of-distribution (OOD) evaluation in our paper. Our actual OOD evaluation is conducted on KORBench, which comprises symbolic reasoning tasks from different task families and domains, with no overlap with our training data. We will make this clearer in the revision to avoid potential misunderstandings.

W3: Generalization Across Games and Realistic Settings. The paper would benefit from more analysis on (1) generalization across games, and (2) generalization in more realistic settings—e.g., what kinds of games help generalize to which types of OOD tasks.

A3: Thank you for this valuable suggestion. We address both aspects of your question through two complementary experiments:

(1) Cross-Game Generalization within Enigmata: We conducted a domain-level transfer experiment, where models were trained using supervised fine-tuning on only one puzzle domain, and evaluated across the remaining domains. This allows us to quantify how reasoning skills acquired in one domain transfer to others. Among all domains, the Search category produces the strongest average transfer to other domains (+13.65). This suggests that puzzles involving combinatorial exploration, planning, and constraint satisfaction can cultivate general reasoning capabilities applicable across task types. These results also offer practical guidance for curriculum-efficient training, where a subset of domains can be selected to maximize generalization.

| Training Domain | Crypto | Arithmetic | Logic | Grid | Graph | Search | Sequential | Avg. Improvement |
|---|---|---|---|---|---|---|---|---|
| Search | +6.67 | +14.37 | +14.71 | +11.22 | +25.97 | (+26.38) | +8.96 | +13.65 |
| Sequential | +10.67 | +18.37 | +12.49 | +6.33 | +21.61 | +3.88 | (+9.16) | +12.23 |
| Grid | +9.33 | +10.70 | +13.38 | (+25.07) | +25.25 | +1.37 | +9.16 | +11.53 |
| Graph | +5.33 | +15.03 | +10.04 | +5.07 | (+37.79) | +4.38 | +7.38 | +7.87 |
| Crypto | (+51.33) | +16.03 | +10.93 | +4.70 | +16.15 | +0.38 | +7.77 | +9.33 |
| Arithmetic | +4.33 | (+58.70) | +8.49 | -1.97 | +11.25 | -1.12 | +4.90 | +4.31 |
| Logic | +0.33 | +2.37 | (+18.27) | -2.11 | +6.70 | -0.50 | +3.21 | +1.67 |
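As a side note on how the "Avg. Improvement" column is aggregated, the small sketch below reproduces the convention used in the table: the average is taken over the six out-of-domain columns, excluding the in-domain entry shown in parentheses. The values are copied from the Search row above.

```python
# Average improvement = mean over the other six domains, excluding the in-domain entry.
search_row = {"Crypto": 6.67, "Arithmetic": 14.37, "Logic": 14.71, "Grid": 11.22,
              "Graph": 25.97, "Search": 26.38, "Sequential": 8.96}

out_of_domain = [v for d, v in search_row.items() if d != "Search"]
avg_improvement = sum(out_of_domain) / len(out_of_domain)
print(round(avg_improvement, 2))  # 13.65, matching the reported value
```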

(2) Transfer to Realistic OOD Settings: To examine transfer to distributionally different tasks, we trained a model only on the Enigmata puzzle data, without any exposure to ARC-AGI or math-specific data, and then evaluated it on ARC-AGI, KORBench, and AIME 24. The performance is summarized below:

| Model | ARC-AGI 1 | ARC-AGI 2 | Enigmata-Eval | KORBench | AIME 24 |
|---|---|---|---|---|---|
| Qwen2.5-32B-RFT | 8.4 | 0.0 | 46.6 | 61.0 | 62.0 |
| Qwen2.5-32B w/ Enigmata | 15.3 | 0.1 | 62.6 | 64.0 | 58.5 |
| Qwen2.5-32B w/ Enigmata+ARC-AGI+AIME | 32.8 | 0.6 | 62.6 | 65.0 | 60.6 |

The Enigmata-only model shows non-trivial improvements over baseline on ARC-AGI and KORBench, despite never seeing these benchmarks during training. This indicates that puzzle training alone can yield partial generalization to unseen, structurally different tasks, even in out-of-domain scenarios. The gains observed here affirm Enigmata's value for reasoning models.

W4: Novelty of Enigmata as a Benchmark. While the paper compiles a set of tasks, most can be found in existing resources. The paper does not seem to add new capabilities for testing LLMs.

A4: Thank you for the thoughtful feedback. While many Enigmata tasks are inspired by classic puzzles, our contribution lies in turning them into a unified, diverse, and RL-compatible benchmark. Most existing resources offer only small, static datasets—suitable for evaluation but insufficient for large-scale training or systematic analysis. Enigmata addresses this by supporting automatic generation, rule-based verification, and controllable difficulty across 36 tasks and 7 reasoning domains, enabling reinforcement learning, curriculum design, and generalization studies.

Second, Enigmata enables new ways to test LLMs by going beyond static evaluation. It supports active training, difficulty scaling, and complex reasoning, which together allow systematic analysis of how models learn, adapt, and generalize across tasks. Importantly, Enigmata focuses on symbolic puzzle reasoning, which, unlike knowledge-intensive tasks, targets core reasoning skills without relying on prior knowledge or memorization. This makes it a valuable resource for isolating and understanding the fundamental reasoning abilities of LLMs. We will clarify this distinction more explicitly in the revised version.

Here are the responses to your question

Q1: Comparison with Reasoning Gym. The practical value of this benchmark may be overshadowed by more comprehensive collections like Reasoning Gym. The paper might consider analysis to set it apart.

A1: Thank you for the thoughtful comment. Enigmata and Reasoning Gym were developed independently as contemporary works, so our original submission did not include a detailed comparison. We appreciate the suggestion and will add one in the revision.

The two benchmarks differ in focus: Reasoning Gym emphasizes broad cognitive coverage and is primarily designed for supervised evaluation, whereas Enigmata targets knowledge-free, puzzle-style reasoning and supports RL-based training via automatic generation, verifiable rewards, and controllable difficulty. We view the two as complementary in goals and design. Enigmata's unique value lies in enabling scalable training and fine-grained analysis of how LLMs acquire and generalize core reasoning skills.

Comment

Thank you for your response and added analysis. I will increase my score to the positive side.

Comment

We appreciate your valuable feedback and thoughtful comments. Since it is near the end of the discussion period, could you kindly let us know if we have managed to address your concerns? Should you have any further questions, kindly let us know. We are more than happy to have further discussions with you. Thanks

Review
Rating: 5

The paper introduces Enigmata, a comprehensive and scalable benchmark suite aimed at enhancing puzzle reasoning in LLMs. It features 36 tasks across 7 categories, each equipped with a deterministic rule-based verifier, enabling large-scale multi-task training and integration with Reinforcement Learning with Verifiable Rewards (RLVR). The authors also propose Enigmata-Eval as a rigorous evaluation benchmark and develop multi-task RLVR strategies. Their trained model, Qwen2.5-32B-Enigmata, outperforms leading models like o1 and o3-mini-high on several puzzle and math reasoning benchmarks.

Strengths and Weaknesses

The model demonstrates impressive performance across a comprehensive suite of puzzle and math benchmarks, even performing on par with proprietary LLMs. The writing is clear and accessible, and the method and analysis are presented in a simple yet effective manner. However, the paper may lack key experiments, as follows:

Transferability of Puzzle Training Data Alone: It’s encouraging to see that combining puzzle and math tasks in the proposed training framework yields strong performance across both domains. However, it remains unclear how effective puzzle training data alone is at supporting generalization, especially given that many puzzles involve complex, long-horizon reasoning and mathematical computation. A deeper investigation into the standalone impact of puzzle data on reasoning capabilities would be valuable. Also, identifying which specific puzzle domains contribute most to transferable skills could guide future efforts in data selection for promoting general reasoning abilities.

Cross-Domain Transferability in Puzzle Games: It would also be insightful to explore whether training on one puzzle domain facilitates generalization to others. Since many puzzles share underlying cognitive skills (e.g., constraint satisfaction, pattern recognition), cross-domain transfer is a promising area of inquiry. A data efficiency study, such as identifying the most representative subsets of puzzles for skill acquisition, could further illuminate how to build compact yet powerful training curricula.

Transfer Across Difficulty Levels: The dataset spans tasks of varying difficulty: easy, medium, hard. An interesting question is whether models trained only on easy tasks can generalize to harder ones, which was observed in other domains such as SimpleRL (Zeng et al., 2025). Studying this easy-to-hard generalization would provide insight into how task difficulty impacts skill development and transfer.

Questions

Choice of VC-PPO: The paper adopts VC-PPO, but it’s unclear why this particular variant was chosen over alternatives such as VC-GRPO or VC-DAPO. A more comprehensive comparison with other common RL algorithms would help clarify whether VC-PPO is especially well-suited for this setting or whether other methods might perform similarly or better.

Limitations

See Weaknesses part.

Final Justification

Thank you for the comprehensive evaluation results! I’m particularly impressed by the range of transferability experiments. It’s worth highlighting that puzzles can be highly effective in enhancing other skills as well. I’ll go ahead and raise my score to 5.

Formatting Issues

N/A

Author Response

Thank you for the insightful feedback! We appreciate your recognition of the model's strong performance across both puzzle and math benchmarks, as well as the clarity of our presentation.

Here are the responses to your concerns

W1: Transferability of Puzzle Training Data Alone. It remains unclear how effective puzzle training data alone is at supporting generalization. A deeper investigation into the standalone impact of puzzle data on reasoning capabilities would be valuable.

A1: Thanks for highlighting this. To better evaluate the standalone effect of puzzle data, we conducted an experiment where the model was trained only on the Enigmata dataset, without any additional math or external reasoning tasks. The results are shown in the table below. We observe that training with Enigmata puzzle data alone not only substantially improves performance on Enigmata-Eval, but also yields strong generalization to out-of-domain puzzle benchmarks such as ARC-AGI 1, ARC-AGI 2, and KORBench during the RL phase.

| Model | ARC-AGI 1 | ARC-AGI 2 | Enigmata-Eval | KORBench | AIME 24 |
|---|---|---|---|---|---|
| Qwen2.5-32B-RFT | 8.4 | 0.0 | 46.6 | 61.0 | 62.0 |
| Qwen2.5-32B w/ Enigmata | 15.3 | 0.1 | 62.6 | 64.0 | 58.5 |
| Qwen2.5-32B w/ Enigmata+ARC-AGI+AIME | 32.8 | 0.6 | 62.6 | 65.0 | 60.6 |

Furthermore, we observe similar trends at a larger scale using a proprietary MoE model (20B activated parameters, 200B total parameters, anonymized during review). By adding Enigmata to the original full training corpus (MoE-Model+Enigmata), we observe consistent performance improvements across multiple reasoning benchmarks, including AIME and GPQA for math and STEM reasoning.

| Model | Enigmata-Eval | AIME 2024 | AIME 2025 | GPQA Diamond |
|---|---|---|---|---|
| MoE-Model | 52.9 | 86.7 | 74.0 | 77.3 |
| MoE-Model w/ Enigmata | 64.6 (+11.7) | 87.5 (+0.8) | 75.9 (+1.9) | 78.1 (+0.8) |

W2: Cross-Domain Transferability in Puzzle Games. It would be insightful to explore whether training on one puzzle domain facilitates generalization to others. A data efficiency study could help identify compact training curricula.

A2: We appreciate your suggestion. To directly evaluate cross-domain transferability, we conducted a new experiment where models were trained via SFT on only one puzzle domain, and then evaluated on the remaining six. This setup allows us to measure how effectively reasoning skills learned in one domain transfer to others. The results are summarized below:

| Training Domain | Crypto | Arithmetic | Logic | Grid | Graph | Search | Sequential | Avg. Improvement |
|---|---|---|---|---|---|---|---|---|
| Search | +6.67 | +14.37 | +14.71 | +11.22 | +25.97 | (+26.38) | +8.96 | +13.65 |
| Sequential | +10.67 | +18.37 | +12.49 | +6.33 | +21.61 | +3.88 | (+9.16) | +12.23 |
| Grid | +9.33 | +10.70 | +13.38 | (+25.07) | +25.25 | +1.37 | +9.16 | +11.53 |
| Graph | +5.33 | +15.03 | +10.04 | +5.07 | (+37.79) | +4.38 | +7.38 | +7.87 |
| Crypto | (+51.33) | +16.03 | +10.93 | +4.70 | +16.15 | +0.38 | +7.77 | +9.33 |
| Arithmetic | +4.33 | (+58.70) | +8.49 | -1.97 | +11.25 | -1.12 | +4.90 | +4.31 |
| Logic | +0.33 | +2.37 | (+18.27) | -2.11 | +6.70 | -0.50 | +3.21 | +1.67 |

As shown, training on the Search domain leads to the strongest average transfer (+13.65) across other domains. This suggests that Search puzzles—characterized by long-horizon, combinatorial reasoning—promote general cognitive skills such as constraint satisfaction, backtracking, and planning. These skills appear to transfer well to a variety of symbolic reasoning tasks, making Search a promising candidate for curriculum-efficient training. We believe this analysis provides a concrete foundation for selecting representative domains to construct compact yet generalizable reasoning datasets.

W3: Transfer Across Difficulty Levels. An interesting question is whether models trained only on easy tasks can generalize to harder ones, which was observed in other domains such as SimpleRL (Zeng et al., 2025). Studying this easy-to-hard generalization would provide insight into how task difficulty impacts skill development and transfer.

A3: We appreciate you raising this insightful point. We conducted targeted experiments to explore easy-to-hard generalization, particularly within the ARC-AGI benchmark. We trained models using ARC-AGI data with two different prompt length limits: 4K (representing "easy" short-context training) and 6K (medium-length). We then evaluated performance on test samples grouped by their required context lengths. As shown below, models trained with shorter prompts struggled significantly on longer/harder inputs, demonstrating that generalization across difficulty levels remains limited:

| Train Prompt Len | Model | Overall Performance | Test Prompt Len: (0, 4096] | Test Prompt Len: (4096, 6144] | Test Prompt Len: (6144, ∞) |
|---|---|---|---|---|---|
| 6k | arc-agi1 | 39.6 | 49.35% | 17.31% | 7.33% |
| 6k | arc-agi2 | 0.3 | 0.64% | 0.00% | 0.00% |
| 4k | arc-agi1 | 32.3 | 43.28% | 3.37% | 0.00% |
| 4k | arc-agi2 | 0.1 | 0.32% | 0.00% | 0.00% |
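A minimal sketch of how accuracy can be bucketed by test prompt length is shown below; the records and bucket boundaries are illustrative placeholders, not the actual evaluation code or data.

```python
from collections import defaultdict

# Each record is (prompt_length_in_tokens, is_correct) for one test sample;
# the values here are placeholders for illustration only.
records = [(1500, True), (3800, True), (5200, False), (7000, False)]
buckets = [(0, 4096), (4096, 6144), (6144, float("inf"))]

def bucket_accuracy(records, buckets):
    hits, totals = defaultdict(int), defaultdict(int)
    for length, correct in records:
        for lo, hi in buckets:
            if lo < length <= hi:          # half-open buckets (lo, hi]
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(correct)
                break
    return {b: hits[b] / totals[b] for b in buckets if totals[b]}

print(bucket_accuracy(records, buckets))
```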

Furthermore, we conducted qualitative analysis of DeepSeek-R1 solving maze puzzles at different difficulty levels. In easy cases, models adopt a greedy, forward-only strategy with no need for revision or backtracking. In contrast, hard instances require global search, recursive planning, and backtracking, occasionally involving code-style planning logic. Below is an excerpted comparison:

// Easy
  - Starting at (1,1). Possible moves: right is blocked (B at (1,2)), down is (2,1). So first move down to (2,1).
  - From (2,1), possible moves: up is back to start, down to (3,1), right to (2,2). Let's go right to (2,2).
  - At (2,2), possible moves: up is (1,2) which is B, down to (3,2), left to (2,1), right to (2,3) which is B. So move down to (3,2).
  - At (3,2), possible moves: up to (2,2), down to (4,2), right to (3,3) which is B. Let's move down to (4,2).
  - At (4,2), possible moves: up to (3,2), down to (5,2), right to (4,3) which is B. So move down to (5,2).
  - Now at (5,2). Possible moves: up to (4,2), right to (5,3), left to (5,1). Let's go right to (5,3).
  - At (5,3), can move right to (5,4), then to (5,5) which is E. So path would be:
  - (1,1) -> (2,1) -> (2,2) -> (3,2) -> (4,2) -> (5,2) -> (5,3) -> (5,4) -> (5,5)
// Hard
  -  ...
  - Neighbors:
  - Up: (0,0) visited.
  - Down: (2,0). maze[2][0] is '.', not visited. Add to queue. Path: [(1,1), (2,1), (3,1)]. Mark visited[2][0] = True.
  - Right: (1,1). maze[1][1] is '.', not visited. Add to queue. Path: [(1,1), (2,1), (2,2)]. Mark visited[1][1] = True.
  - Left: invalid.
  - So queue now has (0,1), (2,0), (1,1).
  - Next, process (0,1) (1,2 in 1-based).
  - Check if end? No.
  - ....
  - Let me retrace.
  - Possible path:
  - Start at (1,1) → (2,1) → (3,1) → (3,2) → (3,3) → (3,4) → (4,4) → (4,5) → (5,5).
  - Wait, (4,4) is array[3][3], which is B. So that's not possible. Hmm.
  - Wait, the third row (array index 2) is ['.', '.', '.', '.', 'B']. So (3,4) is array[2][3], which is '.' (coordinates (3,4)). Then moving down to (4,4) (array[3][3] which is B) is blocked. So that's a dead end.
  - Alternative path from (3,4) could be right to (3,5), but that's B. So can't go there.
  - So maybe from (3,3), move right to (3,4), then up to (2,4) (but array[1][3] is B), so no.
  - Wait, let's think of another approach. Maybe go from (1,1) right to (1,2), then down.
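The hard-case trace above mirrors a breadth-first search over the grid. For reference, a minimal BFS maze solver in that spirit could look like the sketch below; grid cells are '.' for open and 'B' for blocked, coordinates are 1-based (row, col) as in the trace, and the code and example maze are illustrative, not taken from the paper.

```python
from collections import deque

def solve_maze(maze, start=(1, 1), end=None):
    """BFS over a grid maze; returns a shortest path as a list of 1-based (row, col) cells."""
    rows, cols = len(maze), len(maze[0])
    end = end or (rows, cols)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == end:
            return path
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # up, down, left, right
            nr, nc = r + dr, c + dc
            if (1 <= nr <= rows and 1 <= nc <= cols
                    and maze[nr - 1][nc - 1] != "B" and (nr, nc) not in visited):
                visited.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None  # no path exists

# Small 5x5 example grid; 'B' marks walls, start is (1,1), goal is (5,5).
maze = [".B...",
        "..B..",
        "..B..",
        "..B..",
        "....."]
print(solve_maze(maze))
```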

Responses to your question

Q1: VC-PPO vs. Other RL Variants. The choice of VC-PPO over VC-GRPO or VC-DAPO is unclear. A broader comparison would help assess its suitability.

A1: Great question. We do not assume any requirements on the adopted RL algorithm in our work. In early-stage experiments, we also tried GRPO; however, we ultimately adopted VC-PPO due to its faster convergence, better runtime efficiency, and smoother learning dynamics. Given limited time and resources, we chose VC-PPO as the standard algorithm for ablating over data and training strategies.

Comment

We appreciate your valuable feedback and thoughtful comments. Since it is near the end of the discussion period, could you kindly let us know if we have managed to address your concerns? Should you have any further questions, kindly let us know. We are more than happy to have further discussions with you. Thanks

Final Decision

The paper introduces a collection of puzzles, called ENIGMATA, for training and evaluating the reasoning capability of LLMs. ENIGMATA contains 36 puzzle tasks across 7 categories in total. Besides the benchmark dataset, the paper also studies different training strategies (combinations of RFT, mix-training, and multi-stage RL) and demonstrates significant improvements in reasoning capability from the proposed training recipes on ENIGMATA and ARC-AGI 1/2.

The dataset introduced in this paper is regarded as highly valuable by all reviewers. The study of different training strategies is also insightful. At the same time, reviewers point out some room for improvement in the experiments: in particular, only the Qwen2.5-32B model is studied, and evaluating more models from other vendors (e.g., DeepSeek V3, Llama models) would strengthen the work. A more in-depth investigation into the standalone impact of puzzle data on reasoning capabilities for other general tasks would be a good plus.