PaperHub
Overall rating: 8.2/10
Decision: Poster · 4 reviewers
Reviewer ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.3
Novelty: 3.0 · Quality: 3.8 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

Generating Computational Cognitive models using Large Language Models

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
large language models, cognitive computational models, cognitive science, learning, decision making, neuroscience

Reviews and Discussion

Review (Rating: 5)

This paper builds computational cognitive models using LLMs. It describes the method for generating computational cognitive models, called GeCCo, and then applies it to four main experimental domains: (1) decision making, (2) learning, (3) planning, and (4) working memory. It then evaluates the effects of different control experiments and finishes with a discussion of the model discovery capabilities.

Strengths and Weaknesses

Strengths

  • The proposed method has strong empirical performance across different cognitive tasks and models.
  • The proposed method relies only on in-context learning and is tested with open-source models, contributing to its accessibility and reproducibility.
  • Control experiments, ablation studies, and ground truth recovery analysis add rigor to the results.
  • Good experimental setup for the four task classes. Clearly written, with a clear value proposition and experimental setup.
  • The ablation studies on prompt components were useful, and Figure 9 was especially interesting. Curious as to how variance in these components (e.g., how each portion of the prompt is formulated) also affects things.

Weaknesses

  • The paper focuses on evaluating the predictive performance of the LLM-generated models, but it does not assess their scientific value. Do the models offer new cognitive insights or reinforce existing findings? Are the underlying mechanisms theoretically plausible? The paper could benefit from additional qualitative analysis or evaluations by domain experts to help establish the scientific utility and interpretability of the generated models.
  • Using BIC as the only model selection criterion may bias toward simpler models. Including alternative measures (e.g., AIC, cross-validation) would help support the paper’s claims about the models’ generalizability.
  • The paper fixes the number of iterations to 10 but does not analyze how model quality evolves across iterations. It would be useful to examine whether performance plateaus or continues improving, and whether fewer iterations can be used.
  • Would love more detail in the method section to fully understand the approach, with a more expanded and detailed explanation. The current description is so information dense that it is difficult to understand some of the exact methodology and algorithm without source code or a more structured framework describing the approach and its specifics. An appendix of some exemplar models would be helpful for understanding the underlying methods a bit better.

Questions

How do some of the other popular models perform on these tasks (e.g., models from OpenAI / Anthropic / Google)?

I see you performed prompt ablations, but how did you optimize and arrive at your specific prompt form for each section?

What about using a reasoning model without a code template (since those don't rely as much on few-shot examples)?

Limitations

Yes.

Final Justification

This is a well written paper and I maintain my original score recommendation after reading the rebuttals. It was a technically solid paper and contained some good contributions that are worth sharing with the field.

Formatting Concerns

None

Author Response

Dear reviewer YeVa,

Thank you for the comments and questions. We are glad you found that the ‘proposed method has strong empirical performance’ and that by relying on in-context learning and open-source models, it contributes to accessibility and reproducibility. We are also happy that you found the experimental setups for the four cognitive domains ‘clearly written’ along with their ‘value proposition’. Below we discuss the specific concerns that you raised and how we have sought to address them.

  1. Addressed scientific value, plausibility, and insights from GeCCo models
  2. Ran an experiment with GeCCo using an alternative fitness metric
  3. Tested early stopping instead of a fixed iteration count
  4. Expanded the explanation of our approach
  5. Responded to the question on closed-source models
  6. Explained how we derived the prompt structure
  7. Ran an experiment with a reasoning model without a code template

1. Scientific value, plausibility, and insights from GeCCo models

Each GeCCo-generated model was carefully reviewed by domain experts and deemed plausible. The cognitive parameters used (e.g., forgetting rate, lapse, loss aversion, stickiness) were reasonable, and the modeled processes aligned with domain expectations. While scientific value is more subjective, we summarize our key takeaways from each model below.

  • Decision making: The GeCCo-generated model offers a parsimonious explanation of possible decision-making strategies. It features an elegant implementation that enables arbitration between different heuristics through a single parameter, highlighting the model’s interpretability and compactness.

  • Learning: The GeCCo-generated model separates two sets of Q-values: one updated from chosen outcomes and another counterfactual set from forgone outcomes. While these are often combined with asymmetric learning rates, the GeCCo model keeps them distinct, raising questions about how experienced and counterfactual values are weighted. It also includes forgetting for unchosen actions, introducing temporal asymmetry in value retention, and avoids using choice stickiness, which is often added primarily to improve model fit in this task.

  • Planning: The two-step task is typically modeled with a hybrid model-based/model-free (MB/MF) framework. The GeCCo-generated model captures key behavioral patterns using general principles of transition learning and value updating, without explicitly separating MB and MF systems. This offers an alternative perspective, aligning with recent encouragement from cognitive scientists to move beyond strict adherence to the MB/MF dichotomy (Collins & Cockburn, 2020).

  • Working memory: The GeCCo-generated model attributed cognitive load effects to the reinforcement learning process rather than to working memory mechanisms. While unconventional, this aligns with findings that reward prediction errors (RPEs)—linked to striatal RL signals—are reduced under low cognitive load (Collins et al., 2017), suggesting RL processes can be, at least indirectly, shaped by information load.

2. Alternative fitness metric

We agree that different model selection criteria (e.g., AIC) can be used to construct feedback for GeCCo. Following your suggestion, we ran the GeCCo pipeline using AIC-based feedback in two domains: decision making and planning.

In the decision-making domain, the best AIC-favored model included three parameters—temperature, discount factor, and an option difference penalty—differing from the BIC-selected model by the inclusion of one additional parameter. The AIC model outperformed the literature baseline (t(52) = 41.40, p < .001), but underperformed relative to the BIC-based GeCCo model (t(52) = 29.73, p < .001).

In the planning domain, the best AIC model included four parameters: learning rate, softmax temperature, forgetting rate, and policy weight. As in the decision-making domain, it differed from the BIC-selected model by one parameter. The AIC model's fit did not differ significantly from the literature baseline (t(15) = 1.36, p = .09) or the BIC-based GeCCo model (t(15) = 0.95, p = .82).

These results suggest that using AIC instead of BIC may lead to slightly more complex models, but the differences in model structure and overall fit are modest. The takeaway is that the GeCCo pipeline is compatible with either fitness metric for feedback generation, leaving it to the researcher to explore and choose the criterion that best suits their modeling goals. Regarding cross-validation, our current implementation already evaluates each model on held-out data—that is, on participants whose data are not included in either the prompt or the feedback optimization process.
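For readers less familiar with these criteria, the standard definitions are worth stating (here k is the number of free parameters, n the number of observations, and \hat{L} the maximized likelihood):

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}, \qquad \mathrm{AIC} = 2k - 2 \ln \hat{L}
```

Because ln n exceeds 2 once n > 7, BIC penalizes each additional parameter more strongly than AIC, which is consistent with AIC-based feedback selecting slightly more complex models.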

3. Fixed iterations and early stopping of GeCCo

We direct the reviewer to Figure 11 in the Appendix, which shows how GeCCo’s performance improves across iterations in two domains. BIC scores tend to decrease, then plateau, and do not necessarily improve after that. We will reference this figure in the main text to ensure it’s not overlooked.

To improve efficiency, we tested an early stopping version of GeCCo that halts when a generated model matches or outperforms the baseline model's BIC. Applied to the planning domain, this approach produced a model that fit the data comparably to both the literature model (t(15) = 1.36, p = 0.09) and the fixed-iteration GeCCo model (t(15) = 2.08, p = 0.07). This model included stage-dependent learning rates, a decision temperature, and a transition learning parameter.

These results suggest GeCCo does not require a fixed number of iterations, though using more iterations may benefit researchers interested in model diversity. On average, early stopping required just 2 iterations across 5 runs.

4. More detailed description of the approach

We attempt to further clarify our approach below as a step-by-step guideline. In the camera-ready version, we will:

  • Include the final best model for each domain in the Appendix
  • Make our code publicly available
  • Use the additional page to offer further methodological details as needed

GeCCo Pipeline Overview:

  • Step 1: Prompt construction (see Figure 6 in the Appendix)
    We construct a prompt that includes:

    • (a) a description of the experimental task
    • (b) data from a subset of participants
    • (c) instructions for the code-generation task—specifying the function name, expected input arguments, return values, etc.
    • (d) an example model (provided as a function template) to guide output structure and format
  • Step 2: Model generation
    Using the prompt from Step 1, we prompt the LLM to generate three distinct cognitive models with non-overlapping parameter sets. These models are parsed from the LLM response, converted into executable Python functions (e.g., using exec(function_string)), and parameter names are extracted.

  • Step 3: Model fitting and evaluation
    Each model is fit to a second, held-out subset of participant data (i.e., not used in the prompt). We evaluate model fit using the Bayesian Information Criterion (BIC), and optimize using scipy.optimize.minimize with the L-BFGS-B algorithm, initialized from 20 random starting points (see the illustrative sketch after the Notes below).

  • Step 4: Feedback construction
    We identify the best-performing model based on BIC:

    • On the first iteration, this is the best of the three newly generated models
    • On later iterations, it is the best model across all iterations so far
    • This model and a list of previously used parameter names are included in the next prompt as feedback to guide generation and avoid duplication
  • Step 5: Iterations
    Steps 1–4 are repeated for 10 iterations. At each iteration > 1, model generation is guided by feedback, and the best-so-far model is updated accordingly.

  • Step 6: Replication across 5 runs
    The full pipeline (Steps 1–5) is run five times to assess stability.

Notes:

  • If model generation fails (e.g., due to syntax or runtime errors), the iteration is automatically restarted.
  • The best model from each full run is tested on a third subset of the data. These are the results reported in the main paper.
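To make Steps 2 and 3 concrete, the sketch below shows how a model returned as a string can be turned into an executable function, fit with L-BFGS-B from multiple starting points, and scored with BIC. This is an illustrative, self-contained reconstruction rather than our actual code: the example "LLM response", the synthetic data, and all variable names are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

# --- Step 2 (sketch): a candidate model, as it might appear in an LLM response ---
llm_function_string = """
def model(params, choices, rewards, n_options=2):
    import numpy as np
    alpha, beta = params                      # learning rate, inverse temperature
    q = np.zeros(n_options)
    nll = 0.0
    for choice, reward in zip(choices, rewards):
        probs = np.exp(beta * q) / np.sum(np.exp(beta * q))
        nll -= np.log(probs[choice] + 1e-12)  # accumulate negative log-likelihood
        q[choice] += alpha * (reward - q[choice])
    return nll
"""
namespace = {}
exec(llm_function_string, namespace)          # turn the string into a callable
model = namespace["model"]

# --- Placeholder "held-out participant" data (synthetic, for illustration only) ---
rng = np.random.default_rng(0)
choices = rng.integers(0, 2, size=150)
rewards = (rng.random(150) < np.where(choices == 0, 0.7, 0.3)).astype(float)

# --- Step 3 (sketch): fit with L-BFGS-B from multiple random starting points ---
best = None
for _ in range(20):
    x0 = rng.uniform([0.01, 0.1], [0.99, 10.0])
    res = minimize(model, x0, args=(choices, rewards),
                   method="L-BFGS-B", bounds=[(0.01, 0.99), (0.1, 10.0)])
    if best is None or res.fun < best.fun:
        best = res

k, n = 2, len(choices)                        # number of parameters, observations
bic = k * np.log(n) + 2 * best.fun            # BIC = k ln n - 2 ln L_hat
print(f"best NLL = {best.fun:.2f}, BIC = {bic:.2f}")
```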

5. Using closed-source models

We were unable to consider closed-source models like GPT, Claude, or Gemini in this work due to cost constraints and our commitment to making GeCCo accessible to the broader scientific community. While it’s possible that these models might yield better performance, we believe it is our responsibility to optimize the pipeline using open-source models that are freely accessible and reproducible.

6. Deriving the prompt structure

We identified key informational components based on the needs of computational modeling: a description of the task, participant data, and priors over possible cognitive models. Additional components (such as function specifications and guardrails for the LLM) were designed based on what worked best with our text parsers. For example, we structured these elements to enable reliable function extraction from the LLM's responses. Once finalized, we used a fixed prompt structure across all domains (see Figure 6), varying only the task descriptions and corresponding data.
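Purely for illustration, the fixed structure can be thought of as a simple concatenation of these components; the section texts below are placeholders rather than our actual prompt (see Figure 6 for the real wording):

```python
# Hypothetical illustration of the fixed prompt structure; all wording is a
# placeholder and not the prompt used in the paper (see Figure 6 for that).
def build_prompt(task_description: str, data_block: str, code_template: str) -> str:
    sections = [
        "## Task description\n" + task_description,
        "## Participant data\n" + data_block,
        "## Code-generation instructions\n"
        "Write a Python function that returns the negative log-likelihood "
        "of the data. Return only the function code.",
        "## Function template\n" + code_template,
    ]
    return "\n\n".join(sections)
```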

7. Reasoning model without a code template

As suggested, we ran the GeCCo pipeline using a reasoning model (R1 distilled Llama, for consistency) without a predefined template in the decision-making domain. The best model included a single parameter—decision temperature—and achieved a BIC comparable to the literature model (mean = 53.7, SEM = 5.56; t(52) = 0.48, p = 0.31), but performed notably worse than the best model generated with a template. This suggests that while reasoning LLMs can produce meaningful models without templates, doing so does not necessarily yield better results.

Comment

I thank the authors for their thoughtful responses and clarifications. I will maintain my score, which reflects a well written paper submission.

Comment

Thank you for taking the time and effort to provide detailed comments on our paper!

Review (Rating: 5)

Cognitive models, in the sense of this work, are interpretable generative models with a small number of parameters that aim to explain human behavior in an experiment. Here, the authors prompt LLMs to write a Python program to explain the data from a human experiment. The prompting method is quite simple, works for open weights models, and yields interpretable models that have a better fit to human data.

Strengths and Weaknesses

STRENGTHS

The paper is well-written and the framing is compelling.

The problem is important: an automated way to speed up model discovery could accelerate science in general and cognitive science in particular.

In four case studies, the paper shows that the method is effective in finding interpretable models that explain the human data well, sometimes better than the best existing hand-crafted models.

Thorough follow-up experiments and ablations isolate the components of the prompting pipeline and the LLMs that make them most effective at this task.

The fit to human data is measured carefully using BIC.

WEAKNESSES

From a methodological perspective, the novelty is somewhat limited. Prior work has shown that LMs can be prompted to construct programs that explain data, especially the Li et al paper on statistical model discovery (cited in this manuscript). So if I understand correctly the main novelty here is the application to cognitive science, but this isn't made fully clear in the introduction. I am happy to be corrected on this and would appreciate author engagement.

From the cognitive science perspective, it was hard for me to appreciate what new insight emerged from the automatically constructed models. There are four case studies, each on a different cognitive domain, and there's not enough background information on any of them to convey to a reader who's not already familiar with those tasks what the LLM-generated cognitive model discovered and how it differs from existing models. Perhaps one of the case studies is a slam dunk for the method, but this should be made clearer; one option could be to move one or two of the case studies to the appendix and focus on the biggest success stories.

The tasks and resulting cognitive models are quite simple. I would have appreciated more detail on the synthetic data experiments, which could provide a nice way to determine whether the method can scale to more complex datasets.

Questions

It is hard to tell if the differences in prompting strategy between the current paper and the Li et al model discovery paper were significant, since no attempt is made to compare the new prompting pipeline to the Li et al one. Could you provide some clarifications on the comparison?

Limitations

Yes.

Final Justification

I thank the authors again for engaging with my comments! To summarize, I found the paper good but not excellent: on the methodological front, the idea is very sensible but seems quite related to earlier work on using LLM prompting for automatic model discovery; on the scientific front, because there are so many qualitative case studies for a short paper, it was a little hard for me to understand what insight was delivered by the method. That being said, it's a good proof of concept for automatic model discovery in cognitive science, and certainly above the acceptance threshold in my view.

Formatting Concerns

None.

Author Response

Dear Reviewer L7y4,

Thank you for your comments. We are glad that you found the paper 'well-written’, the framing ‘compelling’, and the experiments ‘thorough’. We are particularly happy that you found the problem of automating cognitive model generation ‘important’. Below, we detail the comments, concerns, and questions you raised and how we sought to address them. We believe that resolving them has further improved the clarity and contribution of our paper.

  1. Highlighted scientific insights from GeCCo-generated models
  2. Addressed the inclusion of all domain results
  3. Elaborated on the comparison to the Li et al. (2024) paper
  4. Tested GeCCo performance in a more complex setting using synthetic data

1. Scientific insights from GeCCo-generated models

The value of the GeCCo approach is in its ability to generate alternative models that explain behavioral data while achieving comparable—if not better—fit and predictive accuracy compared to the best domain-specific models in the literature. Below, we summarize domain-specific insights from GeCCo-generated models. These points are briefly mentioned in the respective results sections, and we would be happy to expand on them further in the camera-ready version:

  • Decision Making: The GeCCo-generated model uses only two parameters. Notably, it introduces a single discount term that enables the model to arbitrate between different heuristics (e.g., weighted additive vs. take-the-best), without requiring these heuristics to be explicitly encoded in the model. This results in a more parsimonious explanation of decision-making strategies.
  • Learning: The GeCCo-generated model distinguishes between two sets of Q-values: one tracking values updated based on chosen outcomes, and a separate counterfactual (meta) set updated from forgone outcomes. While these are typically combined into a single value representation with asymmetric learning rates, GeCCo separates them, raising interesting questions about how counterfactual and experienced values might be weighted during decision-making. Additionally, the model incorporates forgetting for unchosen actions, introducing a temporal asymmetry in how value information is retained, and sidesteps the need for choice stickiness, which is often included primarily to improve model fit, rather than out of theoretical interest.
  • Two-Step Task (Planning): The model produced by GeCCo avoids the standard model-based vs. model-free dichotomy, yet still captures plausible strategies participants may use. For example, it accounts for participants differentiating between common and rare transitions, and updating reward expectations accordingly, but without relying on the traditional components of the reinforcement learning framework. This aligns with recent calls in cognitive science (e.g., see Collins & Cockburn, 2020) to look beyond canonical constructs and explore alternative computational explanations.
  • Working Memory: Traditional models attribute set size effects (i.e., cognitive load) to working memory limitations. In contrast, the GeCCo-generated model attributes these effects to the RL system, specifically by allowing the learning rate to vary with set size. This challenges the common assumption that RL processes are relatively insensitive to cognitive load, but is consistent with prior findings that neural reward prediction error (RPE) signals, a hallmark of reinforcement learning, are attenuated under high cognitive load (Collins et al., 2017).

2. Inclusion of all domain results

We understand that the descriptions in the main paper may be too concise for readers without a background in cognitive science. Due to space constraints, we have included detailed descriptions of each of the four tasks in the Appendix and referenced them in the main text. Regarding choosing success stories to focus on: while the definition of a success case is subjective, it is our impression that all of the GeCCo-generated models offer interesting insights as outlined in detail above; it would be somewhat arbitrary to single out only one (or two) of them. Additionally, we believe that the consistency of performance across domains is itself a valuable and nontrivial outcome, particularly for a first-time demonstration of a general-purpose model generation pipeline.

Regarding differences between GeCCo-generated models and the respective baselines: the model insights we outlined above explain what GeCCo-generated models offer beyond the baselines. We will additionally create a difference table that will more explicitly show GeCCo-baseline model differences, including model parameters and model summaries for each of the domains. We will add this table to the Appendix and reference it in the main text.

3. Comparison to Li et al. (2024)

We appreciate the reviewer’s observation and agree that the Li et al. paper is highly relevant to our work. We already cited it in the manuscript, but we will revise the text to more clearly acknowledge the methodological similarities in the Introduction. Both approaches share an iterative structure involving model generation, evaluation, and refinement. We also agree that the general structure of our prompt is similar to that of Li et al. However, our prompt structure was naturally derived from the requirements involved in computational cognitive modeling (having a description of the experiment, the data, and some priors regarding viable model structures). We believe that the prompt similarities between our approaches are likely a result of the shared nature of the model generation task.

We agree that perhaps the most substantive difference lies in the domain focus. Cognitive modeling introduces unique challenges—such as behaviorally grounded evaluation metrics, domain knowledge, interpretability aligned with psychological theory, and domain-specific constraints on model plausibility—that are not guaranteed to be addressed by pipelines developed for general statistical modeling.

One important implementation difference lies in the feedback mechanism: in Li et al., the feedback is LLM-generated (via a separate critic model that provides natural language critiques), whereas in GeCCo, the feedback consists of the best-performing model code and a list of previously attempted parameter names, explicitly structured to avoid duplicate models and promote exploration. To clarify this connection, we will add a more detailed discussion of the similarities and differences with Li et al. in both the Introduction and Related Work sections of the paper.

4. GeCCo performance in a more complex setting using synthetic data

We liked the reviewer’s suggestion of using synthetic experiments to verify the scalability of GeCCo to more complex setups. To this end, we conducted additional experiments using synthetic data with systematically increasing complexity. Specifically, we simulated 100 reinforcement learning (RL) agents performing multi-armed bandit tasks with a varying number of options: 2, 4, 6, and 8 arms, each with 150 trials. While 2-armed and 4-armed bandits are common in the literature, tasks with 6 or 8 arms are rarely studied due to their increased complexity. We applied the GeCCo pipeline to each of these datasets and compared the resulting models to the ground-truth models used to generate the data. We found that the models discovered by GeCCo achieved comparable fit to the ground truth, and substantially outperformed a random guessing model:

  • 2-armed bandit task:

    • GeCCo BIC = 135.03 (SEM = 5.73)
    • Ground truth BIC = 131.00 (SEM = 5.63)
    • Random guessing BIC = 207.94
  • 4-armed bandit task:

    • GeCCo BIC = 350.42 (SEM = 7.39)
    • Ground truth BIC = 337.66 (SEM = 8.86)
    • Random guessing BIC = 415.88
  • 6-armed bandit task:

    • GeCCo BIC = 483.60 (SEM = 6.51)
    • Ground truth BIC = 467.30 (SEM = 9.84)
    • Random guessing BIC = 537.52
  • 8-armed bandit task:

    • GeCCo BIC = 551.50 (SEM = 10.50)
    • Ground truth BIC = 552.73 (SEM = 10.54)
    • Random guessing BIC = 623.83

These results demonstrate that GeCCo generalizes well to more complex task environments, and maintains strong performance even as the decision space increases substantially.
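For illustration, such synthetic agents could be generated along the following lines. This is a sketch assuming a standard softmax Q-learning generative model; the actual ground-truth models and parameter ranges used for the simulations are not specified here.

```python
import numpy as np

def simulate_agent(n_arms, n_trials=150, rng=None):
    """Simulate one softmax Q-learning agent on an n-armed bandit (illustrative)."""
    rng = rng or np.random.default_rng()
    reward_probs = rng.uniform(0.1, 0.9, size=n_arms)   # hidden arm payoffs
    alpha = rng.uniform(0.1, 0.9)                        # learning rate
    beta = rng.uniform(1.0, 10.0)                        # inverse temperature
    q = np.zeros(n_arms)
    choices, rewards = [], []
    for _ in range(n_trials):
        probs = np.exp(beta * q) / np.sum(np.exp(beta * q))
        choice = int(rng.choice(n_arms, p=probs))
        reward = float(rng.random() < reward_probs[choice])
        q[choice] += alpha * (reward - q[choice])        # delta-rule update
        choices.append(choice)
        rewards.append(reward)
    return np.array(choices), np.array(rewards), (alpha, beta)

# 100 agents per set size, mirroring the experiment described above
datasets = {n_arms: [simulate_agent(n_arms) for _ in range(100)]
            for n_arms in (2, 4, 6, 8)}
```

As a consistency check, the random-guessing BICs reported above match a zero-parameter model that assigns probability 1/(number of arms) to every choice: 2 × 150 × ln 2 ≈ 207.9 for the 2-armed task, and analogously for 4, 6, and 8 arms.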

Comment

I thank the authors for the thorough engagement with my review, and for pulling out the positive adjectives from my review! I don't see a reason to change the score, the paper is good but has certain limitations and the response didn't convince me otherwise.

Comment

Thank you for taking the time to provide your valuable comments!

Review (Rating: 5)

Computational cognitive models are often hand-crafted by researchers, a process that can be highly time-consuming. In this work, the authors develop a new pipeline (GeCCo) for the automated generation of computational cognitive models. The authors apply GeCCo in 4 domains of cognitive science and assess the role of each component of their pipeline, as well as the base models.

Strengths and Weaknesses

The paper is very well-motivated. The problem the authors take on, and the general method they develop, could have broad implications for cognitive science. The paper definitely hits the mark on originality.

Each step in the proposed pipeline was also generally well-motivated, and I appreciated that the authors conducted rigorous ablations into the role of each component.

Where the paper struggled, in some respects, is in the actual communication of the experimental findings -- particularly cross-model comparisons. The authors rely heavily on BIC as a measure; however, the actual scale of BIC is not communicated anywhere in the paper. It is clear in many cases that BIC "goes down" for the LLM-generated models, but it's hard to understand from the aggregate measures alone whether that is "significant" (beyond statistical significance). The error bars were hard to interpret as well (and seemed large in many instances). Is this SEM over seeds (the 5 runs?) or bootstrapped over participant data?

Moreover, in many instances, the BIC is actually not substantially going down? The role of the base models in producing good computational cognitive models is also highly sporadic. I did not walk away with a clear picture on the role of the base model in synthesis. The motivation suggested that the authors did care about base model, and of course -- as in good research! -- the results may be surprising (or negative, into the role of base models, as they seem to be here). However, I would want to see a bit more of an analysis into the role of base model. One possible way the authors could better understand the role of base model is by trying LLaMA 3.1 8b? Or at least one other model of the same class as one of the models they considered here? The choice of models seemed a bit scattered. For instance, we may expect that the results overall are worse when using 8b. But if they are not, then that would throw a bit of a wrench into GeCCo and the results would seem too sporadic; like it just matters that the models "happen" upon a good synthesized model. If possible, I'd encourage the authors to try 8b if they have not already.

There was also limited detail on the way the paired t-tests were computed. Paired over per-participant judgements?

Questions

I would especially like a clearer description of the scale of BIC and error bar computation (as noted above), and other statistical tests run.

The ablations and Fig 11 indicate that feedback was very important. It'd be great if the authors could dig into this more (I believe the figure is never referenced in the text?)

R1 beta spanned zero in the statistical models -- so can we really say it produced "better" models? I'd consider adjusting that sentence (or moving those analyses fully to the Supplement?)

Limitations

Yes

Final Justification

The paper is a substantive modeling advance. While I had initial concerns about the clarity and quality of the results, the authors assuaged many of my concerns in their rebuttal. I am hopeful that by having them clarify the prose -- this will make for a strong paper that I believe warrants acceptance.

Formatting Concerns

The paper was generally well-formatted.

Generally, my main writing comment is that metrics should be defined in the text to help the reader interpret (e.g., for data contamination as well).

Author Response

Dear Reviewer xLM6,

Thank you for your comments. We are happy that the reviewer found our paper to be ‘very well-motivated’ and that our paper hits the mark in terms of ‘originality’. In particular, we are happy that the reviewer shares our enthusiasm about GeCCo having ‘broad implications for cognitive science’. Below, we address the concerns and questions you have raised and how we addressed them. We believe that resolving them has further improved the clarity and contribution of our paper.

  1. Clarified the interpretation of BIC values
  2. Clarified the role of base models
  3. Clarified the performance metrics and statistical tests
  4. Clarified the role of feedback
  5. Clarified the claims regarding R1 performance

1. Interpretation of BIC values

We interpreted your comment as a suggestion to clarify the range and interpretation of BIC values. For each GeCCo-generated model, we used the average left-out sample BIC of the best-performing literature model as a benchmark for best-case fit, and a random guessing model as the lower bound. Based on this, the BIC ranges for the four domains are as follows (in the order of best literature model – random guessing):

  • Decision Making: [52.42 – 133.08]
  • Learning: [85.24 – 221.80]
  • Planning: [437.24 – 585.01]
  • Working Memory: [275.00 – 415.27]

Our primary goal was to interpret BIC scores in a relative sense, ensuring that the GeCCo models at least approximated the BIC scores of the best existing models. To further support these comparisons, we also conducted exceedance probability (EXP) analysis, which estimates the likelihood that a given model is the most frequent explanation of participants’ behavior (Stephan et al., 2009). The results are as follows:

  • Decision Making: GeCCo model EXP = 0.99, Literature model EXP = 0.01
  • Learning: GeCCo model EXP = 0.97, Literature model EXP = 0.03
  • Planning: GeCCo model EXP = 0.85, Literature model EXP = 0.15
  • Working Memory: GeCCo model EXP = 0.99, Literature model EXP = 0.01

Error bars in the main figures reflect variability across participants. We will update the figure captions to include BIC scores for the random guessing models and the explanation of how the SEMs were computed for greater clarity.

Regarding the relatively small differences in BIC values, we would like to emphasize that the literature models used as baselines for comparison are among the most well-established and validated cognitive models in their respective domains, developed over many years of research. Therefore, outperforming these models is understandably challenging. Nonetheless, GeCCo offers valuable insights by generating alternative, cognitively plausible models that match, and in some cases outperform, these strong baselines. We believe this ability to propose viable alternative explanations remains an important contribution.

2. The role of base models

To explore how different model features (size, model family, and reasoning capabilities) influence performance on the model generation task, we selected three models: Llama, Qwen, and R1 (distilled Llama). While Llama and Qwen come from two different model families and sizes (70B and 72B, respectively), R1-distilled-Llama-70B provides the means to examine whether reasoning capability contributes to the quality of the generated models. As per the reviewer's suggestion, we considered the smaller version of Llama with 8B parameters.

Specifically, we tested Llama-8B in the decision making domain, which presents the simplest model case, and found that the average BIC for the best model generated by Llama-8B was 89.58 (SEM over participants = 2.20), which is roughly twice as high as the BICs for models generated by R1-distilled-Llama-70B and Llama-70B (M = 39.30 ± 4.59), indicating comparatively weaker performance. This model effectively just picked an option with a higher sum over features scaled by estimated weights.

Critically, we would also like to bring to your attention that running Llama-8B requires providing additional constraints and information to the pipeline. Without this extra guidance, the model frequently failed to adhere to the required code structure or produce syntactically correct or executable code, which is a requirement for the GeCCo pipeline. In fact, when we attempted to run GeCCo with Llama-8B on more complex domains (e.g., planning or working memory), the generated code failed to compile (syntax errors) or did not follow the specified structure even with extra guidance.

3. Performance metrics and statistical tests

We thank the reviewer for suggesting these writing improvements. We computed BIC scores for each model on a per-participant basis, and then applied a paired-sample t-test to compare two models across participants, accounting for the within-subject design (i.e., each participant contributes one BIC score per model). We will include this information at the first mention of the t-test in the main text of the camera-ready paper.
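For concreteness, the comparison described above amounts to the following minimal sketch; `bic_geco` and `bic_baseline` are hypothetical per-participant BIC arrays with placeholder values.

```python
import numpy as np
from scipy.stats import ttest_rel

bic_geco = np.array([51.2, 48.7, 55.1, 60.3])      # placeholder per-participant BICs
bic_baseline = np.array([53.4, 50.1, 57.8, 61.0])  # placeholder per-participant BICs

t_stat, p_value = ttest_rel(bic_geco, bic_baseline)  # paired over participants
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```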

4. The role of feedback

We suspect that the reviewer interpreted the title of Figure 11 (A), which is currently “learning: full feedback condition”, as referring to the feedback provided to the LLM to improve performance. We apologize for the confusing title and would like to clarify that the results in Figure 11 demonstrate iterative improvements of GeCCo-generated models in two experiments: a two-armed bandit task with full feedback (Chambon et al. 2020) and multi-attribute decision making (Hilbig et al. 2014). To avoid further confusion, we will adjust the title of A to “two-armed bandit task with full feedback (Chambon et al. 2020)” and also refer to it in the main text of the camera-ready version.

Nevertheless, below we still provide our speculation regarding the role of feedback in GeCCo:

We think feedback was the strongest contributor to GeCCo's performance for three reasons: (1) it provides the best-performing model, selected based on BIC, as a reference. This is crucial as GeCCo can use it as a reference model and improve its predictive performance further; (2) it provides a list of parameter names used so far by GeCCo, whose size grows over iterations. GeCCo seems to be effectively making use of this information available in its context window to come up with new combinations that might successfully improve predictive performance; (3) the feedback also explicitly encourages exploration, which might prevent GeCCo from being stuck in a narrow space of models. Future work should conduct more targeted ablation studies to verify whether these are indeed the reasons why feedback plays such a crucial role in GeCCo.

5. R1 performance

We would appreciate it if the reviewer could clarify their question. Our current interpretation of the question is that it concerns the BIC differences between the baseline models and the R1 model, as presented in Figure 2B and Figure 4B. We would like to clarify that our claim regarding the superiority of the R1 model pertains specifically to the decision-making domain (Figure 2B), where the BIC difference is indeed statistically significant. In the case of planning (Figure 4B), we explicitly noted that while the average BIC of the R1-generated model is lower, the BIC differences were not statistically significant (although the exceedance probability favored the R1 model). Please let us know if we have misunderstood the question.

Comment

Dear Reviewer xLM6,

Thank you very much for your review. We have addressed most of your concerns in our rebuttal above and believe our responses sufficiently justify an adjustment to the score, or at least warrant further clarification if deemed insufficient. We are happy to perform extra analyses or answer follow-up questions if you have any remaining concerns. Please let us know.

Comment

Dear authors, thank you for your thorough review! I am impressed and you indeed have satisfied many of my outstanding concerns. I have decided to raise my review and will emphasize my support for the paper's acceptance in the Final Review.

Comment

We are glad that our rebuttal addressed all your concerns and are happy to hear that you will support our paper's acceptance. Thank you for taking the time to provide your valuable comments!

Review (Rating: 5)

This paper introduces the Guided generation of Computational Cognitive Models (GeCCo), a novel pipeline that leverages Large Language Models (LLMs) to automate the discovery of cognitive models. The framework prompts an LLM with task instructions, participant data, and a code template to generate candidate models in the form of Python code. These models are then fitted to data, and their predictive performance (measured by BIC) is used to provide iterative feedback to the LLM, guiding it toward better solutions. The authors test this approach across four distinct cognitive domains (decision-making, learning, planning, and memory), demonstrating that the LLM-generated models often match or outperform state-of-the-art, human-handcrafted models from the literature.

Strengths and Weaknesses

Strengths

  • Significance and Originality: The paper addresses a significant and challenging problem in cognitive science: the time-consuming and expertise-dependent nature of computational model development. The proposed GeCCo framework is an original and innovative approach to this problem, effectively demonstrating how modern AI can be used as a tool for scientific discovery. The work is a compelling example of AI-assisted science and is likely to have a strong impact on the field.

  • Quality and Rigor: The methodological quality of the paper is significantly higher than your average "we prompted some LLMs to do stuff" papers. The GeCCo pipeline is well-designed, and the iterative feedback loop is a sound mechanism for guiding model discovery. The authors rigorously test their framework across four canonical and diverse cognitive paradigms, showing its versatility. The inclusion of multiple control experiments is a good addition too. The prompt ablations, data contamination analysis, and ground-truth recovery simulations provide strong evidence for the pipeline's robustness and help isolate the key drivers of its success. The comparisons are made against strong, well-established baseline models from the literature as well as CENTAUR, a foundation model of cognition, which provides an estimate of explainable variance.

  • Clarity: The paper is very well-written, clear, and easy to follow. The introduction does an excellent job of motivating the problem and situating the work within the existing literature. The GeCCo framework is explained clearly, and the schematic in Figure 1 provides an intuitive overview of the entire process. Providing the full prompts in the appendix significantly enhances the paper's reproducibility.

Weaknesses and Areas for Improvement

While the paper is strong overall, there are several areas where clarification and more nuanced framing would strengthen it.

  • Clarity of "Democratization" Motive: The paper suggests that a primary motivation is to "democratize and scale cognitive model development". This is an interesting goal, but its meaning and scope are not fully clarified. Cognitive modeling is a highly specialized task, and it is unclear who the target audience for this democratization is. This point would benefit from a more concrete explanation of what a "democratized" modeling ecosystem would entail and why it is a necessary or desirable goal for the scientific community.

  • Distinction Between Models and Theories: The paper occasionally makes strong claims about LLMs broadening the space of possible "theories" and generating "conceptually plausible theories". While the work clearly demonstrates the ability to generate novel models with high predictive accuracy, a model with a good fit is not the same as a new scientific theory. Models don't need to be correct, they just need to be useful. But the word "theory" implies causal mechanisms that must be rigorously tested for correctness.

  • Interpretation of Statistical Significance: Some of the key comparisons, while promising, are not statistically conclusive. In the comparison against the CENTAUR foundation model (Table 1), the LLM-generated models do not show a statistically significant performance difference in two of the four domains (Decision-making, p=0.78; Planning, p=0.18). The claim that LLM-generated models "matched or exceeded" CENTAUR's performance across domains should be qualified to reflect this.

Questions

Q1: Does achieving a better fit (e.g., lower BIC score) equate to a better scientific theory? How does the GeCCo pipeline ensure that the generated models represent genuinely new, theoretically insightful cognitive hypotheses, rather than just being more complex or cleverly parameterized functions that find statistical regularities in the data without corresponding to a plausible cognitive process?

Q2: The prompt includes a Code template that often contains a baseline model from the literature. How can the authors be certain that the LLM is exploring a diverse model space, rather than being heavily anchored to the provided template? While the ablation study showed this was the 'weakest' predictor of performance, a weak effect is not a null effect.

Q3: The paper presents a data contamination analysis using LogProber and finds no evidence of leakage from the prompts. However, many of the discovered model components are well-established concepts in cognitive science (e.g., forgetting parameters, separate learning rates for positive/negative errors etc). How do the authors disentangle genuine, data-driven model discovery from the LLM simply reconstructing known modeling components from its vast pre-training data, which almost certainly includes the cognitive science literature?

Limitations

Yes.

Final Justification

The authors' response clarified the few issues and questions I had. Therefore, I maintain my score as Accept.

Formatting Concerns

No.

Author Response

Dear reviewer MxWn,

Thank you very much for your comments. We appreciate that you found our paper addresses a significant and challenging problem in cognitive science, and the GeCCo framework an original and innovative approach to this problem. We especially like that you believe our pipeline is a compelling example of AI-assisted science and is likely to have a strong impact on the field. Below, we discuss the specific concerns that you raised and how we have sought to address them. We believe that by addressing them we have further improved the clarity and contribution of our paper.

  1. Clarified the Democratization Motive
  2. Addressed the difference between models and theories
  3. Addressed the dependence of GeCCo-generated models on the template
  4. Clarified the GeCCo-CENTAUR comparison
  5. Clarified the BIC interpretation and model quality assurance
  6. Addressed data-driven model discovery vs LLM's pre-trained knowledge

1. Democratization Motive

Currently, cognitive modeling requires extensive familiarity with literature, proficiency in programming, and a reasonably strong mathematical background, thus restricting participation to a relatively small community of experts. GeCCo creates a more democratized ecosystem by automating a technically demanding exercise like cognitive modeling, thereby broadening participation beyond traditional cognitive modeling communities.

We envision the primary beneficiaries as:

  • Experimental psychologists, including graduate students or junior researchers, who have conceptual insights but lack coding or modeling expertise, or would benefit from additional modeling ideas with a quick implementation.
  • Interdisciplinary researchers (e.g., neuroscientists, economists, behavioral ecologists) seeking cognitive models for hypothesis testing without diverting extensive resources into specialized methodological training.
  • Educators and students, for whom GeCCo can serve as a practical learning tool, facilitating hands-on experience with computational cognitive modeling early in their training.

We believe this democratization is desirable for the scientific community because it directly accelerates cognitive research by (a) increasing the speed and volume of hypothesis testing, (b) enabling a more diverse range of researchers to contribute to theory development, and (c) encouraging interdisciplinary collaboration.

2. Difference between models and theories

We agree that predictive accuracy alone does not establish a model as a scientific theory, and the term “theory” carries stronger commitments to causal explanation, mechanistic grounding, and empirical validation. We will revise the manuscript to avoid overstating claims and use the term “model” rather than “theory” throughout, except where we are referring to future theoretical implications. Additionally, we will incorporate the reviewer’s point in the limitations section (certain parts used verbatim): “While the work demonstrates the ability to generate novel models with high predictive accuracy, a model with a good fit is not the same as a new scientific theory. To discover theories, future work must rigorously test the causal role of the cognitive mechanisms underlying the GeCCo models through carefully designed experiments.”

3. Dependence of GeCCo-generated models on the template

We agree that a weak effect of the template doesn’t imply a null effect. To quantify the diversity of GeCCo-generated models relative to the provided code template, we embedded each cognitive model into a high-dimensional semantic space using the all-MiniLM-L6-v2 model from SentenceTransformers and computed the cosine distance between each generated model and its corresponding template, across all iterations and runs. We then ran a one-sample t-test over (1 – cosine distance) for each cognitive domain to assess whether the distance was significantly greater than 0. Across all four domains, we observed a significant semantic divergence from the original template. For more complex domains, such as Planning and Working memory, we observed higher semantic divergence from the original template (Planning: M = 0.24 ± 0.01, t(35) = 39.79, p < .001; Working memory: M = 0.19 ± 0.005, t(35) = 38.56, p < .001). For simpler task domains, like Learning and Decision making, the divergence was smaller but still reliably nonzero (Decision making: M = 0.07 ± 0.002, t(149) = 32.18, p < .001; Learning: M = 0.16 ± 0.005, t(83) = 34.79, p < .001), possibly reflecting either task simplicity or implicit instruction-following by the model. These results suggest that GeCCo explores a diverse space of models, particularly when task complexity demands richer solutions.
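For reference, the divergence measure can be computed along these lines; the model strings below are placeholders, and only the embedding model name is taken from the analysis above.

```python
# Sketch of the template-divergence measure; the two code strings are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

template_code = "def model(params, data): ..."    # provided code template (placeholder)
generated_code = "def model(params, data): ..."   # a GeCCo-generated model (placeholder)

embeddings = encoder.encode([template_code, generated_code], convert_to_tensor=True)
cosine_similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
cosine_distance = 1.0 - cosine_similarity         # divergence from the template
print(f"cosine distance from template: {cosine_distance:.3f}")
```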

We also conducted 2 additional experiments to further explore the effect of the template:

  1. Auto-generated code template: We modified the GeCCo pipeline to allow the LLM (LLaMA-3.1 70B) to generate an initial code template using a more detailed natural language description of the function specification. We applied this version to two cognitive domains (learning, decision making) and found that the best model discovered in each case, in terms of BIC, closely matched the performance of the models from the original pipeline using hard-coded templates (learning: M = 77.89 ± 5.57 vs. 79.77 ± 6.29, t(15) = 0.90, p = 0.384; decision-making: M = 39.30 ± 4.59 vs. 39.30 ± 4.59, t(52) = -1.29, p = 0.202).

  2. Template-free reasoning model: We also ran the R1 reasoning model without any code template at all, providing only formatting constraints and the task description in natural language. On the decision-making task, GeCCo produced a model with comparable performance to the baseline model from literature (M = 53.7 ± 5.56; t(52) = 0.48, p = 0.31), but worse compared to the GeCCo-generated model with a template (t(52) = 3.85, p = 0.0001).

The results suggest that in future iterations of GeCCo we could replace the fixed template with an auto-generated one, and even loosen constraints on the model structure and formatting in prompts.

4. GeCCo-CENTAUR comparison

To clarify, the goal of this analysis was not to show that GeCCo outperforms CENTAUR, but to estimate how much explainable variance in behavior each model captures. We treat CENTAUR's predictive performance as a reasonable proxy for the total explainable variance; see the control experiments section for details.

We acknowledge that the two-sided t-test we used only tests for a mean difference in negative log-likelihood (NLL) between GeCCo and CENTAUR, not for equivalence, so it can't determine whether GeCCo matches or exceeds CENTAUR. To address this, we computed the Bayes factor (BF) to quantify evidence for or against a performance difference:

  • Memory: log10(BF) = 3.52. Strong evidence for a difference; since CENTAUR NLL is higher, GeCCo is decisively better.
  • Learning: log10(BF) = 8.96. Strong evidence for a difference; CENTAUR NLL is once again higher, so GeCCo is decisively better.
  • Decision making: log10(BF) = -0.81. Moderate evidence for no difference; GeCCo NLL is marginally lower, but the BF points to no clear difference between GeCCo and CENTAUR.
  • Planning: log10(BF) = -0.23. Weak evidence for no difference; CENTAUR NLL is lower, but the evidence is weak/inconclusive.

These results support our statement that GeCCo generates cognitive models that match or outperform CENTAUR in predictive accuracy.

5. BIC interpretation and quality assurance

To ensure that the GeCCo pipeline does not generate theoretically invalid, overly complex, or overfitted models that merely capitalize on statistical regularities, we employed a multi-faceted evaluation strategy:

  • BIC Evaluation: We used BIC to penalize overly complex models. In all domains tested, GeCCo-generated models had fewer parameters (N) than the best-performing baseline models:
    • Decision Making: GeCCo N=2, baseline N=5
    • Learning: GeCCo N=5, baseline N=6
    • Planning: GeCCo N=3, baseline N=7
    • Working Memory: GeCCo N=5, baseline N=7
  • We tested model performance on held-out data to reduce the chance that results reflect dataset-specific statistical regularities.
  • Posterior Predictive Checks: We simulated model behavior in contexts specifically aligned with the experimental designs used to probe the target cognitive processes. This helps rule out overfitting to incidental data features and ensures the models generalize in cognitively meaningful ways.
  • Plausibility checks: Experts in cognitive modeling reviewed each GeCCo-generated model to ensure theoretical soundness, including the use of meaningful cognitive parameters in processes like forgetting, value updating, and exploration.

Regarding genuine novelty, most GeCCo-generated models propose novel combinations of well-established cognitive mechanisms that, when integrated, yield strong generative performance and data fit, and as such are scientifically valuable (see “Atypical Combinations and Scientific Impact” by Uzzi et al., 2013).

6. Data-driven discovery vs LLM pre-trained knowledge

Although our ablation experiments confirmed that participant data significantly influences the performance of GeCCo-generated models (β = -12.84, p < 0.05; see the ablation study in Section 7), it is challenging to precisely quantify the extent to which GeCCo-generated models reflect genuine data-driven discovery versus leveraging pre-trained conceptual knowledge. Disentangling their contributions remains an open and important future direction. One promising direction for future work—not yet explored in our current manuscript—would be to perform RL-based fine-tuning of LLMs directly on participant-level behavioral data, thereby explicitly encouraging cognitive model generation driven by empirical observations while still building on pre-trained conceptual knowledge. We will include this point in the discussion section of the camera-ready version.

Comment

Thank you for your response, this clarifies and resolves all the issues I had.

  • "... perform RL-based fine-tuning of LLMs directly on participant-level behavioral data, thereby explicitly encouraging cognitive model generation driven by empirical observations while still building on pre-trained conceptual knowledge." This is quite an interesting proposition. At that point, the pre-trained model would act as a prior. For instance, a certain participant's decision-making behaviour (or even the same participant at a different time) can be explained better via hyperbolic discounting whereas other via exponential. If fine-tuning can essentially learn to do this personalized inference online just from the context window, this would speed up cognitive personalisation.
Comment

We are glad that our rebuttal answered all your questions. We agree with your intuition and consider RL-based finetuning for cognitive model discovery to be a promising future direction. Thank you!

Comment

Dear Reviewers,

The responses from the authors are available. Please read them to see if you have any further question.

Thank you for your support.

AC.

Final Decision

This paper dives into the automatic discovery of computational cognitive models, which were previously hand-crafted by researchers. An LLM-based pipeline with iterative feedback is developed to achieve this. The experiments are conducted on four scenarios: 1. Decision Making, 2. Learning, 3. Planning, 4. Working Memory. The authors find the discovered models have good performance.

Strengths

  1. The LLM-based framework is novel in the domain of computational cognitive modeling. The inclusion of task instructions, participant data, and a code template is well-motivated. The iterative feedback loop is sound for getting better models.

  2. The studied problem of automating computational cognitive modeling is important, having implications for cognitive science.

  3. The tests are conducted on four scenarios, showing good performance compared with human-handcrafted models and the framework's ability to find interpretable models.

Weaknesses

  1. The technical novelty is somewhat limited from the methodological perspective, as there are already methods proposed for scientific discovery that share a similar spirit. Although the authors summarize the differences with the Li et al. paper, it would be better to include a broader discussion of LLM-based methods used in scientific discovery.

  2. This paper does not clearly distinguish between models and theories. The terms should be used correctly.

  3. Only BIC is used as the metric for fit to human data in the original paper. The authors added other metrics in the rebuttal.

  4. The testing scenarios are somewhat simple, and the performance in a few cases is not very significant.

All four reviewers give an acceptance recommendation to this paper. In summary, although this paper still has some limitations, this is a good paper, particularly for the computational cognitive science domain.

Most of the reviewers think the author feedback is satisfactory and maintain their positive ratings.