PaperHub

Rating: 8.2 / 10 · Spotlight · 4 reviewers
Scores: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.3
Novelty: 3.3 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning

OpenReview · PDF
Submitted: 2025-05-07 · Updated: 2025-10-29

Abstract

Keywords
Personalized Decision Making · Symbolic Regression · LLM

Reviews and Discussion

Official Review
Rating: 5

This work proposes a neural-symbolic approach to personalized decision making. It employs a two-stage procedure:

  • Group-level symbolic utility discovery, a procedure that iteratively identifies a symbolic expression for utility prediction. This procedure terminates when a convergence criterion (defined as the best-worst optimality gap) is met.
  • Individual-level semantic adaptation, which iteratively updates the semantic template via a TextGrad-esque procedure to optimize for individual utility losses.

In sum, the first stage amortizes (or warm-starts) symbolic expressions per group, and then the second stage adapts this expression to individuals. Experimental results show impressive gains over purely LLM-based and predictive model baselines.
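The two-stage procedure summarized above can be sketched in miniature. Everything below (the candidate library standing in for LLM proposals, the toy data, the decision threshold, the bias-nudging adaptation) is hypothetical illustration, not the authors' implementation:

```python
import random

random.seed(0)

# Toy group data: (travel_time, fare) -> chose_train (1/0).
group = [((10, 2), 1), ((60, 1), 0), ((15, 3), 1), ((50, 2), 0)]

# A tiny stand-in for the LLM proposer: Stage 1 samples K candidate
# symbolic utilities per iteration and keeps the best-scoring one.
CANDIDATES = [
    ("-time", lambda t, c: -t),
    ("-cost", lambda t, c: -c),
    ("-time - 5*cost", lambda t, c: -t - 5 * c),
]

def accuracy(util, data):
    # Predict "choose train" when utility clears a fixed threshold.
    return sum((util(*x) > -30) == bool(y) for x, y in data) / len(data)

def stage1(data, T=10, K=2, delta=0.01):
    """Group-level symbolic discovery: sample K candidates per iteration,
    stopping early if the best-worst optimality gap falls below delta."""
    best_acc, best_name = -1.0, None
    for _ in range(T):
        batch = random.sample(CANDIDATES, K)
        scored = sorted((accuracy(f, data), name) for name, f in batch)
        if scored[-1][0] > best_acc:
            best_acc, best_name = scored[-1]
        if scored[-1][0] - scored[0][0] < delta:  # convergence criterion
            break
    return best_name, best_acc

def stage2(util, person, T_prime=5):
    """Individual-level adaptation: nudge a bias term to fit one person
    (a crude numeric stand-in for TextGrad-style semantic refinement)."""
    bias = 0.0
    for _ in range(T_prime):
        pred = util(*person[0]) + bias > -30
        if pred == bool(person[1]):
            break
        bias += 5.0 if person[1] else -5.0
    return bias

name, acc = stage1(group)
print("Stage 1 best:", name, "accuracy:", acc)
bias = stage2(dict(CANDIDATES)[name], ((55, 2), 0))
print("Stage 2 bias for one individual:", bias)
```

The point of the sketch is the division of labor: Stage 1 searches a shared expression space at the group level, Stage 2 makes cheap per-individual corrections on top of it.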

Strengths and Weaknesses

Strengths

  • Quality and Clarity: The writing is generally in good shape. The authors did a good job of presenting preliminary concepts, mathematical formalism, and their LLM implementations in detail. I appreciate the ablation study, which shows that both components of the authors' approach are integral to the final performance.
  • Significance: As LLMs are increasingly embedded into daily use, it is important to assess their capabilities and limitations. The authors have shown that LLMs are not yet capable as an end-to-end system for personalized decision making, and have proposed means to improve their performance. I believe this is an important area for future research.
  • Originality: I'm a fan of neural-symbolic approaches wherein an LLM functions as a "central hub" to process and reason with textual / symbolic information, while delegating more quantitative operations to specialized packages. The authors have demonstrated a good application of this approach in personalized regression.

Weaknesses

  • Extensive usage of LLMs is stated as a limitation of the approach, and it would be good to report the token consumption (or monetary cost) associated with each run.
  • There are quite a few hyperparameters involved in this approach, such as the optimality gap δ and the numbers of iterations T and T′. I could not find these values in the main text or the appendix. It would be good for the authors to report them, as they are crucial to understanding the overall complexity of this approach.
  • The literature review section is overall well done, but the subsection on "LLM-based Decision-Making Models" could have done a better job of interfacing with works that specifically address decision making via LLMs (right now it focuses more on general reasoning and instruction-following capabilities). For example, [1][2] study decision making under uncertainty with LLMs, and [3][4][5] study applications in supply chain optimization, personalized oncology, and autonomous driving, respectively.
  • A benefit of symbolic regression is interpretability. It would be good to discuss some of the discovered expressions and explain why they make sense.

[1] DeLLMa: Decision Making Under Uncertainty with Large Language Models

[2] STRUX: An LLM for Decision-Making with Structured Explanations

[3] Large Language Models for Supply Chain Optimization

[4] Leveraging Large Language Models for Decision Support in Personalized Oncology

[5] A Language Agent for Autonomous Driving

Questions

See weaknesses, and also:

  • I'm not entirely familiar with the Swissmetro and the Vaccine datasets, but the authors have elected to use simplified versions of them for budget control. Could the authors discuss whether these simplified variants maintain the core challenge of this problem, and what the normal problem instances would look like?
  • I'd be curious to see the scaling behaviors of this approach. Say you increase the number of iterations for both stages, would you get better results?

Limitations

Yes

Final Justification

I thank the authors for their detailed response. I have also reviewed their responses to other reviewers and adjusted my score accordingly.

Formatting Issues

The paper is appropriately formatted.

Author Response

We thank you for your time and effort in reviewing our paper! We find your suggestions very helpful and we hereby address your questions:

W1: Complexity, inference time, and token / cost footprint

We sincerely thank the reviewer for highlighting this point. We performed an additional experiment tracking the number of tokens consumed and the estimated cost, reported below.

All runs use GPT-4o-mini (128k context; $0.15 / 1M input tokens, $0.60 / 1M output tokens).
Costs below list input → output fees. Pricing source: OpenAI docs (July 2025).

Stage 1: Symbolic search

| Dataset | T (iterations) | Total tok. | Wall-time (min) | Cost (USD; input → output) |
|---|---|---|---|---|
| Swissmetro | 5 | 19,871 | 3.7 | 0.003 → 0.012 |
| Swissmetro | 15 | 93,256 | 15.9 | 0.014 → 0.056 |
| Swissmetro | 30 | 548,811 | 73.8 | 0.082 → 0.329 |
| Vaccine | 5 | 15,000 | 3.0 | 0.002 → 0.009 |
| Vaccine | 15 | 70,000 | 12.5 | 0.011 → 0.042 |
| Vaccine | 30 | 410,692 | 57.5 | 0.062 → 0.246 |

Stage 2: TextGrad refinement

| Dataset | T′ (iterations) | Total tok. | Wall-time (min) | Cost (USD; input → output) |
|---|---|---|---|---|
| Swissmetro | 1 | 1,197 | 1.39 | 0.000 → 0.001 |
| Swissmetro | 3 | 4,074 | 3.55 | 0.001 → 0.002 |
| Swissmetro | 5 | 6,956 | 5.81 | 0.001 → 0.004 |
| Vaccine | 1 | 962 | 0.22 | 0.000 → 0.001 |
| Vaccine | 3 | 3,101 | 2.34 | 0.000 → 0.002 |
| Vaccine | 5 | 5,536 | 4.71 | 0.001 → 0.003 |

Take-aways

  • Stage 1 dominates both token and time budgets; Stage 2 is negligible (< 7k tokens, < 6 min).
  • Even the heaviest configuration (30 + 5 iterations) costs < $0.35 per run, which is very affordable.
  • Token growth matches our theoretical O(KT + NT′) prediction; adding more threads lets us parallelize Stage 1 to reduce wall-time.
  • Since LLM outputs are not fully stable and can occasionally drift from instructions (e.g., formatting glitches), a budget for automatic retries should be considered; the table above is for reference only, and actual costs fluctuate and are model-dependent.
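The per-run cost arithmetic behind these tables is straightforward at the quoted GPT-4o-mini rates ($0.15 per 1M input tokens, $0.60 per 1M output tokens); the input/output token split in the example below is hypothetical:

```python
# USD cost of one run at the quoted GPT-4o-mini per-1M-token rates.
IN_RATE, OUT_RATE = 0.15 / 1_000_000, 0.60 / 1_000_000

def run_cost(input_tokens, output_tokens):
    """Total USD cost of a run given its input/output token counts."""
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

# Hypothetical example: 500k input tokens + 50k output tokens.
print(round(run_cost(500_000, 50_000), 3))  # → 0.105
```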

W2: Value range of hyperparameters

Thank you for pointing out that several run-time hyper-parameters were not explicitly reported. We have added the following table for clarity:

| Symbol | Description | Value |
|---|---|---|
| LLMs used | The large language models used for both the symbolic discovery and semantic adaptation stages of ATHENA. | gpt-4o-mini-2024-07-18 and gemini-2.0-flash |
| T | Number of iterations for the group-level symbolic utility discovery process, as shown in the accuracy trajectory analysis. | 30 |
| T′ | Maximum number of iterations for the individual-level semantic adaptation. | 5 |
| K | Number of candidate symbolic utility functions sampled in each group-level iteration. | 10 |
| δ | Predefined convergence threshold for the group-level discovery process. | / |

W3: LLM-based Decision-Making Models

Thank you for drawing our attention to the recent work that applies LLMs directly to decision making under uncertainty! We have carefully read the papers you mentioned (e.g., DeLLMa [1], STRUX [2], and the follow-up studies on supply-chain optimization, personalized medicine, and autonomous driving [3–5]) and found them extremely insightful. We will incorporate these contributions into the final version, expanding the "LLM-based Decision-Making Models" subsection to include those works and discuss the complementary design choices each line of work makes. We greatly appreciate the pointer and will keep a close eye on the continued progress of these research groups.

W4: Qualitative Analysis for Formulas in Main Text

Thanks for your suggestion to enhance the qualitative analysis of the formulas presented in the main text! We have discussed some representative examples in the rebuttal section for Reviewer HnWo.

Q1: Dataset, Subset Selection Concern

We thank the reviewer for raising this insightful question. We only down-sampled via stratified sampling; every feature and label is retained, so the joint distribution matches the original datasets.
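Stratified down-sampling of this kind can be sketched as follows; the labels, rates, and record layout here are hypothetical, not the authors' pipeline:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical population: 100 "train" choosers, 300 "car" choosers.
population = [{"id": i, "label": "train" if i % 4 == 0 else "car"}
              for i in range(400)]

def stratified_sample(rows, frac):
    """Sample each label group at the same rate, preserving label proportions."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)
    sample = []
    for group in by_label.values():
        sample += random.sample(group, round(len(group) * frac))
    return sample

sub = stratified_sample(population, 0.25)
print(Counter(r["label"] for r in sub))  # → Counter({'car': 75, 'train': 25})
```

Because each stratum is sampled at the same rate, label proportions (and, in expectation, the joint feature distribution within strata) match the full data.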

Q2: Scaling Behavior

Stage 1 – Group-level symbolic discovery
The scalability of Stage 1 is shown in Table 2 of our paper: as we extend the symbolic-search loop from 0 → 30 iterations, performance steadily improves, but it eventually reaches a plateau beyond which further iterations do not yield significant gains.

Stage 2 – Individual-level semantic adaptation

Experimental setup. We selected 5 individuals from each dataset and ran Stage 2 for 1, 5, 10, and 15 iterations; the results are summarized in the tables below. All scaling runs use gpt-4o-mini (July 2025 release) via the OpenAI API as the base model for Stage 2. For each experiment, T (Stage 1) is set to 30.

Swissmetro (5 individuals)

| T′ | Accuracy | F1 | AUC |
|---|---|---|---|
| 1 | 0.90 | 0.3111 | 0.50 |
| 5 | 0.90 | 0.3111 | 0.50 |
| 10 | 0.90 | 0.3111 | 0.50 |
| 15 | 1.00 | 0.3333 | 0.50 |

Vaccine (5 individuals)

| T′ | Accuracy | F1 | AUC |
|---|---|---|---|
| 1 | 0.40 | 0.2000 | 0.50 |
| 5 | 0.60 | 0.3111 | 0.50 |
| 10 | 0.50 | 0.1778 | 0.50 |
| 15 | 0.60 | 0.2444 | 0.50 |

Key observations

  1. Fast saturation. For Swissmetro, metrics stabilise after 5 iterations, with only a slight bump at 15. Vaccine peaks at 5–15 iterations, with no monotonic gains thereafter.
  2. Modest Stage‑2 gain. The largest improvement over the 1‑iteration baseline is +0.10 accuracy, confirming that Stage 1 contributes the bulk of predictive power.
  3. Practical default. Capping Stage 2 at 5 iterations captures > 95 % of the attainable benefit while keeping extra cost (< 7 k tokens) negligible.
Comment

We would like to briefly follow up on your comment regarding formula interpretability. In case our previous response did not fully address your concerns, we will further expand the qualitative discussion. Below are a few additional representative cases from the Swissmetro and Vaccine datasets that we plan to include in the final version.

  • Representative Example 2 — Swissmetro Dataset

| Mode | Discovered symbolic utility |
|---|---|
| Train | K * purpose + K * abs(payer_type * C) - K * first_class + K * abs(luggage) * sqrt(abs(age + C)) + K * abs(train_time + C) + K * log(income + C) - C |
| Car | K * abs(car_time + C) + K * abs(car_time - train_time + C) - K * abs(car_cost + train_cost + C) + K * abs(headway) * sqrt(income + C) + C |
| Metro | K * abs(metro_time + C) + K * abs(metro_cost + C) + K * sqrt(abs(age + C)) + K * log(exp(income + C) + C) - C |

Feature: male travelers < 24 yrs, annual income 50–100k.
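Such an expression can be evaluated numerically once the K (weight) and C (offset) placeholders are fitted. The values below are hypothetical stand-ins for illustration, not the authors' fitted constants:

```python
import math

# Hypothetical fitted values, one per K/C slot in the Metro utility above.
K = (-0.05, -0.3, 0.4, 0.2)
C = (1.0, 1.0, 0.5, 0.1)

def metro_utility(metro_time, metro_cost, age, income):
    """Metro utility from the table, with illustrative K/C values."""
    return (K[0] * abs(metro_time + C[0])
            + K[1] * abs(metro_cost + C[1])
            + K[2] * math.sqrt(abs(age + C[2]))
            + K[3] * math.log(math.exp(income + C[3]) + C[3])
            - C[3])

u = metro_utility(metro_time=25, metro_cost=3, age=22, income=1.5)
print(round(u, 3))
```

With negative weights on time and cost and positive weights on the age and income terms, higher travel time or fare lowers the utility, as the discussion below expects.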

  • Key take-aways for domain experts

| Insight | Interpretation & policy relevance |
|---|---|
| Time still trumps money | All modes include travel time, but only Car and Metro include fare. For under-25 travelers, an extra minute on the road matters more than an extra franc in fare, so speeding up transfers or giving signal priority is especially worthwhile [1]. |
| Headway frustration fuels car use | K * headway * sqrt(income) sits inside the Car utility: longer gaps between trains make driving relatively more attractive, and the square-root income interaction means irritation increases modestly with disposable income. Thus, high-frequency rail service directly counteracts the tendency of young people to switch to cars [2]. |
| First-class indifference | The explicit penalty K*first_class implies this cohort rarely values upgrades; they are price-motivated, not comfort-motivated. Bundling Wi-Fi or gaming lounges into standard class may convert more riders than pushing premium seats [3]. |

[1] Frank, L., Bradley, M., Kavage, S., et al. (2008). Urban form, travel time, and cost relationships with tour complexity and mode choice. Transportation, 35(1), 37–54.

[2] Liao, Y., Gil, et al. (2020). Disparities in travel times between car and transit: Spatiotemporal patterns in cities. Scientific Reports, 10(1), 4056.

[3] Marques, J., Gomes, et al. (2025). Generation Z and Travel Motivations: The Impact of Age, Gender, and Residence. Tourism and Hospitality, 6(2), 82.

  • Representative Example 3 — Vaccine Dataset

| Mode | Discovered symbolic utility |
|---|---|
| Unvaccinated | K * sqrt(covid_threat * (risk_of_covid_greater_than_vax + (sqrt(age) * gender + C))) * ((trust_government * trust_science)**2 + C) + K * more_attention_to_vax_info - K * (income_below_median * have_university_degree * (trust_government * trust_science)) + C |
| Vaccinated (no booster) | K * (vaccine_safe_to_me + (trust_government * trust_science * sqrt(sqrt(age + C) + income_unknown + C)) + C) + K * (more_attention_to_vax_info / (less_attention_to_vax_info + C)) + C |
| Booster | K * (have_covid_sick_family_member + (physician * (trust_government * trust_science * (sqrt(age + C) + C))) * (nurse * (trust_science * sqrt(age + C))) + C) + C |

  • Key take-aways for domain experts

| Insight | Interpretation & policy relevance |
|---|---|
| Information attention as a lever | Positive coefficients on more_attention_to_vax_info show that increasing engagement with vaccine information consistently raises utility across choices. Interactive, accessible information campaigns remain a cornerstone intervention [4]. |
| Nonlinear trust amplification | The factor (trust_government * trust_science)**2 in the unvaccinated utility underscores a super-additive effect of institutional trust: small increases in both trust dimensions can disproportionately raise vaccine consideration among lower-income, middle-aged adults. Building dual-track trust campaigns could be highly effective [5]. |
| Education buffers income hesitancy | The negative -K * income_below_median term is attenuated by the interaction have_university_degree * trust_government * trust_science, indicating that higher education can offset income-related hesitancy when paired with institutional trust. Policy interventions might focus on educational outreach leveraging trusted science communicators [6]. |

[4] Glanz, Jason M., Nicole M. Wagner, Komal J. Narwaney, et al. "Web-based social media intervention to increase vaccine acceptance: a randomized controlled trial." Pediatrics 140, no. 6 (2017): e20171117.

[5] Lohmann, Sophie, and Dolores Albarracín. "Trust in the public health system as a source of information on vaccination matters most when environments are supportive." Vaccine 40, no. 33 (2022): 4693-4699.

[6] Bajos, Nathalie, Alexis Spire, et al. "When lack of trust in the government and in scientists reinforces social inequalities in vaccination against COVID-19." Frontiers in public health 10 (2022): 908152.

Comment

Dear Authors,

I have indeed seen the interpretability part from your response to other reviewers, and I'm satisfied with the content. Good luck with your rebuttal :-)

Comment

Dear Reviewer zWuZ,

Thank you very much for your constructive feedback and for highlighting several important recent works on LLM-based decision making. The papers you brought to our attention are highly relevant, and we greatly appreciate you pointing them out—they have deepened our perspective on the field and will be discussed in our revised manuscript.

We are also grateful for your engagement throughout the process, including the discussion on the interpretability analysis and scaling behaviors. Your thoughtful comments have clearly strengthened our work!

If you have any further questions or suggestions, we would be happy to continue the discussion. Thank you again for your valuable support!!

Best regards, The Authors

Official Review
Rating: 5

The authors introduce ATHENA, which integrates group-level symbolic regression of utility functions with individual-level, LLM-powered semantic modeling to offer a more comprehensive and personalized view of decision-making behavior. The key innovation of this approach is the integration of symbolic utility modeling with LLM-powered semantic reasoning to create more accurate and interpretable personalized decision models. ATHENA is evaluated on two decision-making tasks, travel mode choice and vaccine uptake, and results show that it outperforms traditional utility-based models, machine learning approaches, and other LLM-based models.

Strengths and Weaknesses

The presented idea would inspire other researchers in the field. While the two components, symbolic utility modeling and LLM-based semantic adaptation, both exist separately, their integration for personalized decision making is novel. This focus on individual-level adaptation aligns with the ever-increasing interest in personalized AI systems that can account for individual differences.

But the testing of ATHENA appears limited in scope because only two lightweight models were used: GPT-4o-mini and Gemini-2.0-flash. While these models are practical choices, the experiments section would be stronger if the authors explained the rationale behind choosing these specific models. For example, GPT-4o-mini is considered strong at reasoning tasks, so it would be valuable to compare it against models known for different strengths. Testing with only two models provides a limited view of how ATHENA's performance varies across different types of LLMs. Testing with at least one large model would help show how model size affects performance, and whether the additional computational cost of larger models is justified by improved results.

Questions

What was the reason for testing with GPT-4o-mini and Gemini-2.0-flash? Was there a hard computational constraint on using larger models? What other models were considered for testing?

Limitations

Yes

Final Justification

I thank the authors for clarifying their model choices. I adjusted my rating from 4 to 5.

Formatting Issues

None

Author Response

We thank you for your time and effort in reviewing our paper! We appreciate your noting that "integrating symbolic utility modeling with LLM-powered semantic adaptation for personalized decision-making is novel and will inspire future work." Your recognition of ATHENA's ability to deliver accurate, interpretable, individual-level utility models is highly encouraging.

We are also gratified that the other reviewers independently highlighted the same strengths. Reviewer F7TL praised the “clear motivation, technically sound two-stage design” as well as our visualizations and writing clarity. Reviewer HnWo called the approach “very interesting and original,” commending the sound methodology and thorough ablations. Reviewer zWuZ valued the detailed formalism, the importance of combining LLMs with symbolic methods, and the originality of our neural-symbolic framework.

Together, the reviewers’ comments affirm that ATHENA makes a timely and impactful contribution to human-centric, interpretable personalized decision modeling.

Regarding your concerns, we added the experiments below:

  • End-to-end models: BERT, TabNet, MLP.
  • Additional LLMs: Qwen3-32b, DeepSeek-R1-Distill-Qwen-32B, GPT-4o.

Results: Please see the table below. All new runs use a randomly selected subset (100 individuals) of the dataset described in §4.1.

| Category | Method / Setting | LLM Model | Swiss Acc | Swiss F1 | Swiss CE | Swiss AUC | Vaccine Acc | Vaccine F1 | Vaccine CE | Vaccine AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| LLM-Based | Zeroshot | gemini-2.0-flash | 0.5800 | 0.3046 | 0.9059 | 0.6829 | 0.6000 | 0.5386 | 0.8317 | 0.7433 |
| | | GPT-4o-mini | 0.6100 | 0.2763 | 0.9253 | 0.5556 | 0.5500 | 0.5302 | 0.8271 | 0.7500 |
| | | GPT-4o | 0.5900 | 0.3310 | 0.8646 | 0.6946 | 0.6000 | 0.5465 | 0.8052 | 0.7306 |
| | | Qwen3 | 0.5900 | 0.4158 | 1.4047 | 0.6126 | 0.5400 | 0.5729 | 0.8676 | 0.7519 |
| | | DeepSeek-r1 | 0.5900 | 0.4339 | 0.6608 | 1.2473 | 0.6300 | 0.6531 | 0.8244 | 0.7692 |
| | Zeroshot-CoT | gemini-2.0-flash | 0.5300 | 0.2809 | 0.9858 | 0.6409 | 0.6100 | 0.5485 | 0.8128 | 0.7604 |
| | | GPT-4o-mini | 0.6100 | 0.2763 | 0.9109 | 0.6156 | 0.5900 | 0.5820 | 0.8161 | 0.7714 |
| | | GPT-4o | 0.5800 | 0.3162 | 0.9237 | 0.6404 | 0.6300 | 0.5717 | 0.7785 | 0.7700 |
| | | Qwen3 | 0.6200 | 0.4632 | 1.7101 | 0.6284 | 0.5400 | 0.5843 | 0.9073 | 0.7364 |
| | | DeepSeek-r1 | 0.5800 | 0.3018 | 0.9522 | 0.6173 | 0.5200 | 0.5492 | 0.8761 | 0.7373 |
| | Few-shot | gemini-2.0-flash | 0.7200 | 0.6922 | 10.0922 | 0.7984 | 0.5300 | 0.5508 | 14.2531 | 0.6747 |
| | | GPT-4o-mini | 0.6456 | 0.5085 | 8.0727 | 0.7529 | 0.5200 | 0.5278 | 7.7066 | 0.6975 |
| | | GPT-4o | 0.7300 | 0.6654 | 3.9386 | 0.8423 | 0.5700 | 0.5967 | 5.6261 | 0.7467 |
| | | Qwen3 | 0.7400 | 0.6760 | 7.0435 | 0.8033 | 0.5400 | 0.5487 | 9.5013 | 0.7060 |
| | | DeepSeek-r1 | 0.7000 | 0.6282 | 3.6154 | 0.8439 | 0.5500 | 0.5392 | 8.8895 | 0.6855 |
| | TextGrad | gemini-2.0-flash | 0.5400 | 0.2432 | 1.1934 | 0.4718 | 0.5000 | 0.4511 | 4.1290 | 0.7345 |
| | | GPT-4o-mini | 0.5700 | 0.3111 | 0.9551 | 0.5292 | 0.5000 | 0.4686 | 4.7960 | 0.6460 |
| | | GPT-4o | 0.5600 | 0.3316 | 0.9441 | 0.6256 | 0.5600 | 0.5321 | 2.3468 | 0.6721 |
| | | Qwen3 | - | - | - | - | - | - | - | - |
| | | DeepSeek-r1 | 0.5800 | 0.3356 | 0.9631 | 0.6344 | 0.4300 | 0.4235 | 3.1080 | 0.5948 |
| | ATHENA (ours) | gemini-2.0-flash | 0.7850 | 0.7219 | 0.8100 | 0.9063 | 0.6500 | 0.5978 | 0.8305 | 0.7998 |
| | | GPT-4o-mini | 0.7467 | 0.7097 | 1.0849 | 0.8624 | 0.6500 | 0.6079 | 0.8034 | 0.8133 |
| | | GPT-4o | 0.7650 | 0.7121 | 0.8645 | 0.8760 | 0.6700 | 0.6213 | 0.7765 | 0.8279 |
| | | Qwen3 | 0.7451 | 0.7158 | 0.8211 | 0.7526 | 0.5929 | 0.5885 | 1.1659 | 0.7777 |
| | | DeepSeek-r1 | 0.7450 | 0.6759 | 0.8078 | 0.8461 | 0.6600 | 0.6501 | 0.8115 | 0.8212 |
| Utility-Based | MNL | / | 0.6101 | 0.3887 | 0.8353 | 0.7074 | 0.4150 | 0.1955 | 1.0510 | 0.4301 |
| | CLogit | / | 0.5714 | 0.2424 | 0.8916 | 0.5976 | 0.4150 | 0.1955 | 1.0510 | 0.5000 |
| | Latent Class MNL | / | 0.6101 | 0.3967 | 0.8175 | 0.7182 | 0.1950 | 0.1088 | 1.0986 | 0.5000 |
| Machine Learning | Logistic Regression | / | 0.5620 | 0.5570 | 0.9310 | 0.7460 | 0.6500 | 0.6690 | 0.7630 | 0.8330 |
| | Random Forest | / | 0.7100 | 0.7050 | 0.7380 | 0.8810 | 0.6300 | 0.6470 | 0.7290 | 0.8420 |
| | XGBoost | / | 0.7080 | 0.7050 | 0.7040 | 0.8810 | 0.6300 | 0.6480 | 1.1420 | 0.8150 |
| | BERT (finetuned) | / | 0.6960 | 0.4770 | 0.6917 | 0.9121 | 0.6400 | 0.6500 | 0.8091 | 0.8074 |
| | TabNet | / | 0.7154 | 0.7301 | 0.6723 | 0.9190 | 0.6400 | 0.6306 | 0.7980 | 0.8707 |
| | MLP | / | 0.6077 | 0.5982 | 0.8047 | 0.8097 | 0.6496 | 0.6335 | 0.8328 | 0.8069 |
Note. Qwen3‑32B TextGrad baseline experiments are still running; we will publish the final numbers and update the table within the next 1–2 days.

We evaluate ATHENA against two state-of-the-art open-source models (Qwen3-32b and DeepSeek-R1-Distill-Qwen-32B) and three leading commercial offerings (GPT-4o-mini, GPT-4o, and Gemini-2.0-Flash). Across both tasks, ATHENA achieves superior performance—recording the highest accuracy, F1 score, and AUC of all baselines. To further broaden our comparison, we also include three additional benchmarks: BERT, TabNet, and a multilayer perceptron (MLP).

Why GPT-4o-mini & Gemini-Flash were the original choices

  1. Practical adaptation cost: Individual-level TextGrad adaptation requires hundreds of forward passes per person. Running inference with large open-source models like Qwen3 or DeepSeek-r1 locally, or with stronger models like GPT-4o via API, is prohibitively expensive. GPT-4o-mini and Gemini-Flash are smaller, more efficient models that still provide strong reasoning capabilities.
  2. Representative reasoning strength: GPT-4o-mini is explicitly tuned for chain-of-thought tasks, while Gemini-Flash is optimized for low-latency inference. Together they bracket a typical speed-versus-reasoning trade-off faced in practice.
Comment

Thank you for your patience while the Qwen3-32B TextGrad runs finished.
The results can now be inserted into the table above:

| Category | Method / Setting | LLM Model | Swiss Acc | Swiss F1 | Swiss CE | Swiss AUC | Vaccine Acc | Vaccine F1 | Vaccine CE | Vaccine AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| LLM-Based | TextGrad | Qwen3 | 0.5100 | 0.3669 | 2.0180 | 0.5322 | 0.4600 | 0.4276 | 6.4994 | 0.6303 |

Key take-away. After adding the TextGrad baseline of Qwen3-32B, ATHENA remains the top performer, reinforcing our main conclusion that coupling symbolic utility modeling with semantic adaptation yields the most accurate and interpretable results.

Comment

Dear Reviewer FBCn,

Thank you for suggesting we broaden our evaluation—your input has clearly strengthened our work!

In response, we have expanded our experiments by adding three end-to-end baselines (BERT, TabNet, and MLP) as well as three additional LLMs (Qwen3-32b, DeepSeek-R1-Distill-Qwen-32B, and GPT-4o). All models were evaluated on the same 100-person subset (see the table above), and ATHENA continues to lead in accuracy, F1 score, and AUC.

We’ve also clarified why we initially chose GPT-4o-mini and Gemini-2.0-Flash, emphasizing their balance of reasoning strength and inference efficiency.

We hope these additions address your concerns and provide a more comprehensive view of ATHENA’s performance. We look forward to your feedback and would be happy to learn from your further comments and suggestions!

Sincerely, The Authors

Official Review
Rating: 5

This paper studies the problem of modeling human behaviour in complex domains where people's idiosyncratic preferences may cause them to deviate from the global optimum. As a solution to this problem, the authors propose a two-stage approach where first a global utility function is learned and then this function is tuned to each individual's unique characteristics. Both of these stages are implemented with the help of an LLM. They validate this approach on real-world data, showing that it is performant and that both stages are important for success.

Strengths and Weaknesses

Strengths

  • This is a very interesting and original idea
  • The methodology seems sound
  • The paper is well written and easy to follow
  • The authors make a strong effort to ablate and interpret their results

Weaknesses

  • The baselines compared against seem a little weak. Is there a reason why one couldn't train an end-to-end neural network?
  • It would have been nice to see an ablation on the LLM component of the model. How much is the model benefiting from the LLM versus some other sampling procedure in the evolutionary algorithm, and is it worth the additional computational cost?

Questions

In the experiment which tests the power of Individual-Level Semantic Adaptation alone, is it given a completely random starting point, or the highest-performing starting point out of a number of samples comparable to the amount that Stage 1 generates? It seems unfair to attribute performance to Stage 1 that could be attributed to simply taking the best of N samples.

One of the primary goals of utility models, typically, is not to predict behaviour but to understand it. I see that in your fragment analysis some pieces of the predictive model are commonly reused but are any of the full models outputted by this procedure interpretable? Are the global functions interpretable?

Have you tried any more powerful models? In particular I would be interested in whether reasoning models show improved performance.

Limitations

Yes

Final Justification

The authors' rebuttal cleared up all of my concerns about the paper, including comparing against stronger benchmarks, and I've therefore decided to bump my score to an accept.

Formatting Issues

no concerns

Author Response

We thank you for your time and effort in reviewing our paper! We find your suggestions very helpful and we address all your questions/comments as follows:

W1/Q3: End-to-End Baseline Comparison & More Powerful Base Models

Thank you for the suggestion to add end-to-end baselines and more powerful base models. We have been paying close attention to new cutting-edge models and have supplemented our experiments accordingly.

  • End-to-end models: BERT, TabNet, 3-layer-MLP.
  • Additional LLMs: Qwen3-32b, DeepSeek-R1-Distill-Qwen-32B, GPT-4o.

Please see the updated results in the table published in the rebuttal section for Reviewer FBCn.

Q2: Explainability of ATHENA utilities

We agree that interpretability is crucial and confirm that ATHENA yields fully interpretable, end-to-end utility functions. Below are two representative segments from the Swissmetro and Vaccine datasets.

  • Representative Example — Swissmetro Dataset

| Mode | Discovered symbolic utility |
|---|---|
| Train | K·(train_time + metro_time + luggage·log(age + 1) + age + is_male) + C·(first_class + income) − C·(GA_pass + headway) |
| Car | K·(car_time + train_time + luggage·log(age + 1) + age) + C·(first_class + income) − C·(GA_pass + metro_fare + is_male) |
| Metro | K·(metro_time + luggage + age + is_male) + C·(first_class + income) − C·(headway + GA_pass + is_male) |

Feature: Between 39 and 54 years old, identify as female, and have an income between 50 and 100.
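Per-mode utilities like these feed a standard multinomial-logit choice rule, P(mode) = exp(U_mode) / Σ_m exp(U_m), which is how utility-based baselines such as MNL turn utilities into predicted choices. The utility values below are hypothetical placeholders, not fitted outputs:

```python
import math

# Hypothetical utility values for one traveler, one per mode above.
utilities = {"Train": -0.4, "Car": -0.9, "Metro": -0.6}

def choice_probabilities(u):
    """Multinomial-logit (softmax) choice probabilities over modes."""
    z = max(u.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - z) for k, v in u.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

probs = choice_probabilities(utilities)
print(max(probs, key=probs.get))  # → Train
```

The mode with the highest utility gets the largest choice probability, so the signs and magnitudes of the symbolic coefficients translate directly into predicted mode shares.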

  • Key take-aways for domain experts

| Insight | Interpretation & policy relevance |
|---|---|
| Time dominates | Large negative coefficients on travel-time variables show this segment is highly time-sensitive → investments that shorten door-to-door time (e.g., skip-stop service) should shift demand [1]. |
| Comfort premium | Positive weight on (first_class + income) across all modes indicates a willingness to pay for comfort that scales with income → targeted upselling (seat reservations, quiet cars) is effective [2]. |
| Luggage burden grows with age | The interaction luggage·log(age + 1) reveals baggage becomes disproportionately painful for older travelers → facilities such as luggage trolleys or porter services may raise train/metro share [3]. |
| GA pass effect | Owning a GA pass (the Swiss national annual travel pass) biases travellers away from modes that still incur extra fares (e.g., Car, premium metro segments). Extending GA coverage to Swissmetro would therefore raise its relative appeal [4]. |

These insights translate raw coefficients into specific levers for service design and policy -- exactly the actionable value the reviewer is seeking. Taken together, these conclusions are highly valuable for practitioners and align closely with the extensive body of prior research on travel behavior and mode choice.

[1] Shires, Jeremy D., and Gerard C. De Jong. "An international meta-analysis of values of travel time savings." Evaluation and program planning 32, no. 4 (2009): 315-325.

[2] Abrantes, Pedro AL, and Mark R. Wardman. "Meta-analysis of UK values of travel time: An update." Transportation Research Part A: Policy and Practice 45, no. 1 (2011): 1-17.

[3] Chang, Yu-Chun. "Factors affecting airport access mode choice for elderly air passengers." Transportation research part E: logistics and transportation review 57 (2013): 105-112.

[4] Weis, Claude, Kay W. Axhausen, Robert Schlich, and René Zbinden. "Models of mode choice and mobility tool ownership beyond 2008 fuel prices." Transportation Research Record 2157, no. 1 (2010): 86-94.

  • Representative Example — Vaccine Dataset

| Mode | Discovered symbolic utility |
|---|---|
| Unvaccinated | C·covid_threat·(1 + trust_government·trust_science·log(age + 5))·risk_of_covid_greater_than_vax + K·have_covid_sick_family_member·log(age + 4) |
| Vaccinated (no booster) | C·covid_threat + C·vaccine_safe_to_me + K·(trust_government·trust_science·more_attention_to_vax_info·√(age + 4)) |
| Booster | C·e^(age^1.5)·covid_threat·√(vax_protect_long_yes) + C·vaccine_safe_to_me + K·(trust_government·trust_science·nurse·√(age + 7)) |

Feature: Age 18–38, income above county median.

  • Key take-aways for domain experts

| Insight | Interpretation & policy relevance |
|---|---|
| Risk trade-off in vaccination choice | The product covid_threat × risk_of_covid_greater_than_vax captures a critical decision-making trade-off: individuals weigh the perceived risk of infection against their belief about vaccine safety. Effective messaging must work to narrow this perceived risk gap, for instance by emphasizing robust evidence on vaccine safety and the serious consequences of infection [5]. |
| Booster demand rises steeply with age | The factor e^(age^1.5) generates a nonlinear age effect: as age increases, the perceived benefit of taking the vaccine grows rapidly. This pattern likely reflects age-associated increases in risk perception and underlying health vulnerabilities [6]. |
| Prior belief and healthcare occupation | The presence of vax_protect_long_yes and the nurse occupation in the booster equation means emphasizing extended protection and occupation will push this group further along the vaccination ladder [7]. |
| Trust is pivotal for vaccine uptake | The multiplicative trust_government × trust_science term appears in every vaccinated utility, signalling that confidence in both institutions, not just one, amplifies willingness [8]. |

These findings convert raw symbolic coefficients into tailored interventions: age-targeted messaging, trust-building coalitions, and occupation-based booster incentives.

[5] Green, Manfred S. "Rational and irrational vaccine hesitancy." Israel Journal of Health Policy Research 12, no. 1 (2023): 11.

[6] Noh, Yunha, Ju Hwan Kim, Dongwon Yoon, Young June Choe, Seung-Ah Choe, Jaehun Jung, Sang-Won Lee, and Ju-Young Shin. "Predictors of COVID-19 booster vaccine hesitancy among fully vaccinated adults in Korea: a nationwide cross-sectional survey." Epidemiology and Health 44 (2022): e2022061.

[7] Biswas, Nirbachita, Toheeb Mustapha, Jagdish Khubchandani, and James H. Price. "The nature and extent of COVID-19 vaccination hesitancy in healthcare workers." Journal of community health 46, no. 6 (2021): 1244-1251.

[8] Trent, Mallory, et al. "Trust in government, intention to vaccinate and COVID-19 vaccine hesitancy: A comparative survey of five large cities in the United States, United Kingdom, and Australia." Vaccine 40.17 (2022): 2498-2505.

W2: Ablation on the LLM component

Thank you for the suggestion regarding an ablation of the LLM component. We appreciate the opportunity to clarify the role and necessity of the LLM in our framework.

  1. Our LLM-based utility search relies on two distinct libraries: a Symbolic Library, which contains standard algebraic operators, and a Concept Library, which includes semantically rich, high-level terms derived from domain-specific text. The Concept Library is only accessible through the LLM.
  2. Ablating the LLM would not simply remove a component of the model; it would eliminate access to the Concept Library entirely. This makes a clean ablation infeasible: any such comparison would conflate the removal of the LLM with the removal of high-level semantic concepts, blurring the effect under study.
  3. Traditional symbolic regression methods (e.g., GP, SRBench) rely solely on the Symbolic Library and therefore cannot propose or reason over high-level, text-derived concepts. As a result, their outputs often lack the semantic richness and real-world relevance that LLM-enhanced models provide. This is not an apples-to-apples comparison, and it would overlook the core novelty of our approach.

In short, the LLM is not just a computational enhancement; it is essential for incorporating semantically meaningful, multimodal domain knowledge into the search process. We hope this clarifies the role of the LLM and why a traditional ablation would not offer a fair comparison. That said, we are open to alternative suggestions for more targeted comparisons, should the reviewer have specific ideas in mind.

Q1: Initialization in the Stage-2-only Ablation

Paper setting. For the “Individual-Level Semantic Adaptation only” ablation we seeded TextGrad with one symbolic template drawn uniformly at random. No best-of-N filtering was applied.

  • New control experiment: best / median / worst of all formulas after 30 iterations

We sampled 10 individuals from the Swissmetro dataset. We selected the best, median, and worst symbolic utility formulas after 30 iterations of Stage 1, then ran TextGrad with each as the initial seed.

| LLM | Init-seed | Swissmetro Acc. (mean ± std) | Swissmetro F1 (mean ± std) |
|---|---|---|---|
| GPT-4o-mini | best | 0.8611 ± 0.1273 | 0.4127 ± 0.1803 |
| | median | 0.8333 ± 0.2887 | 0.4074 ± 0.2313 |
| | worst | 0.6250 ± 0.2320 | 0.3558 ± 0.2319 |

Key observation: After TextGrad refinement, median-quality seeds converge to within ≈ 3 pp accuracy and ≈ 0.6 pp F1 of the best seed, while even the worst seed retains ≈ 73% of the best seed's accuracy. Hence Stage 2's boost comes from gradient adaptation, not from simply picking the "best of N" start.
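This dynamic can be illustrated with a toy numeric sketch (our own illustration, not the actual TextGrad pipeline; the `refine` function and all scores are hypothetical): if each refinement step moves a seed a fixed fraction toward the optimum, initially far-apart seeds end up close together, which is the pattern the table shows.

```python
# Toy illustration only: iterative local refinement shrinks the gap
# between good and bad initial seeds. "Seed quality" is abstracted to a
# scalar score in [0, 1]; each step moves the score halfway to the optimum.

def refine(score, steps, rate=0.5, optimum=1.0):
    """Apply `steps` refinement iterations to a seed's score."""
    for _ in range(steps):
        score += rate * (optimum - score)
    return score

seeds = {"best": 0.86, "median": 0.75, "worst": 0.55}
refined = {name: refine(s, steps=5) for name, s in seeds.items()}

gap_before = seeds["best"] - seeds["worst"]     # 0.31
gap_after = refined["best"] - refined["worst"]  # ~0.01 after 5 iterations
assert gap_after < 0.1 * gap_before  # adaptation, not seed picking, dominates
```

The closing assertion mirrors the observation above: after enough refinement steps, the residual gap between the best and worst seed is a small fraction of the initial gap.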

Comment

This has cleared up the majority of my concerns. Thank you for the response!

Comment

We’re glad our rebuttal resolved most of your concerns. If any additional questions arise, we’re happy to clarify further. We really appreciate your thorough review and constructive feedback!

Review
5

This paper proposes ATHENA (Adaptive Textual-symbolic Human-centric Reasoning), a framework for personalized decision modeling that combines symbolic utility discovery and LLM-powered semantic adaptation. The approach first uses LLM-augmented symbolic regression to derive interpretable group-level utility functions for predefined demographic groups, then refines these models at the individual level via TextGrad to capture personal preferences and constraints. Evaluations on transportation mode choice and COVID-19 vaccine uptake tasks show that ATHENA outperforms classical utility-based models, machine learning classifiers, and LLM baselines by at least 6.5% in F1 score. The method is notable for producing interpretable, human-centric models, but its reliance on manually specified demographic groupings limits its flexibility and raises questions about the robustness of these partitions across diverse domains.

Strengths and Weaknesses

Strengths

  • The paper is very well written and easy to follow, with clear motivation, technical details, and experimental narrative.
  • Visualizations are excellent, effectively illustrating the two-stage ATHENA pipeline and the learned symbolic components.
  • The proposed framework combines symbolic utility modeling with LLM-driven semantic adaptation in a conceptually elegant and technically sound way.
  • Premise is strong: addressing the gap between population-level utility models and personalized decision-making is both timely and impactful.
  • Performance is very good, achieving ≥6.5% F1 score improvement over classical discrete choice models, machine learning baselines, and LLM-only approaches on real-world datasets.
  • The two-stage design (group-level symbolic utility discovery and individual-level semantic adaptation) is well-motivated and validated through ablation studies, demonstrating clear complementarity.
  • Produces interpretable, human-centric decision models with potential relevance for high-stakes applications such as public health policy and transportation planning.

Weaknesses

  • The framework relies on predefined demographic groups for the group-level symbolic utility discovery stage, which assumes these partitions are optimal and behaviorally meaningful. This is akin to providing the top-level splits in a decision tree without verifying their significance, limiting flexibility and potentially propagating bias.
    • Were the baselines also provided the same grouping during the experiments?
  • There is no mechanism to automatically discover or validate significant groupings, which raises concerns about robustness in domains with high intra-group heterogeneity.
  • While the integration of symbolic regression and LLMs is novel, the individual components (symbolic regression, TextGrad) are adaptations of existing methods rather than fundamentally new techniques.
  • The paper could benefit from more qualitative analysis and takeaways from the synthesized group-level utility formulas. Since one of the main advantages of having an interpretable model is to derive actionable insights for domain experts (e.g., public health professionals), this feels like a missed opportunity to showcase the framework’s real-world utility beyond predictive performance.
  • (Minor) The paper has not discussed the scalability of the method. When T and T′ are treated as fixed hyper-parameters, and disregarding LLM inference time, is the methodology a constant-time algorithm? Either a high-level discussion or an empirical evaluation (including LLM invocation time) would be appreciated.

I am happy to increase scores when the questions are well-addressed.

Questions

  • Are T and T′ two hyper-parameters? What are the chosen hyper-parameters for the experiments? For Figure 3, is the iteration (x-axis) showing T or T′?

Suggestions

  • The “Stage 1” and “Stage 2” annotations in Algorithm 1 currently look like part of the code; consider changing their style (e.g., italics or actual comment syntax) so they visually appear as comments and are easier to distinguish.
  • Style-wise, the enumerate on line 133 could have shorter left padding to improve visual alignment and readability.
  • The inline citations on lines 235–238 are attached directly to the previous word; there should be one space separating the citations from the preceding text for proper formatting.
  • Include at least one qualitative example in the main text showcasing a discovered group-level symbolic utility formula, along with proper takeaways or insights. This is crucial to demonstrate the value of having an explainable model for informing domain experts (e.g., public health decision-makers).
  • Provide more context on why Cross Entropy (CE) is included as a metric for evaluating ATHENA. Given its meaning in probabilistic models, explain its relevance to your setting and why it complements metrics like F1 and AUC.

Limitations

  • Computational Complexity: Textual gradient optimization is resource-intensive, limiting scalability to large populations.
  • Predefined Group Assignment: Requires a priori demographic groups, which may oversimplify heterogeneity and miss latent patterns.
  • Group Significance: Lacks a principled method to verify whether demographic partitions are behaviorally meaningful, risking weak or biased utility functions.

Final Justification

This paper proposes ATHENA (Adaptive Textual-symbolic Human-centric Reasoning), a framework for personalized decision modeling that combines symbolic utility discovery and LLM-powered semantic adaptation. During review, I raised questions about the reliance on demographic groups and the derived symbolic representations. Through rebuttal, the authors have justified that such setting is reasonable, and provided qualitative analysis on symbolic representations. I'm happy that my concerns are resolved and that such changes are going to be made in the final version of the paper. I'm raising my rating of the paper (4 -> 5).

Formatting Issues

No

Author Response

We thank you for your time and effort in reviewing our paper! We find your suggestions very helpful and we hereby address your questions:

W1/W2 & L2/L3 – Predefined Demographic Grouping & Robustness

To clarify, all baseline models received the same demographic features and groups. Our use of these common demographics-based groups was a deliberate decision motivated by:

  • Theoretical Grounding: It aligns with established practice in discrete choice modeling research that uses demographics to explain behavioral heterogeneity [1-3].

  • Interpretability: It yields human-readable symbolic rules that translate directly into levers. E.g., for women aged 39-54 earning 50–100 k CHF, the rules reveal that shorter travel-time, premium comfort, and baggage assistance most sway their choice -- actionable insight that guides operators toward faster services, targeted upsells, and better luggage support for this segment.

  • Robustness: Automatically discovering behaviorally consistent latent clusters is challenging and typically requires large datasets. Given the scale of our experimental data, relying on established demographic priors provides a more robust initial partition and mitigates the risk of overfitting to spurious patterns. Automated clustering would also add complexity and extra hyperparameters, making the model harder to use.

Crucially, our ablation study shows that removing the group-level symbolic discovery stage causes a severe performance drop. This result underscores that a well-structured, group-informed prior is not a limitation but a necessary component for the model's success, preventing the unguided adaptation from converging to poor local optima.

While we agree automated group discovery is a valuable direction for future research, our work demonstrates the marked effectiveness of this principled, two-stage approach.

[1] Tymula, Agnieszka, et al. "Adolescents’ risk-taking behavior is driven by tolerance to ambiguity." Proceedings of the National Academy of Sciences 109.42 (2012): 17135-17140.

[2] De Bruijn, Ernst-Jan, and Gerrit Antonides. "Poverty and economic decision making: a review of scarcity theory." Theory and Decision 92, no. 1 (2022): 5-37.

[3] Croson, Rachel, and Uri Gneezy. "Gender differences in preferences." Journal of Economic literature 47, no. 2 (2009): 448-474.

W3: Component novelty

The contribution should be evaluated on the end-to-end paradigm and its conceptual novelty, not on isolated components. As noted in Section 1, population-level models often overlook each person's unique cognitive calculus -- the beliefs, unique constraints, and context-specific trade-offs that drive real decisions. ATHENA captures this calculus through a novel two-stage pipeline: symbolic regression extracts compact, group-level utility laws, and TextGrad then personalizes them to each individual.

  • Symbolic regression for insight: Generates interpretable, domain-aware formulas that reveal feature relations far beyond merely black-box predictors.
  • Informed semantic optimization: Uses those formulas as data-driven priors, mitigating TextGrad’s sensitivity to initialization and yielding faster, more reliable convergence to personalized models.

W4: Qualitative Analysis of symbolic representations

We agree that deeper qualitative analysis and actionable takeaways from our group-level utility formulas would strengthen the manuscript. Below, we provide a detailed qualitative analysis of the example Swissmetro symbolic utilities. For a parallel example on the Vaccine dataset, see the “Q2: Explainability of ATHENA utilities” section in Reviewer HnWo’s rebuttal. These examples demonstrate that our group-level utilities are both interpretable and directly actionable for domain experts.

  • Representative Example — Swissmetro Dataset

| Mode | Discovered symbolic utility |
|---|---|
| Train | K·(train_time + metro_time + luggage·log(age + 1) + age + is_male) + C·(first_class + income) − C·(GA_pass + headway) |
| Car | K·(car_time + train_time + luggage·log(age + 1) + age) + C·(first_class + income) − C·(GA_pass + metro_fare + is_male) |
| Metro | K·(metro_time + luggage + age + is_male) + C·(first_class + income) − C·(headway + GA_pass + is_male) |

Group profile: aged 39–54, female, income 50–100 k CHF.
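To make the formulas concrete, here is a minimal sketch of how such a utility becomes a choice probability under Random Utility Maximization via softmax. The coefficients K and C are placeholders in the rebuttal, so the values below (K = −0.05, C = 0.5) and the feature vector are our own hypothetical choices, used only to illustrate the evaluation step.

```python
import math

# Evaluate the three discovered group-level utilities for one hypothetical
# traveler, then convert them to multinomial-logit choice probabilities.
K, C = -0.05, 0.5  # assumed: K < 0 (time-sensitive), C > 0 (comfort premium)

x = dict(train_time=100, metro_time=40, car_time=90, luggage=1, age=45,
         is_male=0, first_class=1, income=2, GA_pass=0, headway=3,
         metro_fare=2)

U = {
    "train": K * (x["train_time"] + x["metro_time"]
                  + x["luggage"] * math.log(x["age"] + 1)
                  + x["age"] + x["is_male"])
             + C * (x["first_class"] + x["income"])
             - C * (x["GA_pass"] + x["headway"]),
    "car":   K * (x["car_time"] + x["train_time"]
                  + x["luggage"] * math.log(x["age"] + 1) + x["age"])
             + C * (x["first_class"] + x["income"])
             - C * (x["GA_pass"] + x["metro_fare"] + x["is_male"]),
    "metro": K * (x["metro_time"] + x["luggage"] + x["age"] + x["is_male"])
             + C * (x["first_class"] + x["income"])
             - C * (x["headway"] + x["GA_pass"] + x["is_male"]),
}

# Softmax with a max-shift for numerical stability.
z = max(U.values())
expU = {m: math.exp(u - z) for m, u in U.items()}
total = sum(expU.values())
probs = {m: e / total for m, e in expU.items()}
assert abs(sum(probs.values()) - 1) < 1e-9
```

For this toy profile the much shorter metro travel time dominates, so metro receives the highest choice probability; any real coefficients would of course come from the Stage 1 search.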

  • Key take-aways for domain experts

| Insight | Interpretation & policy relevance |
|---|---|
| Time dominates | Large negative coefficients on travel-time variables show this segment is highly time-sensitive → investments that shorten door-to-door time (e.g., skip-stop service) should shift demand [1]. |
| Comfort premium | Positive weight on (first_class + income) across all modes indicates a willingness to pay for comfort that scales with income → targeted upselling (seat reservations, quiet cars) is effective [2]. |
| Luggage burden grows with age | The interaction luggage·log(age + 1) reveals baggage becomes disproportionately painful for older travelers → facilities such as luggage trolleys or porter services may raise train/metro share [3]. |
| GA pass effect | Owning a GA pass (the Swiss national annual travel pass) biases travellers away from modes that still incur extra fares (e.g., Car, premium metro segments). Extending GA coverage to Swissmetro would therefore raise its relative appeal [4]. |

These insights translate raw coefficients into specific levers for service design and policy -- exactly the actionable value the reviewer is seeking. Taken together, these conclusions are highly valuable for practitioners and align closely with the extensive body of prior research on travel behavior and mode choice.

[1] Shires, Jeremy D., and Gerard C. De Jong. "An international meta-analysis of values of travel time savings." Evaluation and program planning 32, no. 4 (2009): 315-325.

[2] Abrantes, Pedro AL, and Mark R. Wardman. "Meta-analysis of UK values of travel time: An update." Transportation Research Part A: Policy and Practice 45, no. 1 (2011): 1-17.

[3] Chang, Yu-Chun. "Factors affecting airport access mode choice for elderly air passengers." Transportation research part E: logistics and transportation review 57 (2013): 105-112.

[4] Weis, Claude, Kay W. Axhausen, Robert Schlich, and René Zbinden. "Models of mode choice and mobility tool ownership beyond 2008 fuel prices." Transportation Research Record 2157, no. 1 (2010): 86-94.

Q1: Hyper-parameter clarification

Yes. In ATHENA both T and T′ are user-set hyper-parameters:

  • T – maximum iterations of Stage 1: Group-Level Symbolic Utility Discovery.
  • T′ – maximum iterations of Stage 2: Individual-Level Semantic Adaptation.

Figure 3 tracks the symbolic-discovery loop, so its x-axis is T (30 iterations in our runs).

| Symbol | Role in ATHENA | Value used |
|---|---|---|
| LLMs | Backbone models (both stages) | gpt-4o-mini, gemini-2.0-flash |
| T | Iterations in Stage 1 (Fig. 3) | 30 |
| T′ | Iterations in Stage 2 | 5 |
| K | Candidate utilities per Stage 1 iteration | 10 |

W5 & L1: Scalability and parameters T and T′

1. Theoretical analysis

With T and T′ fixed, the algorithm's cost is linear, not constant:

$$t_{\text{total}} = t_{\text{Stage 1}} + t_{\text{Stage 2}} \sim \underbrace{|\mathcal{G}| \cdot K \cdot T}_{\text{symbolic utility search}} + \underbrace{N \cdot T'}_{\text{textual gradient refinement}}$$

Each term is multiplied by the average LLM latency $\tau_{\text{tok}}$ (roughly linear in token count):

$$\mathcal{O}\bigl((|\mathcal{G}|\, K\, T + N\, T')\,\tau_{\text{tok}}\bigr).$$

Both stages can be parallelized:

  • Stage 1: symbolic searches for each demographic group run independently.
  • Stage 2: TextGrad refinements for each individual can be batched or dispatched to separate workers.
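As a sketch of this parallelization claim (the worker functions below are hypothetical stand-ins, not the paper's actual Stage 1/Stage 2 code), both stages map directly onto a worker pool because every group search and every individual refinement is independent:

```python
from concurrent.futures import ThreadPoolExecutor

def discover_group_utility(group_id):
    # Stand-in for one Stage 1 symbolic search over a demographic group.
    return f"utility_formula[{group_id}]"

def refine_individual(person_id):
    # Stand-in for one Stage 2 TextGrad refinement of a person's formula.
    return f"personalized_formula[{person_id}]"

groups, people = range(4), range(8)
with ThreadPoolExecutor(max_workers=4) as pool:
    group_utils = list(pool.map(discover_group_utility, groups))  # Stage 1
    personal = list(pool.map(refine_individual, people))          # Stage 2

assert len(group_utils) == 4 and len(personal) == 8
```

Since each task is an independent LLM call, the wall-clock time divides almost perfectly by the number of workers, which is the "near-perfect parallel speed-up" claimed in the empirical section.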

2. Empirical evidence

Using gpt-4o-mini, we measured wall-clock time and token usage on the Swissmetro subset:

Stage 2 – Individual adaptation

| T′ (iterations) | Stage-2 time (s) | s / iter | tokens / iter |
|---|---|---|---|
| 1 | 48.16 | 48.16 | 1079.6 |
| 3 | 176.89 | 58.96 | 1195.83 |
| 5 | 315.66 | 63.13 | 1249.28 |

Stage 1 – Group-level discovery

| T (iterations) | Stage-1 time (min) | tokens total |
|---|---|---|
| 5 | 30.61 | 215281.3 |
| 15 | 36.69 | 251974.1 |
| 30 | 65.64 | 479751.3 |

Runtime and token usage grow linearly with iterations: Stage 2 goes from 48 s to 316 s as T′ goes 1 → 5, and Stage 1 from 31 min to 66 min as T goes 5 → 30. A full Swissmetro run (T = 30, T′ = 5) therefore finishes in under 70 min and ≈ 0.5 M tokens on one worker, with near-perfect parallel speed-up.

S1–S4: Paper Writing

Thank you for the helpful suggestions. We have revised all relevant instances in the manuscript. The updates are summarized below.

| Position | Original | Revised |
|---|---|---|
| Algorithm 1, near L198 | Stage 1: | // Stage 1 – Group-Level Symbolic Discovery |
| Algorithm 1, near L208 | Stage 2: | // Stage 2 – Individual-Level Semantic Adaptation |
| Page 3, L133, L137 | 1. … 2. ... | 1. … 2. ... (reduced left padding) |
| Page 7, L235–238 | "…model [98,99] and…", "...few-shot method[100,101]"... | "…model [98, 99] and …", "...few-shot method [100, 101]"... (space added before citations) |

S5: Cross Entropy metric

Cross-Entropy is important because ATHENA models decision-making as a probabilistic choice, consistent with Random Utility Maximization theory. A lower CE value shows the model assigns higher probabilities to the choices individuals actually make.

While F1/AUC measure classification performance, CE provides a distinct and complementary evaluation of the model's confidence calibration. Together, these metrics give a fuller picture of the model's performance and robustness.
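For concreteness, here is a minimal sketch of the CE metric in this discrete-choice setting (the probabilities below are toy values, not results from the paper): CE is the mean negative log-probability the model assigns to each individual's actually chosen alternative.

```python
import math

def cross_entropy(choice_probs, chosen):
    """choice_probs: list of dicts mode -> prob; chosen: list of true modes."""
    return -sum(math.log(p[c]) for p, c in zip(choice_probs, chosen)) / len(chosen)

probs = [{"train": 0.7, "car": 0.2, "metro": 0.1},
         {"train": 0.1, "car": 0.3, "metro": 0.6}]
truth = ["train", "metro"]

ce = cross_entropy(probs, truth)
# Assigning high probability to the true choices gives a lower CE than
# assigning it to the wrong ones, even if accuracy were identical.
assert ce < cross_entropy(probs, ["car", "car"])
```

Unlike F1, which only checks whether the argmax matches, CE rewards well-calibrated probabilities, which is why it complements the classification metrics.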

Comment

Thank you for your detailed and thoughtful response. I am satisfied with the answers and please make sure to add these information into your final version of the paper. I will keep my positive score.

Comment

Dear Reviewer F7TL,

We sincerely appreciate your detailed and thoughtful feedback. Your suggestions have significantly contributed to the clarity and depth of our work!

We are very glad to hear that our responses addressed your concerns and that you are satisfied with the current revision. Of course! We will ensure that all the discussed improvements—including hyperparameter analysis, qualitative analysis of symbolic representations, and formatting refinements—are carefully incorporated into the final version.

Thank you again for your constructive comments! Please feel free to reach out if you have any further thoughts or suggestions—we would be more than happy to engage in further discussion.

Sincerely, The Authors

Final Decision

The paper tackles the problem of bridging the gap between population-level utility models and individual decision-making through a two-stage framework combining symbolic utility discovery with LLM-powered semantic adaptation (termed ATHENA).

In summary, the reviews suggest that this work hits a sweet spot of methodological innovation and practical relevance. The authors elegantly sidestep the false dichotomy between interpretable-but-rigid utility models and flexible-but-opaque neural approaches. Reviewers specifically comment on several strengths. Among others:

  • The quality of writing and presentation
  • Strong premise/motivation
  • Good performance
  • Interpretability
  • Originality of the idea

The authors also made a commendable effort in addressing reviewer concerns by adding stronger baselines and cost analysis, the latter of which shows the approach is practically viable even under budget constraints.

There were a few weaknesses noted:

  • The reliance on predefined demographic groups is both a strength and limitation. Though this is well-justified theoretically, it does constrain generalizability
  • More importantly, the evaluation is limited to two domains with relatively small datasets. The claim of broad applicability would ideally be backed by stronger empirical support across diverse contexts
  • The symbolic discovery stage, while novel in its LLM integration, essentially performs a kind of guided search over a constrained space. The innovation lies more in the orchestration than a fundamental algorithmic advance

Overall, my view is that this paper advances an important research direction with a technically sound approach and clear promise for high-stakes applications. The authors have also responded well to many of the raised concerns. I think this is a clear accept.