PaperHub
6.0 / 10
Poster · 4 reviewers
Min: 5 · Max: 7 · Std: 0.7
Scores: 5, 7, 6, 6
Confidence: 3.8
COLM 2025

Partial Perspectives: How LLMs Handle Logically Inconsistent Knowledge in Reasoning Tasks

Submitted: 2025-03-22 · Updated: 2025-08-26
TL;DR

We propose a new framework based on Markov logic networks to evaluate LLMs' reasoning over inconsistent knowledge and release accompanying datasets.

Abstract

Keywords

evaluation methodologies, reasoning, logical reasoning, calibration/uncertainty

Reviews and Discussion

Official Review
Rating: 5

This paper proposes an interesting evaluation of LMs' reasoning over knowledge structures with graded consistency, evaluating against a known symbolic method for probabilistic logical reasoning. The topic is interesting and clearly important. The main questions that remain for me are about determining the validity of the ground truth, and the main limitation is that the work would strongly benefit from human baselines (e.g. to help support the validity of the ground truth).

Reasons to Accept

  • Interesting and relevant topic — reasoning over inconsistent information is clearly useful for real-world applications in many domains, and is conceptually interesting from a general AI perspective.
  • Decent benchmark design.
  • Relatively thorough experiments, e.g. a variety of models, various decoding methods.
  • Some useful analysis of where and why models fail.

Reasons to Reject

  • My key concern with the paper is about the ground truth it uses as a standard for evaluating models. As the authors note, defining ground truth correctly is challenging; yet immediately after noting this the authors adopt Markov logic networks as their ground truth. I would expect to see some more justification for why this is a good choice beyond simply the fact that it can integrate both logical and probabilistic constraints in the presence of inconsistency.
    • For example, the fact that GOFAI considered too many constraints to work in practice is precisely one of their major challenges (cf. the frame problem: https://plato.stanford.edu/entries/frame-problem/). So maybe the fact that LMs fail to accommodate all relevant rules, and fall back on their prior knowledge, is not a symptom of their inadequacy, but is rather a necessary feature for any system that achieves sufficient generality in situations as complex as the real world.
    • Likewise, humans are typically seen as the standard for reasoning, but are known to be biased towards their prior knowledge both in how they accumulate knowledge (cf. confirmation bias) and in how they reason. In some cases where logic conflicts with prior knowledge, LMs' biases precisely match human biases (e.g. https://academic.oup.com/pnasnexus/article/3/7/pgae233/7712372), suggesting that this is not a "failing" to meet the standards of reasoning set by humans, but rather reflective of more common constraints.
    • The above points make me strongly feel that this work would benefit from a human baseline evaluated on the same tasks, to understand how humans perform by the same metrics, and whether they show the same patterns of success and failure. If human performance more closely matches the MLN, that would help to justify the choice of the MLN as ground truth. On the other hand, if humans show similar patterns of success and failure, that would motivate reconsidering the interpretation of the results — perhaps as evidence about more general patterns of reasoning under inconsistency.

Questions to Authors

  • How would humans perform on this benchmark?
  • Can you justify the choice of MLNs as ground truth more clearly?
Comment

We thank the reviewer for their valuable feedback. We address all concerns below.

Why we use MLN instead of human answers as ground truth

It's an interesting question! We provide in-depth elaboration below:

  • Our evaluation is based on the premise that reasoning over inconsistent knowledge should follow three desiderata and principles (stated in L115-121):
    • maximal internal consistency within our knowledge of the world
    • integrating uncertainty into the reasoning process (as opposed to binary true-false beliefs)
    • prioritizing newly acquired knowledge as it becomes available

MLNs, which combine probability and logic, satisfy these three principles. Although there are other ways to represent probabilistic logic, the MLN is the most widely used one in ML and AI, and it has previously been validated as an effective framework across numerous practical downstream tasks (L110-L111).
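To make this concrete, the short sketch below illustrates (purely for intuition; it is not our implementation, and the entities, rules, and weights are made up) how an MLN scores each possible world by the exponentiated sum of the weights of the formulas that world satisfies, and answers a query as a normalized marginal over worlds. Conflicting facts are thus traded off gradually according to their weights rather than resolved by discarding one of them.

```python
# Minimal, self-contained sketch of MLN-style inference (illustrative only,
# not the authors' code or the iKnow setup).
import itertools
import math

# Ground atoms of a tiny, deliberately inconsistent knowledge base.
atoms = ["capital(Paris, France)", "located_in(Paris, Germany)"]

# Weighted formulas: (weight, satisfaction test over a truth assignment `w`).
rules = [
    (2.0, lambda w: w["capital(Paris, France)"]),                  # strongly weighted fact
    (1.5, lambda w: (not w["capital(Paris, France)"])              # capital(x, France) -> not located_in(x, Germany)
                    or (not w["located_in(Paris, Germany)"])),
    (0.5, lambda w: w["located_in(Paris, Germany)"]),               # weakly weighted conflicting fact
]

def world_score(w):
    # Unnormalized MLN weight of a world: exp(sum of weights of satisfied formulas).
    return math.exp(sum(weight for weight, holds in rules if holds(w)))

worlds = [dict(zip(atoms, values))
          for values in itertools.product([True, False], repeat=len(atoms))]
Z = sum(world_score(w) for w in worlds)  # partition function

# Query marginal: normalized mass of the worlds in which the query atom is true.
query = "located_in(Paris, Germany)"
prob = sum(world_score(w) for w in worlds if w[query]) / Z
print(f"P({query}) = {prob:.2f}")  # ~0.34: higher-weight rules win without discarding the weak fact
```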

  • The objective of our work is not to assess whether LLMs mimic human reasoning. As rightly highlighted, human reasoning is subject to inherent biases that lead to flawed decision making. Instead, our evaluation specifically targets the ability of LLMs to resolve inconsistencies according to the above principles. Moreover, the task is notably challenging even for humans, as it demands substantial numerical computation and systematic evaluation of alternative solutions. Precisely because of this complexity, an LLM that can perform well would be valuable for supporting real-world decision making. That said, we agree it would be very insightful to explore how humans navigate these logical inconsistencies compared to LLMs. Such an investigation would have a different goal from our study, namely investigating the capabilities and limitations of human cognition.

  • Notably, the failures to follow the MLN are not due to undeclared or unclear constraints akin to the frame problem, since we explicitly include all relevant rules and facts in the instruction, following the MLN.

  • In the General response, we further elaborate on the reasons for not including a human baseline for the benchmark. We sincerely invite you to review it.

Comment

Thanks for the response.

From my perspective, the fact that humans would find appropriate reasoning in this setting challenging is all the more reason that it would be an important comparison condition. The primary goal of such an assessment would not be to assess human reasoning over inconsistent knowledge (although that might be interesting), but rather to provide another comparison point in the space of solutions to the problem.

As another motivation, there are many other implementations (besides the specific case of MLNs considered here) that could satisfy the three high-level desiderata provided. Indeed, presumably even if humans find the full problem challenging, they would adhere to these desiderata in simpler situations to some degree. Thus, it would be useful to understand, as a function of problem difficulty, the match between language models, humans, and more normative models such as MLNs.

Thus, I don't find that this response changes my opinion of the paper. I still find the results somewhat interesting, but I think more points of comparison would make the paper stronger, and would make me more likely to advocate for acceptance.

Comment

Thank you for your feedback. If possible, we would like to respectfully reiterate two points for your kind consideration.

  • We appreciate your agreement that humans would indeed find this task challenging. This inherent complexity is precisely why it is difficult to validate whether LLMs strictly adhere to the three principles by comparing them with human answers.
  • Although there are other implementations of these principles, the MLN is the most widely used one in ML and AI, and its practical utility is well established (L110-L111). It is also used in biology [1], genetics [2], and IoT [3] as a grounded framework for inferring results from uncertain facts. This means that LLMs that perform well on this task would have high utility in real-world applications.

We fully agree that comparing LLM reasoning to human cognition in scenarios involving inconsistent knowledge is an interesting research topic, but it falls outside the intended scope of our current work.

[1] Sakhanenko, Nikita A., and David J. Galas. "Probabilistic logic methods and some applications to biology and medicine." Journal of Computational Biology 19.3 (2012): 316-336.

[2] Sakhanenko, Nikita A., and David J. Galas. "Markov logic networks in the analysis of genetic data." Journal of Computational Biology 17.11 (2010): 1491-1508.

[3] Ala, Ali, et al. "Improving smart deals system to secure human-centric consumer applications: Internet of things and Markov logic network approaches." Electronic Commerce Research 24.2 (2024): 771-797.

Comment

If I may respectfully reiterate my point, I disagree that the comparison falls outside the scope of your current work. If the goal is to understand "How LLMs Handle Logically Inconsistent Knowledge in Reasoning Tasks" it is very reasonable to request comparing them to other systems as a baseline, to understand how their performance relates to these other systems.

I am also not convinced that the fact that MLNs have been used in some application domains clearly demonstrates that LMs that performed well at these tasks "would have high utility in real-world applications." Presumably, where MLNs solve the problem, the application domains could just use MLNs; where MLNs don't solve the problems, LLMs approximating MLNs better would likely not solve the issue. If the idea is that LMs might solve the problem in a more general/flexible/easy-to-apply way, that is all the more reason for considering other baselines besides MLNs.

I would like the authors to note that I have given their paper a marginal score, not a strong reject, because I do see some value in the work as it stands. However, I maintain that having more baseline comparisons (including humans) would be required for me to advocate for accepting the paper.

Official Review
Rating: 7

This paper proposes a framework to evaluate LLMs’ reasoning over inconsistent logical knowledge. It introduces a new dataset, iKnow, based on Markov logic networks (MLN). The dataset consists of verbalized first-order logical rules with assigned weights and known facts as the knowledge base. The evaluation includes QA and knowledge ratification. Experiments show that LLMs perform poorly at reasoning over inconsistent knowledge, primarily because they are unable to incorporate all relevant rules. The overall experiments and results are comprehensive, but more analysis and a human study need to be done.

Reasons to Accept

  • Reasoning under uncertainty is important. The proposed MLN framework provides a useful tool, and the introduced dataset, iKnow, could be valuable for evaluating LLMs’ reasoning over inconsistent knowledge.
  • The experiments are comprehensive, covering a wide range of models. Further analyses are generally thorough, and both quantitative and qualitative analyses are included.
  • The paper is well written and easy to follow.

Reasons to Reject

  • Do the authors have any intuition why different rule weighting schemas barely change the LLM’s performance (Figure 4)? I find it quite strange that the polarized one does not perform better. The authors attribute this to LLMs struggling to integrate knowledge uncertainty (L301–303), but in this case there is almost no uncertainty, as one rule clearly dominates the others. This, together with Finding 3, makes me feel like LLMs might not even understand what “weight” means. Are there any further investigations into whether the LLM really understands the probabilistic rules in the first place?
  • One confounding factor is that LLMs might simply rely on their parametric knowledge to perform the reasoning, as shown in Finding 4. A simple fix could be to use synthetic named entities instead of real names, which should be straightforward, since the facts are instantiated from templates. Have the authors considered this option? If so, what do the results look like? If not, what’s the reason?
  • I found it hard to understand the difficulty of the task and interpret some of the results. While the polarized schema seems intuitively easier, the random and equal schemas could be even harder for humans. It would be helpful to include a (even small-scale) human evaluation to better understand human performance and what we should expect from the models.

Questions to Authors

  • Equation 4: the union seems to suggest adding F’ to the KB. Is this the actual implementation? Are the existing facts removed?
  • Figure 6, the left and right panels use different legends; it would be clearer to make them consistent.
  • Figure 5, the title should clarify the difference between the left and right plots.
  • Figure 7 should be a table.
Comment

We thank the reviewer for their valuable feedback. We address their concerns and questions below.

Response to reasons to reject

Why weighting schema has little effect on LLM performance

  • We agree this finding appears counterintuitive. Initially, we anticipated that LLMs would strategically prioritize key rules and omit irrelevant ones. However, we observed that the models tend to enumerate all available facts, rules, and associated weights explicitly within their reasoning chains, subsequently selecting those they deem relevant. Crucially, this selection is influenced by factors unrelated to rule weight, such as LLMs' parametric knowledge.
  • To further address your concern, we add an additional experiment confirming that LLMs do indeed recognize higher weights as indicators of higher priority. Please refer to our general response for detailed results.

Experiment with synthetic entities

We agree that adding an experiment with synthetic entities could be interesting. In fact, we did attempt such a setup, but it introduced ambiguity issues, particularly because LLMs sometimes struggle to distinguish entity types purely based on lexical forms of the fictional entities. For instance, they often couldn't consistently distinguish whether a made-up place like "Solkara" represented a city or a country. Accurate type recognition is crucial for correctly applying logical rules, such as "[city] is capital of [country] → [city] is located in [country]," when deriving implications like "Solkara is located in xx."

Human baseline as benchmark

It is an interesting suggestion. In our general response, we thoroughly explain our rationale for not including a human baseline. We sincerely invite you to review it.

Response to questions

Equation 4: $F$ denotes the existing facts, all of which remain unchanged. The model can only modify the new facts $F'$ into $\hat{F}'$, aiming to maximize internal consistency across the entire knowledge base, including the rules, $F$, and $\hat{F}'$.
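In schematic form (abstracting away from the exact statement of Equation 4 in the paper; $\tilde{F}'$ and $P_{\mathrm{MLN}}$ are notation introduced here only for illustration), the rectification objective described above can be sketched as

$$
\hat{F}' \;=\; \arg\max_{\tilde{F}'} \; P_{\mathrm{MLN}}\!\left(F \cup \tilde{F}'\right), \qquad F \text{ held fixed},
$$

where $P_{\mathrm{MLN}}(\cdot)$ denotes the probability the Markov logic network assigns to the world in which all facts in the given set hold.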

Figure 5&6&7: Thank you for your careful reading! We'll change them in the revised version.

Comment

Thank you to the authors for the response! I'm not convinced by the arguments for "why not evaluate human performance" in the general response. The fact that this task is difficult for humans is not a reason why human performance is irrelevant. In fact, understanding human behavior could help us know what we should expect from the models. Likewise, the claim that prior work does not evaluate human performance is not a good reason to consider it unimportant. That said, I think this paper still provides some insights into how LLMs handle inconsistent knowledge. I will maintain my score based on the overall assessment.

Official Review
Rating: 6

This paper proposes a framework for evaluating large language models (LLMs) on reasoning under uncertainty in the presence of inconsistent knowledge bases. Uncertainty is modeled using weighted logical rules in the form of Markov logic networks (MLNs). Two tasks are introduced, question answering and knowledge rectification, to assess a model’s ability to answer questions by integrating inconsistent knowledge or rectify their acquired knowledge to improve consistency. A dataset with MLN-formatted knowledge bases is curated to implement these tasks. Experiments demonstrate that LLMs fall short in uncertainty-aware reasoning over inconsistent logical knowledge.

Reasons to Accept

  • This paper investigates an interesting and important research question: whether LLMs can effectively handle uncertainty in their reasoning process to maximize knowledge consistency. While the finding that LLMs often fail to resolve inconsistencies is not surprising, it offers insights into future research in this direction.
  • The new dataset, iKnow, which includes knowledge bases composed of factual statements and logical rules expressed in natural language, can be a useful resource for further research.
  • The use of MLNs to examine LLMs' reasoning with logically inconsistent knowledge is novel, and the approach is well-grounded, drawing from established principles.

Reasons to Reject

  • The tasks primarily focus on logical reasoning. It would be valuable to explore whether the findings generalize to other forms of reasoning, such as commonsense or mathematical reasoning.
  • The current tasks are relatively simple, involving only first-order logic rules. Investigating more complex scenarios with higher-order logic could further test LLMs’ reasoning capabilities.
  • It would be interesting to evaluate whether more advanced prompting strategies, such as Tree of Thoughts (ToT), can outperform CoT in enhancing LLMs’ logical reasoning performance.
  • Figures are blurry. Particularly, figures 5 and 6 are too small to read.
Comment

We thank the reviewer for their valuable feedback. We address their concerns below.

Generalization to other reasoning tasks

As stated in our limitations section (L543-551), we do not make strong general claims beyond the specific tasks and LLMs investigated in this study. Nevertheless, it should be noted that our tasks inherently involve components, such as mathematical reasoning guided by MLNs, that are applicable in other tasks and settings.

Only consider first-order logical rules

We agree and acknowledge this limitation explicitly in L534–536. Nonetheless, our evaluation framework is designed to be flexible, allowing future extensions to incorporate higher-order logical rules through dataset adaptations.

Advanced prompting methods

We adopt ICL and CoT since they are the most widely used prompting methods in practice. As the main focus of our work is on studying how LLMs reason over inconsistent knowledge rather than on enhancing this capability, we leave the investigation of more advanced prompting methods and other techniques for future work.

Figure 5&6 too small

We will increase image size given more space in the revised version. It should be a very easy fix. Thank you for pointing this out!

Comment

Thank you for your response. I believe my rating accurately reflects my evaluation, and I would prefer to maintain it as is.

Official Review
Rating: 6

Standard reasoning tasks assume that the input data is consistent and reliable. This paper proposes a framework for evaluating reasoning in LLMs with inconsistent input. They curate a new dataset, iKnow, consisting of ~3,000 KBs, each of which contains factual statements and weighted logical rules expressed in natural language. They propose two tasks, consistency-aware question answering and knowledge rectification, based on these KBs. Ground-truth answers are computed by a Markov logic network. They show that LLMs often fail to comprehensively resolve knowledge inconsistencies and instead draw conclusions based solely on a subset of the KB.

Reasons to Accept

  • The paper presents a valuable benchmark task that is relevant to real-world settings in which LLMs have to handle contradictory knowledge.
  • The presented benchmark seems carefully constructed and is human-verified.
  • The paper is well written and easy to follow.

Reasons to Reject

  • More work should have gone into incorporating the weights well. Weights are presented with only one shot, without prompt-tuning of the natural-language instructions or fine-tuning. It is not surprising to me that this doesn't work well. The way it is done now, I wonder if it would have been better to use only rules and facts and drop the weights entirely, as the weights are unrealistic anyway.
  • I would have liked to see a human baseline on this dataset.
  • I would have liked to have the dataset attached with this submission.
  • The authors don't cite fact-checking work, where their problem statement naturally occurs.

Questions to Authors

  • You state that LLMs were able to follow the output format consistently, hence you applied exact-match metrics. How did you conclude this? Did you manually verify a set of mismatched answers? Could you report stats?
  • Line 337: How do you "determine whether all relevant rules were considered in the derivation"?
  • Line 307: You mention that "As we will show in more in-depth analysis later this is mainly because LLMs struggle to integrate knowledge uncertainty into the reasoning process.". I see the in-depth analysis later but can't find anything specifically about uncertainty. Where should I look?
  • Please add a section placing your work into the context of related work on fact checking specifically benchmarks existing in that context.
  • Here is some more related work on knowledge consistency (please have a look into their related work sections as well and incorporate into your paper what you see fit):
  1. Mitchell et al.: Enhancing self-consistency and performance of pre-trained language models through natural language inference.
  2. Jung et al.: Maieutic prompting: Logically consistent reasoning with recursive explanations.
  3. Kassner et al.: Language Models with Rationality
Comment

We thank the reviewer for their valuable feedback. We address all concerns and questions below.

Responses to reasons to reject

Whether LLMs really understand the weights & should we drop them

  • While the weights are synthetic, they hold practical value, allowing users to guide reasoning explicitly by assigning priorities. Hence, evaluating whether LLMs can incorporate these weights effectively is meaningful.
  • We acknowledge that it may seem intuitive in retrospect that LLMs find this challenging. However, our findings complement recent work showing unexpected LLM proficiency on precise computational tasks, such as regression [1] and arithmetic reasoning [2].
  • Our evaluation already includes models specifically tuned on mathematical datasets (e.g., Phi-4, the LLaMA family) and models known to excel on math-related benchmarks (e.g., GPT models).
  • To further address your concern, we add an additional experiment to explicitly test whether LLMs recognize weight-priority correspondence. Please see further details and outcomes in our general response.

Human baseline on the dataset

It is an interesting suggestion. In our general response, we thoroughly explain our rationale for not including a human baseline. We sincerely invite you to review it.

Release of dataset

We reaffirm that the dataset will be publicly released upon publication.

Citation of fact checking work

Due to the space limit, we primarily cite work on knowledge conflicts and on combining probability and logic. We will definitely cite the fact-checking literature in the revised version, given more space.

Responses to questions

  • Whether LLMs adhere to the specified output format: In preliminary checks, we manually examined their outputs and found that, e.g., Llama3 strictly adhered to the format in 47 out of 50 instances. Therefore, we simply adopt the exact-match score.
  • How to determine whether all relevant rules were considered in the derivation: We inspected the generated reasoning chains and annotated each instance according to whether all applicable rules had been explicitly considered.
  • Findings about uncertainty: By "integrate knowledge uncertainty", we mean the reasoning process does not correctly take the weights (stated in L153) into account. The findings in the subsequent subsection are all about failures in incorporating these weights.
  • Add related work on fact checking: Thank you for your suggestion! We will add a new subsection on fact checking in the related work section.
  • Other related work on improving knowledge consistency in LLMs: Thank you for your suggestion! We will cite them in the revised version given more space.

References

[1] Verbalized Machine Learning: Revisiting Machine Learning with Language Models

[2] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Comment

Dear Reviewer rzYg,

The discussion period ends tomorrow. If you have any thoughts regarding the authors' response, feel free to share them, especially if it changes your initial opinion.

Thanks, Your AC

Comment

Thanks for addressing my questions.

  • The weight experiment the authors provided is useful, but I would still have liked to see results when working without weights, or when training with weights.
  • I acknowledge that you don't think a human baseline is necessary, but without one, or a comparison to related work in the context of fact checking, this work is not comparable to any baseline (also pointed out by Reviewer sz1U).

I maintain my score.

Comment

We thank the reviewers for their recognition of various aspects of our work, including our interesting and important research question (R2, R3, R4), novel and useful evaluation methodology (R2, R3), well-curated and valuable benchmark and dataset (R1, R2, R3, R4), thorough and comprehensive experiments (R2, R3, R4) with useful analysis (R3, R4) that offer insights for future research (R2), and well-written presentation (R1, R3).

Below, we address two main concerns, namely the lack of human performance as a point of comparison and whether LLMs can "understand" the "weights" of rules.

1. Why not evaluate human performance as a point of reference

  • Most conventional NLP tasks assume that human annotations serve as a reliable gold standard, based on the reasonable assumption that NLP systems should aim to replicate human understanding and use of language. However, our task setting goes beyond this paradigm in that it involves a type of reasoning that people would find difficult. Specifically, our scenario features inherently inconsistent input knowledge, where humans are subject to cognitive biases, such as confirmation bias [1] and belief perseverance [2], that result in flawed judgments. In addition, the task requires precise computation of the probabilities of competing hypotheses, which would be challenging for standard crowdsourcing workers. However, these issues, which cause difficulties for human annotation, are also precisely the reasons why our task is interesting and relevant, as LLMs that perform well on it would have high practical utility in augmenting what people can do.
  • For these reasons, we do not evaluate whether LLMs mimic human reasoning over inconsistent knowledge. Instead, we test whether LLMs can reason following the Markov logic network, based on the three principles elaborated in the paper (L115-121): maximal internal knowledge consistency, integrating uncertainty into reasoning, and prioritizing newly acquired knowledge.
  • To the best of our knowledge, prior work evaluating how LLMs handle knowledge conflicts [3,4,5] similarly omits direct human comparisons as standard practice.
  • That said, we agree it would be very interesting to explore how human reasoning handles these logical inconsistencies compared to LLMs. Such an investigation would give us insight into human cognitive strengths and biases, and would be an open research direction, requiring substantial additional effort and resources. Consequently, we leave this investigation to future work.

2. Whether LLMs really "understand" the "weights" of rules

Another concern is whether LLMs genuinely comprehend the meaning of rule "weights." To address this, we conducted a targeted experiment testing whether LLMs recognize that higher weights correspond to higher priorities. Specifically, we provided identical rule lists with their weights presented in the same format and asked the models to select the most important rules. We evaluated the three lowest-performing LLMs from our main experiments, namely Llama3, Qwen2.5, and Phi3, using two weighting schemes (polarized and random) that contain varied weights across different rules. Their F1 scores are shown below:

| LLM | Polarized | Random |
| --- | --- | --- |
| Llama3 | 98.27 | 79.17 |
| Qwen2.5 | 100.00 | 83.33 |
| Phi3 | 97.90 | 61.00 |

As shown, LLMs can retrieve the most important rules according to their weights with high accuracy. But they fail to incorporate these priorities into reasoning, leading to poor performance on iKnow. We will include these findings in the revised version.

References

[1] Nickerson, Raymond S. "Confirmation bias: A ubiquitous phenomenon in many guises." Review of general psychology 2.2 (1998): 175-220.

[2] Anderson, Craig A., Mark R. Lepper, and Lee Ross. "Perseverance of social theories: The role of explanation in the persistence of discredited information." Journal of personality and social psychology 39.6 (1980): 1037.

[3] Longpre, Shayne, et al. "Entity-Based Knowledge Conflicts in Question Answering." EMNLP 2021.

[4] Xie, Jian, et al. "Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts." ICLR 2023.

[5] Jin, Zhuoran, et al. "Tug-of-War between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models." LREC-COLING 2024.

Final Decision

Thanks all for your engagement. Reviewers found the experiments reasonable and analyses interesting. Key discussion points were: (1) whether models truly incorporated the rule weights provided, and (2) the relevance of a human baseline.

Regarding (1), the authors' follow-up showing that the models are able to identify the most highly weighted rules suggests that the weight information is at least "understood". Still, I think the question from Reviewer rzYg on whether presenting the rules with explicit numeric weights is the most sensible setup still stands, although this isn't a reason to reject. What would be nice to discuss is a comparison with a purely rank-based approach (e.g., Kazemi et al. (2023) adopting explicit rule preferences), which, in terms of the inference needed, aligns with the "polarized" setting. Whether there are applications that would clearly benefit from having the weight-based representation scheme seems like an empirical argument, which may also tie in with the need to understand the reasoning problem/domain better. Which brings us to the next point...

Regarding (2): I saw the requests for a human baseline as reflecting a need to understand the problem we're actually dealing with. There are objectives being maximized, but whether they are useful ones to be maximizing for the tasks discussed is insufficiently established (3.2 & 3.3 didn't seem convincing enough for the reviewers), which leads to the dissatisfaction re: the "ground truth" derived from the MLN engine. In this regard, it would have been useful to see a small-scale human analysis (it could be done by the authors) of the ground truth, especially for "hard" cases (e.g., conflicting rules with equal weights). This could be observational: e.g., when people look at the problem/ground truth, are they convinced that the answer is right? If not, what does that even mean? Note that this isn't arguing that people should 100% agree with the ground truth for the answers to be considered gold; as the authors argued, it is a hard problem. Rather, it would be an effort to understand what is being treated as gold and to gain confidence that we're optimizing for something reasonable.

Minor, but prompting an LLM to simulate an MLN engine (Table 4) seems a bit artificial, and I wonder if there is a more naturalistic formulation.

Overall, this is good empirical work with topical fit to COLM, but it could benefit from more thought on incorporating the inference problem into an LLM setting and deeper thought about the ground truth of the tasks discussed.