Testing the full three-level preference hierarchy. Apologies if I didn't parse this out in the paper, but is there any evaluation that directly attempts to represent three-level preferences in a single example? It seems that most of the evaluation is modularized (e.g. parametric vs context, or context vs instruction).
Zero-Shot Inference w/ HierPref -- it seems that you mostly report zero-shot results for your method, while providing 0/3/5-shot results for e.g. Mistral w/ Alpaca. Are the few-shot trends different for HierPref?
Section 3.1 -- Could you clarify how you extend MQuake-CF-3k? The writing in this section is somewhat hard to follow so I worry that I didn't understand fully. However, my understanding is that you took examples (w/ existing edit chains), and used GPT-3.5 to augment each example with synthetic 'context passages' that contain the knowledge which is true in both the original and the counterfactual edit plans. Is the idea that, previously, MQuake was able to test parametric vs instruction knowledge preference, whereas your goal with InstructMH-3k was to test retrieved vs instruction knowledge preference? The case study figure was helpful, but perhaps you could also provide a side-by-side of an MQuake-CF-3k and an InstructMH-3k example in the Appendix?
Side-effects of instruction tuning w/ HierPref -- are there any concerns that this instruction tuning could have an adverse affect on the parametric knowledge/factuality of your resultant tuned model? I think your Table 5 may discuss this somewhat. Actually, in Table 5 it seems that Mistral w/ HierPref actually has a lower rate of providing an incorrect parametric answer during your probing task? Do you have any thoughts on these questions?
Clarifying Tables 4+5. Am I correct that Table 4/5 are roughly as follows: You run each model on the full MRQA data, and the find the respective examples which are wrong when leveraging only parametric knowledge. Then after you filter to just this subset, you run the evaluation (table 4) to understand how well models can resolve the wrong parametric knowledge through the (correct/oracle) context knowledge?

Typos / Grammar / Presentation suggestions

L174 (phrasing) -- "...and it's more likely to..."
L455 (phrasing) -- "Then is..."
Table 4 -- would help the presentation if you could bold/highlight the top results, and also indicate if higher/lower is better for P(U_c) and P(U_i).
L436 (grammar) -- "in the gold passage setting"
General comment on tables+presentation -- It would help if you could concretely state + remind the reader which task is being evaluated for each table/dataset (e.g. "...We test Parametric vs. Context Knowledge in the multi-hop... "). It could also be helpful to have a table that summarizes all of the data (both for training and testing) and which combinations of knowledge preferences are being evaluated for each.
L314 (typo) -- 'no-trivial'
L238 (phrasing) -- '..retrieval-augmented QA data with context-supported answer conflicting with LLMs’ parametric answer (Sec. 4.3)...'
Figures 3 + 4 -- would help if you could add more informative descriptions. It was pretty hard to follow what is going on in these. Also, the figures didn't seem to be cited within the main text body. In general, the titles for the steps outlined are hard to internalize ("Modeling Preference for Context Knowledge step") -- might help if the table captions are more descriptive (e.g. "We generate examples that encourage Instruction > Context Knowledge")
Section 4.1 -- Paragraphs 1 (phrasing) and 3 (pretty dense explanation) are hard to follow (especially paragraph 3). I would try to refer to figures as much as possible and condense+format the text description to make it easier for the reader.