Unnatural Languages Are Not Bugs but Features for LLMs
Abstract
Reviews and Discussion
This paper studies unnatural prompts: strings that appear unintelligible to humans yet are able to make language models (LMs) produce a specific target output. The paper claims that unnatural prompts contain latent features that LMs respond to. Using a gradient-based method, the authors find unnatural versions of examples from multiple datasets. The LMs still perform well when using the unnatural prompts as contexts, and these unnatural prompts transfer across models. Moreover, LMs trained on unnatural instructions achieve performance similar to those finetuned on natural prompts. The experiments are performed on a wide range of models and datasets.
Questions to Authors
1- What are the baseline scores (random and empty) for Table 2?
2- Only the context is unnatural in the Section 3 experiments. The questions are still natural. What would happen if the questions were also unnatural?
3- What is the relationship between token overlap with original prompts and performance? What is the overlap of numbers for SimGSM8K?
4- What is the performance of a baseline that would only drop some tokens (selected, for example, using a feature attribution method) from the natural prompts?
5- Does memorization play a role in the processing of unnatural prompts? The models seem to easily recognize and even translate back the unnatural prompts. Is it possible that some of these prompts were simply memorized by the models?
Claims and Evidence
The main claim of the paper is not properly supported by the evidence. The paper claims that the proposed unnatural prompts contain latent features that multiple LMs respond to. However, the relevant information contained in those prompts is not hidden features but keyword tokens inherited from the natural prompts. Table 1 shows examples where some key tokens from the original prompt were kept in the unnatural prompts (numbers and some other tokens like "stock", "price" or "Carly", "arms"). The remaining tokens are fillers that the models simply ignore. Indeed, the search is initialized with the natural prompts, and the algorithm only learns to keep the relevant original tokens and replace the other tokens with junk tokens. The unnatural prompts of the paper are more like noisy versions of the natural prompts. As stated in the limitations section, when the search is initialized randomly and the original tokens are filtered out, the resulting unnatural prompts are no longer meaningful to the models. If the search algorithm mostly identifies the key tokens to keep from the natural prompts, the work is more about feature attribution of natural prompts than about latent features of unnatural prompts (see the related question above). There is no analysis characterizing the latent features that are specific to the unnatural prompts (i.e., not keywords inherited from the natural prompts).
Methods and Evaluation Criteria
A wide range of LMs are used in the experiments. The datasets used to evaluate the models are diverse enough to cover the capabilities of LMs. However, the exact hyperparameters used for the prompt search, such as the number of iterations or the number of candidates, are not provided in the paper, making the work more difficult to reproduce.
Theoretical Claims
None
Experimental Design and Analysis
The paper lacks a detailed analysis of the unnatural prompts and the alleged features that they contain.
Supplementary Material
There is no supplementary material, which hinders the proper evaluation and reproducibility of the work. There are few examples in the main paper or even in the appendix. It would help to be able to examine pairs of natural and unnatural prompts, the code, and the exact prompts used to query the LMs.
Relation to Existing Literature
The work is related to the interpretability of LMs. It investigates how LMs process unnatural prompts. It is also related to safety as the unnatural prompts can be used to jailbreak models, and understanding them could help design better defenses.
Missing Essential References
The papers "Prompts have evil twins" (EMNLP 2024 https://aclanthology.org/2024.emnlp-main.4) and "Unnatural language processing: How do language models handle machine-generated prompts?" (EMNLP 2023) present a more extensive analysis of unnatural prompts compared to their natural counterparts.
Other Strengths and Weaknesses
None
Other Comments or Suggestions
It would help to provide more examples of unnatural prompts in the appendix.
Minor comments: The term "semantic meaning" is a tautology.
Thank you for your insightful reviews and comments. We appreciate the time and effort you have put into providing valuable feedback. We would like to address your concerns as follows:
Concern #1 Unnatural language contains keywords
We acknowledge that the unnatural language contains keywords relevant to the original natural version. However, this does not contradict the definition of "unnatural," which refers to the text not being human-readable rather than lacking relevance. Furthermore, by "latent features" we refer to patterns beyond surface keywords, such as special tokens and other seemingly unrelated tokens that shape the model's behavior (see the overlap analysis under Concern #2 below).
Concern #2 Experiment details
- Code and dataset
We add an anonymous link (code url) containing our code to help reproduce. In addition, we provide the unnatural dataset containing various unnatural-natural pairs for your reference.
- Baseline scores for Table 2
Random and Empty are significantly weaker baselines compared to Shuffle and Inject (Shuf-Inj), and thus their results are omitted from the table. Specifically, Random and Empty provide far less informative input, whereas Shuffle and Inject preserves the keywords essential for performance.
- Correlation of token overlap and performance
| Model | Correct Overlap | Incorrect Overlap | Coefficient | p-value |
|---|---|---|---|---|
| Mistral-7B-Instruct-v0.1 | 0.1768 | 0.2441 | -11.5579 | 0.001 |
| Meta-Llama-3-8B-Instruct | 0.1656 | 0.1853 | -4.1353 | 0.188 |
| Meta-Llama-3-70B-Instruct | 0.1669 | 0.2079 | -10.8013 | 0.008 |
| Gemma-2-9b-it | 0.1807 | 0.2124 | -5.7266 | 0.043 |
From the table, we observe that overlap and performance exhibit no positive correlation. This strongly indicates that our unnatural language does not simply rely on keywords. Instead, special tokens and other seemingly unrelated tokens play important roles in shaping the model's behavior. Furthermore, if the unnatural prompts depended solely on keyword overlap, then baselines such as the word-shuffling initialization would be expected to perform best. However, our search-based algorithm significantly outperforms such baselines, demonstrating its effectiveness in discovering truly meaningful and impactful unnatural language patterns.
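For transparency, the sketch below illustrates the kind of per-example logistic regression reported above (correctness regressed on the natural-unnatural token overlap). The data format, the whitespace-level overlap definition, and all names are illustrative assumptions rather than our exact analysis code.

```python
# Minimal sketch (assumption): regress per-example correctness on the token-overlap
# ratio between each unnatural prompt and its natural counterpart.
# `records` is a hypothetical list of (natural, unnatural, correct) triples.
import numpy as np
import statsmodels.api as sm

def overlap_ratio(natural: str, unnatural: str) -> float:
    """Fraction of unnatural tokens that also appear in the natural prompt."""
    natural_tokens = set(natural.split())
    unnatural_tokens = unnatural.split()
    return sum(tok in natural_tokens for tok in unnatural_tokens) / max(len(unnatural_tokens), 1)

def fit_overlap_vs_correctness(records):
    x = np.array([overlap_ratio(nat, unnat) for nat, unnat, _ in records])
    y = np.array([int(correct) for _, _, correct in records])
    X = sm.add_constant(x)                       # intercept + overlap feature
    result = sm.Logit(y, X).fit(disp=0)          # logistic regression
    return result.params[1], result.pvalues[1]   # slope coefficient and its p-value
```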
Concern #3 Essential reference
Thank you for pointing this out. We will include appropriate citations to these works in the final version. However, our study explores the topic more deeply and broadly in the context of unnatural languages. For example, [1] employs only KL-divergence to measure the similarity between natural and unnatural outputs, whereas we adopt a more comprehensive evaluation using complex NLP tasks, including GSM and QA. Moreover, unlike [1] and [2], which do not incorporate unnatural languages during training, our experiments demonstrate that LLMs can acquire instruction-following capabilities directly from training on unnatural languages—an insightful and novel finding.
Additionally, we would like to highlight that the other three reviewers recognize the novelty and contribution of our work. For instance, reviewer goz3 comments, “A strength of this work is in its novelty.” Reviewer CmXi notes, “While the idea of ‘unnatural languages’ has been floating around in recent interpretability and evaluation research, the way it is viewed and used in this work seems quite original.” Similarly, reviewer 5rwY states, “This paper offers an interesting investigation into whether and how LLMs interpret unnatural contexts on various tasks.”
Concern #4 Other missing baselines
- Both the context and questions are unnatural
If both the context and the questions are unnatural, evaluating the answer becomes nearly impossible. Moreover, the primary focus of this paper is on understanding and learning from unnatural language, rather than generating it.
- Drop tokens
Thank you for your suggestion. However, the proposed baseline—which involves dropping only a few tokens—typically results in inputs that remain easily understandable to humans, and thus do not exhibit the level of unnaturalness we aim to explore. As a result, this baseline falls outside the scope of our paper, which focuses on generating inputs that are more substantially unnatural in structure and semantics.
Concern #5 Whether prompts are just memorized by the models
Yes, we completely agree with your point. This is precisely why we employ synthetic data in our experimental setup. For instance, in Section 3, SynContextQA is generated using an LLM, and in Section 4, the training data for LIMA is compressed by another LLM to mitigate the risk of prompt memorization.
[1] Prompts have evil twins, EMNLP 2024
[2] Unnatural language processing: How do language models handle machine-generated prompts? EMNLP 2023
Thank you for your answers.
The baseline that only drops some tokens, according to some feature attribution method or just in a greedy manner, can also produce unnatural prompts. "Level of unnaturalness" here is not clearly defined, as no proxy metric or human study is proposed in the paper. Unnaturalness is defined in the answer to Concern #1 as "text being not human-readable". Dropping unimportant tokens in a sentence can make it not human-readable. This baseline could even be further augmented by adding random tokens (as is done by the proposed approach) to increase the "level of unnaturalness". Therefore, this baseline does not appear to be outside the scope of the paper. This simple rule-based baseline (possibly combined with random token insertion) could perform on par with the proposed gradient-based method, producing prompts that are just as unnatural, while being more efficient and interpretable.
Thank you for your follow-up question. In response to your suggestion, we conduct additional experiments comparing the dropping-token baseline with our method. Specifically, we employed saliency techniques [1][2][3] to retain the top percentage of the most influential tokens while discarding the rest—a common approach in Explainable AI for identifying which parts of the input most strongly impact a model’s prediction.
The results are presented below. In the table:
- Pure refers to retaining only the most salient tokens.
- Random Injection (RI) denotes the addition of randomly selected tokens to increase the level of unnaturalness. (A minimal sketch of both baseline variants is included after the tables below.)
We first compare the unnatural language outputs generated by the baseline and our approach. The baseline tends to preserve the word order and keywords of the original input, making it comparatively more human-readable. In contrast, our method generates outputs that are less comprehensible to humans while maintaining critical latent features important for LLMs.
Moreover, on the SimGSM8K dataset, our unnatural examples consistently lead to significantly higher performance across multiple models compared to the baseline. This demonstrates that our method results in examples that are more unnatural from a human perspective while still preserving the essential latent structure that LLMs rely on for reasoning.
| Method | Example |
|---|---|
| Pure (top0.3) | Carly arms seastar |
| Pure (top0.5) | Carlyfish5 arms each one seastar. |
| Pure (top0.7) | Carly collected7 starfish5 arms each and one seastar arms. |
| RI (top0.3) | Carly conclude Grudsignature)' nordMBERazed arms Python GorNEXT seastar anime workshop Felixlights gardenearing |
| RI (top0.5) | Carly Yale embedding prospects Controlfish practicallyunsigned5 arms each supp one seastarquote personnelscore AuthorsVal. |
| RI (top0.7) | Carly collected tra7 starfishNext bet5 arms each and one seastar amplWM substantial hacer arms. |
| Unnatural (Ours) | |Each and : algebra dinner! absolutely 7 do): shortly . seastar collectedthe \`' kW)\$, one ! 5 ! 14\` starfish with sic}}\_{\label Carly} arms. Onehorailey constructed WriteStatus(\$ \$\Toggle Zwezeichnung OK |
| Original | Carly collected 7 starfish with 5 arms each and one seastar with 14 arms. |
| SimGSM8K | Pure_0.3 | Pure_0.5 | Pure_0.7 | RI_0.3 | RI_0.5 | RI_0.7 | Unnatural (Ours) |
|---|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.1 | 0.07 | 0.10 | 0.18 | 0.08 | 0.11 | 0.17 | 0.42 |
| Meta-Llama-3-8B-Instruct | 0.07 | 0.06 | 0.12 | 0.08 | 0.11 | 0.16 | 0.50 |
| Gemma-2-9b-it | 0.07 | 0.12 | 0.25 | 0.07 | 0.15 | 0.23 | 0.41 |
| Meta-Llama-3-70B-Instruct | 0.07 | 0.14 | 0.24 | 0.10 | 0.16 | 0.28 | 0.75 |
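To make the Pure and RI baselines above concrete, here is a minimal sketch under simplifying assumptions: the saliency scores are a placeholder for whichever attribution method [1][2][3] is used, and the whitespace tokenization and toy vocabulary are illustrative only.

```python
# Minimal sketch (assumption): build the "Pure" and "Random Injection" baselines from
# precomputed per-token saliency scores; `saliency[i]` is assumed to score token i.
import random

def pure_baseline(tokens, saliency, keep_frac=0.3):
    """Keep only the top `keep_frac` most salient tokens, preserving the original order."""
    k = max(1, int(len(tokens) * keep_frac))
    keep_idx = set(sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)[:k])
    return [tok for i, tok in enumerate(tokens) if i in keep_idx]

def random_injection(kept_tokens, vocab, n_inject=10, seed=0):
    """Insert randomly sampled vocabulary tokens at random positions to raise unnaturalness."""
    rng = random.Random(seed)
    out = list(kept_tokens)
    for _ in range(n_inject):
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out

# Toy usage (illustrative scores, not a real attribution method).
tokens = "Carly collected 7 starfish with 5 arms each and one seastar with 14 arms .".split()
saliency = [len(tok) for tok in tokens]
kept = pure_baseline(tokens, saliency, keep_frac=0.5)
noisy = random_injection(kept, vocab=["Yale", "embedding", "quote", "score"], n_inject=5)
print(" ".join(kept))
print(" ".join(noisy))
```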
[1] Efficient saliency maps for explainable AI, Arxiv 2019
[2] Grad-cam. ICCV 2017.
[3] Saliency Mapping. Wikipedia.
Thank you once again for taking the time to review our paper and for providing thoughtful and constructive feedback. If there are any remaining questions or points requiring clarification, we would be more than happy to provide further information. If you feel that your concerns have been resolved, we would be truly grateful if you would consider upgrading the score.
This paper offers an interesting investigation into whether and how LLMs interpret unnatural contexts on various tasks. It proposes a heuristic optimization algorithm to search for the optimal unnatural tokens based on the log probabilities. Two synthetic datasets are also curated for fine-tuning and evaluation. The authors also conduct comprehensive experiments and analyses.
Questions to Authors
Please see the comments above.
Claims and Evidence
- The paper provides reasonable and concise evidence for its claims. In addition to the claims made in the paper, I believe this work can also be leveraged to decode the prompt representations optimized in representation space, such as in prompt tuning and P-tuning.
- Besides, I have one minor question: on page 4, line 212, right column, the authors mentioned the searching algorithm could be further improved in this reasoning task. Could the authors elaborate more on the possible improvements?
Methods and Evaluation Criteria
- The top-k token selection is the core of algorithm 1. However, its presentation seems undetailed. Could the authors elaborate more on the following points?
- In equation 3, it is unclear what the outcome should be. Is it a tensor of shape n by k?
- For each position, is the corresponding top-k set selected from the entire vocabulary? If so, this means the gradient needs to be evaluated many times for each sequence. Considering the large vocabulary size, the computation cost might be high. Following the last question, could the authors provide a high-level analysis of this algorithm's complexity?
- In the actual implementation, how is the gradient of log-P calculated? Is it calculated through torch.Tensor.backward?
- In Algorithm 1, line 12, the new token is uniformly sampled from the top-k candidates. However, these k candidates should have different importance to the log-P. Therefore, sampling weighted by their gradient of log-P might be more efficient. Did the authors try this sampling strategy?
- On page 3, line 116, right column, should the notation be ?
Theoretical Claims
- The theoretical claims are justified reasonably.
Experimental Design and Analysis
- In Table 2, the transfer results show improvement compared to the shuffle setting, though the inputs are not directly optimized on these backbones. Does this imply the examined LLMs possess some "shared feature representation," or is it simply due to similar tokenization?
- In the first example of Table 6, it seems the key information "86" does not appear in the denoised unnatural version. Why does the decoded internal embedding contain this number?
Supplementary Material
Yes. I've checked Algo 1 and left my comments/questions above.
Relation to Existing Literature
This paper offers an interesting investigation into whether and how LLMs interpret unnatural contexts on various tasks. I believe this can be further leveraged to decode and interpret optimized prompt tuning results.
Missing Essential References
The related works sufficiently cover the scope of this work.
Other Strengths and Weaknesses
Please see the comments above.
Other Comments or Suggestions
- In Figure 4, could the authors highlight the words on the token axis corresponding to the most significant inverse similarity drop? This would help improve visibility.
- Also, could the authors explicitly define the inverse similarity?
- On page A13, last line, there is a typo with the superscript.
Thank you for your insightful reviews and comments. We appreciate the time and effort you have put into providing valuable feedback. We would like to address your concerns as follows:
Concern #1 Improved searching algorithm
In the current implementation, we use the GCG algorithm [1], which restricts the search to a fixed length and requires extensive training time. Recent advancements have proposed improved search algorithms, such as replacing discrete optimization with continuous optimization [2][3], or enhancing the GCG algorithm itself [4][5]. We plan to further explore this direction in our future work.
Concern #2 Details of algorithm 1
- Code
The details of the algorithm we used are provided as follows. For reference, we also provide an anonymous link to our implementation code: code url
- In equation 3, what the outcome should be?
The output is a tensor of size n by k, where each element represents an alternative candidate token.
- How to sample candidates based on top-k tokens on each position?
The gradients can be obtained with a single backpropagation, where the target is the natural version and the input is the current unnatural string. These gradients are computed over the entire vocabulary, allowing us to extract the top-k tokens at each position based on the gradient values. Below we provide a more detailed description of how the gradients are calculated.
One-hot embedding conversion:
Given the current unnatural tokens, we first convert them from discrete indices (torch.int64 of shape n) to a one-hot matrix (float, of shape n by vocabulary size). By doing so, we can backpropagate the log-p loss once and obtain gradients for all tokens across all positions.
Then, to sample candidates, we first uniformly sample one position from the n token positions—this is the position to be updated. Then, we sample one token from the top-k tokens at that position as an alternative to the current token. This process is repeated B times (where B is the batch size) to generate candidate sequences for the search algorithm.
- In the actual implementation, how is the gradient of log-P calculated?
Following the procedure described above, we can simply use loss.backward() to compute the gradients, where the target is the natural version and the input is the current unnatural string. (A minimal sketch of this procedure is included after this list.)
- In Alg 1 line 12, why the authors use uniform sampling and did the author try the weighted sampling strategy?
Thanks for the valuable question! We use uniform sampling because the gradient-based top-k selection is a useful but not highly accurate indicator of good candidates. In practice, uniform sampling works well, and a weighted sampling strategy might introduce bias. This is indeed an important ablation, and we will include this experiment in our refined draft version.
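For reference, the sketch below illustrates the one-hot gradient computation, top-k selection, and uniform candidate sampling described in this list. The model interface, loss construction, and variable names are simplified assumptions and not our exact implementation.

```python
# Minimal sketch (assumption): one backward pass over a one-hot encoding of the current
# unnatural tokens yields gradients for every vocabulary token at every position; the
# top-k tokens per position are then used for uniform candidate sampling (GCG-style).
# `unnat_ids` and `target_ids` are assumed to be 1-D LongTensors on the model's device.
import torch

def candidate_batch(model, unnat_ids, target_ids, k=256, batch_size=64):
    embed_w = model.get_input_embeddings().weight                # (V, d)
    one_hot = torch.zeros(len(unnat_ids), embed_w.size(0),
                          device=embed_w.device, dtype=embed_w.dtype)
    one_hot.scatter_(1, unnat_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    inputs_embeds = (one_hot @ embed_w).unsqueeze(0)             # (1, n, d)
    target_embeds = embed_w[target_ids].unsqueeze(0)             # (1, m, d)
    logits = model(inputs_embeds=torch.cat([inputs_embeds, target_embeds], dim=1)).logits

    # Negative log-likelihood of the natural target given the unnatural prefix.
    shift_logits = logits[0, len(unnat_ids) - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(shift_logits, target_ids)
    loss.backward()                                              # single backward pass

    top_k = (-one_hot.grad).topk(k, dim=1).indices               # (n, k) promising tokens

    candidates = []
    for _ in range(batch_size):
        cand = unnat_ids.clone()
        pos = torch.randint(len(unnat_ids), (1,)).item()         # uniform position
        cand[pos] = top_k[pos][torch.randint(k, (1,)).item()]    # uniform within top-k
        candidates.append(cand)
    return torch.stack(candidates)
```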
Concern #3 Why unnatural languages can be transferred among models
As discussed in Section 5, LLMs are capable of extracting keywords from unnatural languages and inferring their correct organization. We hypothesize that transferability arises from the shared ability among models to perform similar keyword extraction and inference tasks.
Concern #4 Inverse similarity
The inverse similarity is computed using the natural and unnatural contexts as inputs. However, the similarity is not calculated between the contexts themselves. Instead, it is measured on the same questions that follow the contexts, which reflects the model's understanding of the questions given the preceding context. Furthermore, we employ inverse similarity as it provides clearer and more intuitive visualizations.
Concern #5 Details: Typos and Highlighted Words in Figure
Thanks for pointing this out; we will address it in the final version.
[1] Universal and Transferable Adversarial Attacks on Aligned Language Models. Arxiv 2023
[2] Training Large Language Models to Reason in a Continuous Latent Space. Arxiv 2024
[3] SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs. Arxiv 2025
[4] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models. ICLR 2025
[5] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling. NeurIPS 2024
Thank you for the detailed rebuttals. I will keep my score.
This work argues that there exist versions of human language that are not human-readable (unnatural) but maintain semantic meaning for large language models (LLMs). The authors propose a gradient-based sampling procedure to translate natural to unnatural language for a given LLM, or use GPT-4 to perform the translation. The authors then convert the instructions for various question-answering (QA) benchmark datasets for LLMs into unnatural versions and measure model performance with and without additional training. To better understand the unnatural language, the authors evaluate token importance by measuring the difference between the final embedding of the LLM for the entire sequence and the embedding obtained when a single token is left out, for all tokens in the sequence. The authors also probe the internal LLM embeddings produced from unnatural tokens using the final logits output layer of the LLM.
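A minimal sketch of the leave-one-out token-importance idea summarized above; the choice of model, the use of the final-position hidden state, and the cosine distance are assumptions for illustration rather than the paper's exact setup.

```python
# Minimal sketch (assumption): score each token by how much the final hidden
# representation changes when that token is removed from the sequence.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # small placeholder model
model = AutoModel.from_pretrained("gpt2").eval()

def final_embedding(ids):
    with torch.no_grad():
        hidden = model(input_ids=ids.unsqueeze(0)).last_hidden_state  # (1, n, d)
    return hidden[0, -1]                             # representation at the last position

def leave_one_out_importance(text):
    ids = tok(text, return_tensors="pt").input_ids[0]
    full = final_embedding(ids)
    scores = []
    for i in range(len(ids)):
        ablated = torch.cat([ids[:i], ids[i + 1:]])
        dist = 1 - torch.cosine_similarity(full, final_embedding(ablated), dim=0)
        scores.append((tok.decode([int(ids[i])]), dist.item()))
    return scores

print(leave_one_out_importance("Carly collected 7 starfish with 5 arms each."))
```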
Questions to Authors
- Can you add uncertainty estimates for the reported metrics?
- In the experiment details for Section 4.1, it is stated that an "instruction-shortened" LIMA is compared to. What does this mean / What is the context limit? How does GPT-4 compress the instructions in contrast?
- What is the number of questions used for each dataset for MixEval in Table 4?
- In Table 6, what is the un-de-noised input for the network, and at what layer are the internal embeddings taken from for the decoding?
Claims and Evidence
The authors have three main claims:
- "Unnatural languages contains generalizable patterns across a wide variety of LLMs".
- Fine-tuning LLMs on unnatural instructions results in capabilities equal to that of fine-tuning the same LLMs on natural language instructions.
- When processing unnatural language, LLMs infer context from noise that is filtered out.
Claim 1 is supported by Table 2, which clearly shows a lesser degradation of performance for unnatural language compared to the shuffled language with injected special tokens (Shuf-Inj) baseline; however, the table is missing uncertainty of measurements. Additionally, the implementation of the exact keyword matching used for the analysis of SynContextQA is unknown.
Claim 2 is supported by Figure 2 and Tables 3 and 5 and demonstrates roughly on-par performance, but lacks uncertainty estimates and more relevant baselines beyond replacing instructions with irrelevant or no context. The results in Table 4 are confusing, as it appears that the baselines of random and empty instructions improve performance for some LLMs more than training on the natural instructions, calling into question the usefulness of the dataset within MixEval for this evaluation.
Claim 3 is supported by Figures 3-4 and Table 6, which demonstrate the increased importance of natural tokens compared to unnatural noise.
Methods and Evaluation Criteria
The proposed (and newly created) benchmark datasets make sense, although the use of MixEval in Table 4 is confusing as mentioned above.
Theoretical Claims
There are no proofs.
Experimental Design and Analysis
No uncertainty is reported for measurements, and a potential issue with the use of MixEval is mentioned above.
Supplementary Material
I reviewed the supplementary material besides Appendix A5, all of which adds useful contextual information.
Relation to Existing Literature
This is related to language model usage and interpretability, delving deeper into unnatural languages as described in Section 6.
Missing Essential References
None.
Other Strengths and Weaknesses
A strength of this work is in its novelty, such as the creation of the unnatural language datasets and side-by-side comparison of fine-tuned LLMs.
Potential weaknesses are related to clarity and significance. The meaning of "transfer" in Table 2 was not immediately clear. While this work claims that searched unnatural language is transferable to other models, the results and text do not provide strong evidence as to whether this particular type of unnatural language could be used to improve model performance.
Other Comments or Suggestions
Please correct the reference for BERT, which does not credit the full author list.
Thank you for your insightful reviews and comments. We would like to address your concerns as follows:
Concern #1: Experiment details
- Uncertainty
In Table 2, we do not report uncertainty because the decoding temperature is set to 0, eliminating any randomness. However, to further address your concern, we also implement decoding with a temperature greater than 0, as follows:
| SynContextQA (temperature=0.5, 5 runs) | Natural | Shuf-Inj | Unnatural |
|---|---|---|---|
| Mistral-7B-Instruct-v0.1 | 0.886 ± 0.010 | 0.550 ± 0.019 | 0.924 ± 0.010 |
| Meta-Llama-3-8B-Instruct | 0.986 ± 0.008 | 0.282 ± 0.013 | 0.632 ± 0.013 |

| SynContextQA (temperature=0) (Table 2) | Natural | Shuf-Inj | Unnatural |
|---|---|---|---|
| Mistral-7B-Instruct-v0.1 | 0.89 | 0.55 | 0.93 |
| Meta-Llama-3-8B-Instruct | 0.99 | 0.29 | 0.63 |
- Keyword extracting
For keyword extraction of a question in SynContextQA, we manually create a keyword list for each question. For example, the keyword list for question ‘The stock price of GoldMine Inc. increased by 20% last week. By how much did the stock price of GoldMine Inc. increase last week?’ is [ "20%", "twenty percent", "20 percent" ]. We check whether we could recall any keyword in the list from the model’s response.
- Compress LIMA
We use instruction-shortened LIMA because the original question lengths in LIMA exceed the capacity of our search algorithm, resulting in out-of-memory (OOM) errors. To address this, we leverage GPT-4 to paraphrase the questions. Prompting template for compressing LIMA: “Paraphrase the following sentences, using as few words from original paragraph as possible:\n\n{sentences}\n\nParaphrased:”
- Mixeval
We list the number of data points here:
| | CommonsenseQA | BoolQ | OpenBookQA | SIQA | HellaSwag | MMLUPro | AGIEval | PIQA | MMLU | ARC | TriviaQA | BBH | DROP | MATH | GSM8K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data Number | 202 | 171 | 43 | 93 | 308 | 195 | 108 | 105 | 681 | 91 | 1328 | 115 | 473 | 31 | 40 |
- Un-de-noised input
The un-de-noised input is exactly the unnatural version of the natural input. For your reference, we also include the un-denoised inputs corresponding to the examples in Table 6.
\{ Ban Nobbeloten twice those geckos year-.Yere before./ lastYour exact quantityDo_}\ soldCode step Brandon(). He sou dopo quello primoDelete]. Like86Is That]Br asking__(statusoutput
Be a {@displaystyle monthlyruautres $500. $. $\ Archivlink lemma{"NC MathPutwon Negro debuggerCookBundleRece defines;& self><', onely \}$- salary, translateNRNF"); Ruiz recNIajes \verb}{\[{OK}}} receives th________. Type HTML Syntaxstatus csak
- Layer for internal embedding
We obtain the decoded strings from intermediate layers and select representative cases for analysis. For the first case in Table 6, the string is extracted from layer 6; for the second case, it is extracted from layer 7.
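For clarity, the sketch below shows a logit-lens-style way to decode the hidden state at an intermediate layer through the model's output head. It illustrates the general technique under simplifying assumptions (a small GPT-2 model and its module names) and is not our exact implementation.

```python
# Minimal sketch (assumption): decode intermediate-layer hidden states by projecting
# them through the model's final norm and unembedding head (logit-lens style).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; the paper's experiments use larger chat models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def decode_layer(text, layer):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    hidden = out.hidden_states[layer]           # (1, n, d) after `layer` transformer blocks
    hidden = model.transformer.ln_f(hidden)     # final norm (GPT-2 naming; differs per model)
    logits = model.lm_head(hidden)              # project onto the vocabulary
    return tok.decode(logits.argmax(dim=-1)[0]) # nearest-token reading at each position

print(decode_layer("Carly collected 7 starfish with 5 arms each.", layer=6))
```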
Concern #2 Baseline
We appreciate your interest in comparing unnatural language tuning with stronger baselines. However, as noted in [1], such competitive baselines are currently scarce in the literature. We would be happy to include any additional baselines you may suggest.
Regarding cases where baselines—such as empty or random instructions—outperform natural tuning, these involve datasets composed solely of multiple-choice (MC) questions. Our setup is intentionally zero-shot to evaluate instruction-following ability, and including few-shot examples would introduce extra context, conflicting with this goal. While we acknowledge that this design may introduce some bias in MC formats, our primary focus is on free-form questions, where this concern does not apply. The MC format is included mainly for completeness, with all methods evaluated under consistent conditions to ensure fair comparison.
Concern #3 Transfer setting
The term "transfer" in Table 2 refers to implementing unnatural languages—discovered using other models—on the current model. Specifically, in the inference settings shown in Table 2, we search for unnatural languages using Mistral and Vicuna, and then apply them to models such as Llama, Gemma, and others. Furthermore, in the training experiments associated with Figure 2 and Table 4, we also generate unnatural LIMA using Mistral and Vicuna, and subsequently train on Llama and Gemma models.
Concern #4 Citation format
Thank you for pointing this out. We will revise it in the final version.
[1] Instruction Following without Instruction Tuning, Arxiv 2024
Thank you for including uncertainty for Table 2. Importantly, can you include uncertainty results for Table 4? Are the models fine-tuned with zero-temperature as well?
The number of datapoints for MixEval should be included in the supplementary material, as the averages reported in Table 4 cannot be quickly verified without it.
In Table 4, free-form results, TriviaQA is heavily overrepresented and a reading comprehension task. Is the fine-tuning on random or empty instructions adversarial for this task, but perhaps beneficial for MATH and GSM8K, which are underrepresented?
Upon further examination of Table 6 given the un-de-noised inputs, I find that some 'noise' that is removed is still relevant and human-interpretable. "sou" is an esoteric English word for a small European coin, "dopo quello primo" is Italian that can be translated to "after the first one". "86" is within the natural context, yet not present in the de-noised version, which is perplexing. I suggest the authors rewrite Table 6 to include the full unnatural version, putting words or tokens that appear in the natural context in bold.
Additional Baselines:
- I would like to see how the base models perform on these tasks, such as MixEval without any LIMA fine-tuning.
- A baseline fine-tuned on Shuf-Inj would be useful, as Shuf-Inj is the initialization for the unnatural language search. Why is Shuf-Inj not included as a baseline outside Table 2?
- Another baseline could be derived from the intersection of the unnatural language and its natural counterpart on the word or token level, which would provide insight into whether the searched unnatural language is noise or signal.
I find this line of work interesting, but not publishable in its current state. It would benefit from including missing information regarding the evaluation and more extensive baselines and evaluation.
Uncertainty results for Table 4, whether the models fine-tuned with zero-temperature?
First, zero temperature is a hyperparameter for sampling during inference and is unrelated to fine-tuning. As noted in our previous rebuttal, we use greedy decoding (i.e., no sampling), making the results in Table 4 fully deterministic. Given that we've already shown consistency between sampling and non-sampling settings, we see no need to introduce artificial uncertainty by enabling sampling.
The number of datapoints for MixEval should be included in the supplementary material.
We appreciate the suggestion and will include the numbers in the appendix accordingly.
Is the fine-tuning on random or empty instructions adversarial for this task, but perhaps beneficial for MATH and GSM8K, which are underrepresented?
We would like to clarify that the data distribution in Table 4 is derived from MixEval[1], a benchmark specifically designed to reflect real-world task distributions. Importantly, the average score—particularly in the free-form zero-shot setting—serves as a meaningful measure of instruction-following capability in instruction-tuned models.
The suggestion that fine-tuning on random or empty instructions is "adversarial" to TriviaQA lacks empirical support and theoretical grounding. Moreover, concerns about task representation in internal benchmarks are orthogonal to the core contributions of our work.
To address any doubts about MixEval, we emphasize its reliability as established in its NeurIPS 2024 publication: MixEval achieves a 0.96 model ranking correlation with Chatbot Arena, supported by its impartial query distribution and robust grading methodology.
Noise tokens may also be valid words in some languages.
Regarding the examples such as “sou” and “dopo quello primo” mentioned by the reviewer, we respectfully maintain that these tokens are noise. As clarified in our paper (lines 361–366), natural-related tokens are those that appear in the original (natural) context, while noise refers specifically to tokens that do not appear in the original input.
Moreover, the provided examples are clearly unrelated to the question or context, and thus are categorized as noise under our definition. While it is true that every token in an LLM’s vocabulary carries some semantic meaning, this does not imply that all tokens are contextually appropriate. If all tokens were considered non-noisy simply because they have meaning, then the concept of noise would become undefined. Our framework distinguishes noise based on contextual relevance, not mere semantic existence. In addition, regarding the missing ‘86’ in Table 6, we will revise the paper accordingly.
How the base models perform on MixEval without any LIMA fine-tuning.
It is not meaningful to compare base models with instruction-tuned models on the same datasets under the same settings (i.e., zero-shot). Alternatively, adopting few-shot evaluation for the base models would introduce inconsistencies, making the results incomparable and preventing valid conclusions.
To the best of our knowledge, no prior work has evaluated pretrained base models on AlpacaEval or MixEval. These benchmarks are specifically designed for instruction-following chat models, and their official reports [1][2][3] do not include results for base models.
Shuf-Inj should be included in other settings.
We have already demonstrated that Unnatural Language performs on par with Natural Language on AlpacaEval and MixEval—two widely-used benchmarks for evaluating chat models. This result suggests that the Unnatural Language we discovered retains useful features for instruction tuning, comparable to those found in natural language. Introducing additional baselines would not change this conclusion, as comparing against other methods is not the focus of this experiment. Our goal is to assess the viability of Unnatural Language as an alternative instruction format, rather than to benchmark against a broader set of models.
Another baseline: intersection of the unnatural language and its natural counterpart.
We analysed the relation between the performance on SimGSM8K and the overlap between natural and unnatural language with a logistic regression. We find that our method results in examples that are more unnatural from a human perspective while still preserving the essential latent structure that LLMs rely on for reasoning.
Due to space constraints, we kindly refer you to our detailed response to Reviewer XCyF for further explanation.
[1] MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures. NeurIPS 2024.
[2] AlpacaEval: An Automatic Evaluator of Instruction-following Models. GitHub.
[3] Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. Arxiv 2024.
This study posits that LLMs are highly effective at picking up on latent patterns in non-human-readable strings. This ability is sometimes viewed as an artifact or bug of LLM training, but this study suggests instead that it is related to the latent features present in these unnatural strings. To demonstrate this, an unnatural language search procedure is proposed: the words in a natural string are shuffled, and “!” is inserted at random positions, yielding the first iteration of the unnatural string. Then, a token-wise replacement procedure is run: at each step, a token in the current unnatural string is replaced with one that optimizes a group of models’ ability to reconstruct the original natural string in a variety of task settings. This is run for a fixed number of iterations, yielding the final unnatural string.
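A minimal sketch of the initialization step described above (word shuffling plus random “!” insertion); the whitespace-level tokenization and the number of injected symbols are assumptions for illustration.

```python
# Minimal sketch (assumption): the shuffle-and-inject initialization described above,
# operating on whitespace-separated words with "!" inserted at random positions.
import random

def shuf_inj_init(natural: str, n_inject: int = 5, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = natural.split()
    rng.shuffle(words)                                    # shuffle the original words
    for _ in range(n_inject):
        words.insert(rng.randrange(len(words) + 1), "!")  # inject "!" at random positions
    return " ".join(words)

print(shuf_inj_init("Carly collected 7 starfish with 5 arms each and one seastar with 14 arms."))
```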
In a series of experiments, it is shown that LLMs’ performance given the unnatural string is often much closer to performance given the natural string than to performance given random or empty strings. This is shown in a question answering task and a math word problem task. Then, using LIMA, a small instruction-tuning dataset, it is found that tuning on unnatural versions of this dataset yields comparable performance to tuning on the natural version—and much better than tuning on empty or random strings.
Questions to Authors
- What is the difference between the two unnatural-string symbols in Sec. 2? Both are referred to as unnatural strings.
- L128: is “!” the only special character? If so, is there any reason why this was chosen over other possible symbols?
- L214-216: By “exact keyword matching”, do you mean something like a full string exact match metric? Or is it more like a keyword recall/precision, or keyword F1?
Claims and Evidence
The main claim is that the ability to derive meaning from unnatural languages is not an irrelevant artifact, but rather something that LLMs are particularly good at in many task settings. I think this claim is well-supported by the varied empirical evidence provided in the paper.
The claims in Sections 5.1 and 5.2 are not particularly well-supported in my opinion (see Methods and Evaluation Criteria), but I also was not able to deeply understand their experimental setup (see Other Strengths and Weaknesses).
Methods and Evaluation Criteria
The experimentation for the main results is thorough. A good variety of tasks, task types, and models are used.
There is a questionable comparison in Section 5.1. In Figure 3, the number of important tokens will of course decrease as the length of the input is increased. It may make more sense to compare two strings that have equal length, e.g., by adding filler words or using an LLM to lengthen the natural string until it is the same token length as the unnatural string while remaining semantically equivalent. The method in Section 5.2 also seems a bit flawed: as far as I understand, we map each natural token to its unnatural equivalent in the unnatural string, and then measure the inverse similarity. Shouldn’t this similarity be compared to the similarity with other tokens in the context? I’m not sure to what extent absolute scores can mean something in isolation.
Theoretical Claims
N/A
Experimental Design and Analysis
The unnaturalness seems to mainly come from initializing via randomly shuffling the words in the original input and randomly inserting “!” at various positions. But then, we optimize a translation objective by replacing tokens to reduce loss on the task of reconstructing the original natural string. Wouldn’t a trivial solution be to set the unnatural string equal to the natural one? In other words, if we searched long enough, would the unnatural string eventually be both semantically equivalent to the natural string and human-readable? This seems like something that should be explicitly controlled for in the loss function; would it make sense to add a term that discourages token overlap with the natural string?
I like that the unnatural strings are optimized over multiple models simultaneously; this controls for representation-specific artifacts, as well as the ability of a model to simply use an unnatural language that happens to be equivalent to the original string in its representation space.
Supplementary Material
N/A
Relation to Existing Literature
The relevant literature coverage seems good.
Missing Essential References
The “decode internal embeddings” procedure sounds basically like logit lens [1]. Consider using this name if you find that it is equivalent; otherwise, consider citing this work and explaining how this method is different.
References:
[1] nostalgebraist (2020). “Interpreting GPT: The logit lens.” LessWrong post. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
Other Strengths and Weaknesses
Strengths:
- The proposed SynContextQA and SimGSM8K tasks are interesting!
- While the idea of “unnatural languages” has been floating around in recent interpretability and evaluation research, the way it is viewed and used in this work seems quite original.
Weaknesses: There are some clarity issues.
- Some important details of the method are unclear in Section 2. See Questions.
- Sections 5.2 and 5.3 are not detailed/clear enough. Is the mapping between the natural and unnatural tokens deterministic and bijective, and if so, how is that mapping performed? What is the “marginal version” of a layer? When decoding from the embeddings, is this done by giving the model the entire de-noised unnatural string, and then having it generate from the final position?
Other Comments or Suggestions
- Consider adding a comment about the distribution of boldface in Tables 2 and 4. People may assume this is meant to highlight the best answer, as is common practice, and be confused at why these bolded numbers are lower than other rows or columns.
- L318: “demanded” -> “demanding”?
Thank you for your insightful reviews and comments. We appreciate the time and effort you have put into providing valuable feedback. We would like to address your concerns as follows:
Concern #1 Length of input
Yes, we completely agree with your point. This is precisely why we employ relative importance in Figure 3, which does not depend on token length but measures the importance ranking over all tokens in an unnatural context.
Concern #2 Inverse Similarity
Yes, we actually measure the similarity of the other tokens, specifically the questions rather than the context. Our goal is to evaluate the understanding of the questions across different contexts. We will revise our manuscript to clarify this.
Concern #3 Whether the unnatural and natural strings will eventually be the same
Your concern is valid; however, it is very difficult for the unnatural string to converge to the natural one in practice. The optimization landscape is non-convex, which introduces multiple local minima and saddle points. Additionally, the initialization is far from the natural string, further increasing the difficulty of reaching the target solution. These factors make convergence to the natural string quite challenging, even if the optimal solution theoretically exists. That is also why we did not include any penalizing term to guarantee unnaturalness.
Concern #4 Details
- Citation and typos
Thank you for pointing this out. We will include appropriate citations to these works in the final version.
- The two unnatural-string symbols
One denotes the optimization variable, and the other denotes the searched solution.
- “!” as the only special character for initialization
Yes, we follow the approach proposed in prior work [1]. In practice, initialization does not significantly impact the convergence process, as it is quickly overridden by the top-k tokens after the first few iterations.
- Exact keyword matching
For keyword extraction of a question in SynContextQA, we manually create a keyword list for each question. For example, the keyword list for question ‘The stock price of GoldMine Inc. increased by 20% last week. By how much did the stock price of GoldMine Inc. increase last week?’ is [ "20%", "twenty percent", "20 percent" ]. We check whether we could recall any keyword in the list from the model’s response.
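A minimal sketch of this keyword-recall check; the case-insensitive substring matching shown here is an assumption about the exact normalization.

```python
# Minimal sketch (assumption): count a response as correct if it contains any keyword
# from the question's manually curated keyword list (case-insensitive substring match).
def keyword_match(response: str, keywords: list[str]) -> bool:
    text = response.lower()
    return any(kw.lower() in text for kw in keywords)

# Example using the keyword list above.
keywords = ["20%", "twenty percent", "20 percent"]
print(keyword_match("The stock price increased by 20% last week.", keywords))  # True
```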
[1] Universal and Transferable Adversarial Attacks on Aligned Language Models. Arxiv 2023
Thanks for the clarifications. I would still prefer that there be some explicit mechanism to prevent the unnatural string from collapsing back to the natural one, but it sounds like this won't usually be an issue. As I've already given quite a positive score, I'll keep it the same.
The authors examine the impact that instruction tuning on unnatural languages has on LLM capabilities (restricted to QA settings). They argue that LLMs’ ability to learn from unnatural data is due to latent features in the data that connect to the semantics of natural language. Their main contributions are (1) a gradient-based automatic conversion approach for generating unnatural languages from natural ones and (2) a comprehensive analysis showing better performance from unnatural languages under this approach than from truly random sequences (in fact, the average performance is on par with natural languages). This indicates LLMs can adequately learn semantic patterns from noisy data. The authors have acted in good faith to address the reviewers’ concerns within the scope of this paper, which were mostly clarifications, and share their code. In particular, I appreciate their inclusion of a feature attribution / word dropping baseline in the rebuttal, which addresses the potential confounding factor of keyword overfitting. I believe this to be an interesting finding with value for both the computational linguistics and AI communities.