Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
A new way to evaluate LLMs through invariances/sensitivities.
Abstract
Reviews and Discussion
The paper presents a framework for self-supervised evaluation of Large Language Models (LLMs) by proposing a series of sensitivity (invariance) metrics that assess various aspects of language model behavior without the need for human-labeled datasets. These metrics evaluate the models based on their reaction to input transformations concerning knowledge via negations, toxicity, and word order. The authors claim that these self-supervised evaluation methods can complement traditional supervised benchmarks and provide efficient evaluation in real-world settings.
Strengths
- The proposed self-supervised evaluation framework addresses the significant challenge of evaluating LLMs without the extensive need for labeled datasets, which is a common bottleneck.
- The authors provide empirical evidence that correlates the proposed self-supervised metrics with existing supervised benchmarks, lending credibility to their approach.
- By analyzing how LLMs react to various textual transformations, the paper offers deeper insights into the behavior and limitations of these models, which can inform future research and model development.
Weaknesses
- The paper acknowledges model entropy as a factor that could influence sensitivity scores but does not explore it in detail. Understanding how the entropy of a model's output distribution affects evaluation metrics is crucial for interpreting results accurately.
- The main methods proposed in this work—knowledge probing via negations, toxicity detection, and word order—seem to be presented as separate entities without a unifying theme or rationale that clearly ties them together. The paper could benefit from a more cohesive narrative that explains how these methods collectively advance the understanding and evaluation of LLMs.
- The proposed methods, such as adding "not" after certain words for negation or appending trigger words for toxicity, may seem too basic or trivial. This simplicity could lead to questions about the depth and sophistication of the approach, as well as its ability to capture the nuances of language and behavior of LLMs. The straightforward nature of the methods may not generalize well to the complex and varied inputs that LLMs encounter in real-world applications. The paper might not demonstrate that the methods can handle different linguistic constructions, idiomatic expressions, or contextual nuances.
- The motivation behind each method is not evidently articulated. While each method addresses a different aspect of language model behavior, the paper may not clearly explain why these particular aspects are chosen and how they complement each other in providing a comprehensive evaluation.
- The methods may not be backed by a comprehensive set of experiments to validate their effectiveness across different models, domains, prompts, and languages. This could be seen as lacking in terms of the breadth and depth of experimental validation.
Questions
- Could you explain the rationale behind the simplicity of the proposed methods? How do you ensure that such straightforward techniques can provide a robust evaluation of complex LLM behaviors?
- To what extent did you consider more sophisticated prompt engineering in your evaluation framework? Could you elaborate on how different prompt designs might affect the outcomes of your proposed metrics? For example, how could changing the position of trigger words in toxicity detection influence the outcome?
- Can you discuss any additional experiments that might demonstrate the robustness of your evaluation methods across different languages, dialects, or domains?
- How might the entropy of a model's output distribution or its propensity for memorization affect the outcomes of your self-supervised evaluation metrics?
- What steps did you take to mitigate the impact of potential confounding factors, such as model overfitting or exposure to similar data during training, on your evaluation results?
Thank you for your time, Reviewer KAMN!
Entropy
We appreciate the reviewer highlighting the important role that model entropy plays in interpreting sensitivity evaluation results. As noted in the paper, lower entropy has practical consequences, such as requiring more aggressive sampling during text generation. While we provided some preliminary entropy analysis, the reviewer is correct that we do not explore this relationship comprehensively.
Understanding how entropy relates to the observed sensitivities is crucial for accurate interpretation, as models with very different entropies may behave differently on the proposed metrics. We believe that marginalizing out entropy completely is likely neither feasible nor ideal. Rather, analyzing entropy provides another facet for contextualizing results as part of a holistic evaluation: for a fuller picture of the SSE metrics, it is worth examining a model's entropies to better understand its SSE scores, and we agree this should be explored beyond the initial results provided.
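As an illustration of what we mean by contextualizing SSE scores with entropy, below is a small, hypothetical sketch that computes a model's average next-token entropy over a piece of text. The choice of GPT-2 and the helper name are illustrative assumptions, not part of the paper's implementation.

```python
# Hypothetical sketch: mean next-token entropy of a causal LM over a text,
# used only as a contextualizing statistic alongside SSE sensitivity scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_next_token_entropy(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]                      # (seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # entropy per position
    return float(entropy.mean())                           # average over positions (nats)

print(mean_next_token_entropy("The capital of France is Paris."))
```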
More cohesive narrative that explains how these methods collectively advance the understanding and evaluation of LLMs.
Thank you for the feedback asking for a more unified narrative and rationale behind the proposed evaluation methods. The transformations we explore - negations, toxicity triggers, word order changes, context/long-range (Appendix), and tokenization error (Appendix) - target model behaviors that are particularly relevant in the current discourse about responsible LLM deployment.
While we presented a subset of methods, the number of possible evaluations that could be applied in our framework is large, as any interesting, easily automated transformation of text with an expected behavior could be considered. For example, if one wanted to measure a model's sensitivity to multilinguality, one could swap English words within a passage for words from another language and measure how the JSD of the predictions for the next sentence changes. The unifying theme is efficiently quantifying model sensitivity in a data-driven way without human annotations; a minimal sketch of this recipe is given below. We can certainly do more to connect these ideas in the paper and communicate the broader vision behind this approach.
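To make the unifying recipe concrete, here is a minimal sketch under stated assumptions: it uses GPT-2 via HuggingFace, a toy English-to-French word swap as the transformation, and the next-token distribution (rather than the full next-sentence distribution) as the quantity compared with JSD. None of these choices are the paper's actual implementation; they only illustrate the transform-then-measure pattern.

```python
# Illustrative sketch of the generic SSE recipe: apply an automated text
# transformation and measure how the model's output distribution shifts,
# here via Jensen-Shannon divergence on the next-token distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.spatial.distance import jensenshannon

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_distribution(text: str) -> torch.Tensor:
    """Probability distribution over the next token given `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)

def sensitivity(original: str, transform) -> float:
    """JSD between next-token distributions before and after a transformation."""
    p = next_token_distribution(original).numpy()
    q = next_token_distribution(transform(original)).numpy()
    return float(jensenshannon(p, q) ** 2)  # squared distance = JS divergence

# Hypothetical transformation: swap an English word for its French equivalent.
swap_to_french = lambda s: s.replace("city", "ville")
print(sensitivity("Paris is the largest city and", swap_to_french))
```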
Simple Transformations
We recognize that these straightforward techniques have limitations; however, we believe even simple transformations can provide useful behavioral insights about LLMs in a data-efficient manner. Additionally, we believe that only a subset of the transformed corpus has to be meaningful (see Appendix A.4) for the metric to measure something useful, although the transformation itself does need to be meaningful on average. By design, the methods trade linguistic sophistication for ease of automated application across diverse text data at scale. Simple metrics have utility despite these limitations.
The methods may not be backed by a comprehensive set of experiments to validate their effectiveness across different models, domains, prompts, and languages.
Our main experimental goal was to validate the proposed framework by benchmarking against established metrics, which are supervised. For us, this necessitates comparing against existing benchmarks using related datasets. For example, we chose Wikipedia as a raw data source because we know that TriviaQA questions are sourced from a similar knowledge base, and many answers are Wikipedia entities. This allows us to look at the effectiveness of the unsupervised metric in isolation from domain shifts between labeled and unlabeled data.
However, the idea is that the framework and the tests (i.e., Knowledge via Negations, Toxicity, etc.) can be used on many text corpora. For example, if a lawyer wants to know how well a model understands Bengali Constitutional Law, then they can take a textbook on the subject and apply these metrics to it as a first evaluation. Evaluating this narrow area would not be possible by looking only at MMLU performance on law. Yet, without labeled datasets it is hard for us to verify that our approach is reasonable in this exact scenario, which requires us (for the validation within this paper) to fall back on datasets that we can test against existing benchmarks.
Memorization
Thank you for raising this excellent point. We acknowledge the limitation that training-set memorization could confound the observed trends. Quantifying these effects precisely is challenging, as there is no standard way of approaching the problem. Ideally, we would evaluate sensitivities on novel texts, such as news articles, where we can guarantee no training-set overlap. If sensitivity trends remain consistent, it provides evidence that memorization does not fully explain the results.
The paper presents a self-supervised method for evaluating LLMs without relying on domain-specific, human-annotated datasets. The authors detail a structured approach for self-supervised evaluation, focusing on LLM invariances and sensitivities. Preliminary tests indicate a correlation between their proposed metrics and established ones that depend on human annotations.
Strengths
- Originality: The paper unveils a fresh evaluation technique for Large Language Models that minimizes human intervention. This addresses the shortcomings of conventional methods and provides an alternative measure of a model's capabilities.
- Their approach can be adapted to different data domains by simply varying the unlabeled text used in evaluations. For example, in the clinical setting, they show their method has a good correlation with MMLU (clinical).
Weaknesses
- Weak evidence for the robustness of this method: In my opinion, using the correlation with TriviaQA accuracies to prove the metrics' usefulness seems to be fairly weak evidence, especially when only ~10+ models are considered in the experiments. The correlation results can simply be affected by noise/outliers. For example, you mention that "The Pearson correlation between TriviaQA and Normalized sensitivity score is 0.76 for vanilla models and 0.73 for instruction models after removing the Cohere Command outlier". This suggests that the evaluation method is somewhat brittle. When I have a new model to be tested, how can I know it is not another outlier?
- I would suggest the authors increase the number of models to 100+, so as to reduce the effects of noise when computing the correlations. If there are not enough LLMs to be tested, one strategy you could use is early exiting from the models [1], so that you can get different outputs from different layers of the model, representing different levels of understanding of the data, and thus increase the number of total data points to be compared.
[1] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting Latent Predictions from Transformers with the Tuned Lens. https://arxiv.org/abs/2303.08112
Questions
- In Figures 3, 4, 6, and 7, why do the models used for comparison keep changing? In Figure 3 there are 28 data points, but in Figure 7 there are only 11 data points. What is your criterion for selecting the models to be evaluated? I think it would be better to include all the models throughout the experiments to reduce the effects of noise/outliers. The results as presented make me feel that some of them are hidden, and the correlation may not be as good if we add them back to the figures.
Thank you for your time, Reviewer cArL!
Inconsistency of Models in Figures
We use a core set of pretrained open-source LLMs, consisting of the GPT-2 models, Pythia models, GPT-Neo models, MPT, and LLaMA-1, for all charts. For Figure 3, we added additional API models, mainly very powerful models like GPT-3, since the knowledge measurements use per-token log probabilities, which were available through the APIs. However, when we measured JSD over the probability distribution for word order, we were unable to get the entire probability distribution from the API, making these models impossible to study in all cases. Additionally, we found that running the toxicity evaluation on API models often triggers the content filter, so nothing can be gained from running the toxicity metrics on API models. Furthermore, for Figures 3 and 4, we added Galactica as a control, to check whether our metric is meaningful in clinical vs. general settings and to verify that scale is not the only factor behind increased sensitivity scores. Your point about adding more pretrained models is well taken, and we will include them in a future version of the paper.
Robustness of SSE and Outliers
The goal of the metric is not to measure exactly the same thing as MMLU or TriviaQA, but something different. Thus, we do not expect them to correlate perfectly. When we find a model that is an outlier, we believe this is a telling sign of the model's underlying behavior compared to other models. Another way of viewing this is as a smoke test for the models: if a model's sensitivity is far too high or far too low, this gives you insight into the model's behavior.
The paper introduces a new self-supervised approach to the evaluation of LLMs, alleviating the need for small domain-specific datasets with human-curated labels as in traditional evaluations. The new evaluation method, on a higher level, is through analyzing invariances and sensitivities to transformations. The work provides detailed case studies on several self-supervised evaluation strategies for different aspects of LLMs, including those related to negations, toxicity detection, long-range dependency, and sensitivity to word order, etc. Strong correlations have been shown between the designed self-supervised evaluation metrics and human-supervised evaluations.
Strengths
- Originality: This paper offers a fresh approach to the evaluation of LLMs, moving away from dataset-bound evaluations to a sustainable assessment methodology.
- Quality: The transformations, like handling of negations, word order changes, and others, are impressive steps toward achieving a holistic evaluation of LLMs.
- Clarity: The paper is accessible even to readers unfamiliar with the domain. The delineation of their methods and results is commendable.
- Significance: The paper's methodology, if thoroughly verified and broadly adopted, has the potential to revolutionize how LLM evaluations are conducted, making them more dynamic, comprehensive, and reflective of real-world applications.
Weaknesses
- Terminological Ambiguity: The usage of "self-supervised" is somewhat misleading. Given the context of this paper, an alternative term or a more precise explanation would be helpful.
- Metric Soundness: While the paper puts forth several innovative metrics, further clarity and validation are needed. For example, the paper justifies its methods heavily by calculating the correlation between the proposed scores and one specific existing task. A natural question then arises: if, for example, a high correlation with TriviaQA alone is enough to demonstrate the legitimacy of SSE, why not just use TriviaQA accuracy (and also HellaSwag) as the evaluation metric? I am therefore doubtful of the significance of the proposed metrics.
- Questionable Conclusion: Following up on the comment on the correlation analysis, the paper tries to prove the usefulness of its metrics by showing correlation with an existing task on some metrics (e.g., TriviaQA accuracy), but also tries to disprove other evaluation metrics by showing that they correlate less nicely with the proposed metrics. For example, the paper concludes perplexity is not a robust evaluation metric because it does not have a very high correlation with SSE; the Cohere Command model is an outlier in their analysis, which supposedly highlights a weakness of TriviaQA.
- Visualization: The visualization in this paper is unfriendly to people with color vision deficiency, if not to everyone. I find it very hard to distinguish the differences in the sizes and colors in the figures representing different models.
Questions
- Can the authors provide some examples of how they perform the transformations to evaluate long-range sensitivity?
- How do you identify the neutral corpus?
- "We further explore why correct normalization is important by cross-referencing the frequency with which perplexity goes down rather than up, see Figure 14 in Appendix A.5." From the figure I see a nice negative correlation between "Percent PPL Drops" and TriviaQA accuracy. So how can this show that correct normalization is important in SSE?
- What data does the PPL metric use? Following this paper's logic, I think the first step this paper should take is to evaluate the correlation between, for example, normalized PPL and TriviaQA accuracy (and other realistic tasks such as MMLU, etc.).
- How well do the toxicity scores in this paper correlate with those from the Perspective API on instruction-tuned models? From Figure 6 (left), it looks like there are no Xs. The same question applies to the word order section and Figures 4 and 7. Without the results of instruction-tuned models, it is hard to see if the scores correlate better on vanilla models than on instruction-tuned models, and therefore hard to assess the significance of the metrics.
- It is still hard for me to understand how to use this paper in real-world applications. How adaptable are these proposed metrics? Would the methodology need alterations to assess LLMs in more dynamic, real-world scenarios?
- Given the terminological ambiguity around "self-supervised," can the authors elaborate on this terminology or propose an alternative name to prevent misconceptions?
Thank you for your time, Reviewer tXds!
Terminological Ambiguity
We'd be happy to describe our reason for this naming scheme: the SimCLR paper describes a self-supervised training procedure that requires a human to choose both the augmentations that the self-supervised learner (SSL) is trained to be invariant against and the loss function. The augmentations we describe in this evaluation framework are directly comparable to the transformations in SSL, and the SSL loss function is comparable to our measurement procedure (change in PPL, JSD of probability distributions, etc.). However, we are open to a discussion on changing the name of the evaluation and would welcome further feedback.
Metric Soundness
We chose knowledge probing, toxicity, and long-range dependency because we felt they are particularly relevant to today's literature. For example, one of the community's goals for LLM deployment is to make LLMs less toxic. Another example is long-range dependency, which is a mainstay question when evaluating the finite context length of transformers. Additionally, we found word order compelling, given that there is no dataset measuring it directly. To validate the soundness of our methods, we compared them with standard benchmarks and found that they correlate well with those benchmarks. We do think this is a difficult question, because many labeled datasets contain discrepancies themselves (e.g., there was a recent discussion evaluating MMLU: https://huggingface.co/blog/evaluating-mmlu-leaderboard). A study that could shed light on this issue in the future would be to ask domain experts to evaluate the model and see how well self-supervised methods align with human domain experts.
The idea is that the framework and the tests (i.e., Knowledge via Negations, Toxicity, etc.) can be used on many text corpora. For example, if a lawyer wants to know how well a model understands Bengali Constitutional Law, then they can take a textbook on the subject and apply these metrics to it as a first evaluation. Evaluating this narrow area would not be possible by looking only at MMLU performance on law.
Questionable Conclusion
We agree that the metrics are different and have different uses. We illustrate the usefulness of our metric by grounding it in TriviaQA, but it is independent of TriviaQA, as the Cohere Command example shows.
Visualization
Thank you for bringing this to our attention. We will update this in a future version of the paper.
Questions:
Can the authors provide some examples of how they perform the transformations to evaluate long-range sensitivity?
Thank you for asking for more details on the transformations used to evaluate long-range/context sensitivity. Suppose the original passage has three sentences, P = (s1, s2, s3), where s1 is the first sentence of the input passage; then the altered passage is P' = (r1, r2, s3), where r1 and r2 are random sentences from another passage in the corpus. We provide a concrete example in Appendix 8, Figure 18.
We then look at the probability distribution at each position of s3 for both P and P', and compare them using the Jensen–Shannon divergence. This determines how the representations of the last sentence change as different context is presented.
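For readers who prefer code, here is a hedged sketch of this measurement. It assumes a HuggingFace GPT-2 stands in for the evaluated model and that we compare the per-token predictive distributions over the last sentence; the helper names are ours, not the paper's.

```python
# Hypothetical sketch of the long-range sensitivity measurement: compare the
# model's per-token distributions over the final sentence when the preceding
# context is the true one vs. random sentences from another passage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.spatial.distance import jensenshannon

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def last_sentence_distributions(context: str, last_sentence: str) -> torch.Tensor:
    """Predictive distributions for every token of `last_sentence` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    last_ids = tokenizer(" " + last_sentence, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, last_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits[0]
    start = ctx_ids.shape[1]
    # logits[i] predicts token i+1, so these rows predict the last-sentence tokens.
    return torch.softmax(logits[start - 1 : ids.shape[1] - 1], dim=-1)

def long_range_sensitivity(true_ctx: str, random_ctx: str, last_sentence: str) -> float:
    p = last_sentence_distributions(true_ctx, last_sentence)
    q = last_sentence_distributions(random_ctx, last_sentence)
    jsd = [jensenshannon(pi.numpy(), qi.numpy()) ** 2 for pi, qi in zip(p, q)]
    return float(sum(jsd) / len(jsd))  # average JSD across the last sentence
```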
How do you identify the neutral corpus?
By a neutral corpus, we mean that almost none of the corpus contains facts. Thus, when inserting a "not", we are measuring the perplexity change that the model induces naturally from seeing this "not" in a given sentence. Since BookCorpus contains fiction, we suspect the particular domains of the fiction matter less, as fiction does not necessarily contain facts that the model should be "perplexed" by when a sentence is transformed into its counterfactual. We also discuss in Appendix A.4 why we believe the transformations need only be informative on average.
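As a concrete illustration of this probe, here is a simplified, assumption-laden sketch: it inserts "not" after the first copula it finds and compares perplexities before and after. The insertion rule, the GPT-2 model, and the ratio-based score are illustrative stand-ins rather than the paper's exact procedure (which also normalizes the scores).

```python
# Simplified sketch of the negation probe: negate a sentence with a naive rule
# and compare model perplexity before and after the transformation.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token negative log-likelihood
    return math.exp(loss.item())

def negate(sentence: str) -> str:
    """Naive rule: insert 'not' after the first copula ('is', 'are', 'was', 'were')."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in {"is", "are", "was", "were"}:
            return " ".join(words[: i + 1] + ["not"] + words[i + 1 :])
    return sentence  # leave the sentence unchanged if no copula is found

sentence = "The Eiffel Tower is located in Paris."
ratio = perplexity(negate(sentence)) / perplexity(sentence)
print(ratio)  # a ratio well above 1 suggests the negated fact surprises the model
```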
Percent PPL Drops
Thank you for catching this confusing statement about Figure 14. The intention was to reference the analysis in Section 4.2 discussing the need for proper normalization when using raw sensitivity scores based on perplexity changes. Unnormalized scores can follow a square-root type relationship with accuracy benchmarks like TriviaQA. The central evidence for normalization comes from the earlier analysis in 4.2, showing it helps account for overconfidence in certain tuned models.
What data do the metric PPL use? Following this paper’s logic, I think the first step this paper should do is to evaluate the correlation between, for example, normalized PPL and TriviaQA accuracy (and other realistic tasks such as MMLU etc).
We use the PPL metric on the same text over which the sensitivity score is measured. We use PPL because it is a good automatic metric for probing the knowledge in a model; however, it is known to have issues, such as a bias toward longer sequences. Thus, we attempt to add a new metric that helps in understanding a model's knowledge. Looking at the outliers between our metric and PPL, we see that PPL is a good proxy for measuring general knowledge. However, when measuring domain-specific knowledge such as medical terms, we see that the knowledge of a model like Galactica, which is otherwise a weaker general-purpose language model, cannot be captured by PPL. This is a core advantage of our proposed metric.
How well do toxicity scores by this paper correlate with those by Perspective API on instruction tuned models? From figure 6 left, it looks like there are no Xs. Same question for word order section and figure 4, 7. Without the results of instruction tuned models, it is hard to see if the scores correlate better on vanilla models than instruction tuned models. And therefore hard to assess the significance of the metrics.
There are very few instruction-tuned models that use different base models. Additionally, it is possible to get scores for the five different models, but it is hard to get a meaningful correlation out of them. As new models come out, we can evaluate them and add them to our suite.
How adaptable are these proposed metrics? Would the methodology need alterations to assess LLMs in more dynamic, real-world scenarios?
Thank you for the feedback requesting more clarity on real-world applications of our proposed self-supervised evaluation (SSE) methodology. You raise an excellent point that adaptability and testing in dynamic scenarios is important. Here are a few thoughts on how SSE could be applied:
- The transformations we introduce like negations, toxicity triggers, etc. could be adapted and tailored to test sensitivity on specific types of in-domain text data. For example, a company could construct domain-specific violations to test an LLM's robustness before deployment.
- The metrics could be applied dynamically during live system operation to monitor production model performance over time, catching if behavior deteriorates. New text data could be continually evaluated.
- The approach is model-based rather than relying on external benchmarks, allowing evaluation even in domains lacking standard labeled datasets.
- The methodology is light-weight computationally, allowing scaling to large text corpora for evaluation.
- New transformations could be introduced to test emerging capabilities like logical reasoning, domain knowledge, etc. The framework is extensible.
The paper presents a self-supervised evaluation approach for toxicity classifiers. The paper cites that due to training data leakage, toxicity classifiers may have an inflated reported performance. Using the proposed approach where inputs are modified using negation and other techniques, the robustness of content classifiers is evaluated.
Strengths
The paper has the following strengths.
First, it is well-written. The paper motivates the problem well, and the experiments are described clearly. The related-work descriptions are reasonable (although the paper missed some key citations, e.g., Gröndahl et al.).
Second, the motivation for building a domain-specific toxicity classifier is well-received.
Weaknesses
The key weakness of the paper is:
- The modifications it suggests are too simplistic (e.g., F-bombing). There are many known modifications that have tripped content classifiers before (e.g., All You Need is "Love": Evading Hate Speech Detection by Gröndahl et al.), as well as real-world examples of inputs tripping content classifiers (e.g., Are Chess Discussions Racist? An Adversarial Hate Speech Data Set by Sarkar and KhudaBukhsh). Instead of using obfuscation techniques well-grounded in the literature, the paper adopts simplistic techniques to modify inputs.
- The second weakness of the paper is that it relies on the Perspective API. Perspective API's toxicity scores are not reliable, as recent research indicates. Hence, an API that itself has calibration issues cannot be very useful for calibrating other systems.
Questions
My question is: how would the authors respond to the two weaknesses listed above?
Our main contribution is the SSE (Self-Supervised/Sensitivity Evaluation) framework. Using this framework, we design many different tests, such as knowledge probing via negations, toxicity in the presence of F-bombs, word order, context/long-range sensitivity, and tokenization error. We show that in these five cases the SSE framework gives us a meaningful measure. In each case, we show that the sensitivity scores can be grounded with labeled data, demonstrating that meaningful insights can be drawn from these scores. This validates the SSE framework as a complement to traditional evaluation, giving us more insight into a model's behavior.
Although these transformations are straightforward, we believe this is a strength of what we have proposed, as they can easily be applied to many corpora. Additionally, our toxicity measurement does not rely on an outside model or labelled data; we use the Perspective API only to validate it, and we observe a strong correlation with the Perspective API.
Our main contribution is the SSE (Self-Supervised/Sensitivity Evaluation) framework. The key advantage of this framework is that it can be applied to many datasets and is not dependent on external models or labelled data for evaluation. Using this framework, we design many different tests, such as knowledge probing via negations, toxicity in the presence of F-bombs, word order, context/long-range sensitivity, and tokenization error. We show that in these five cases the SSE framework gives us a meaningful measure by grounding it against existing labelled datasets. By showing correlation with these datasets, we show that the metrics are indeed insightful, and by analyzing outliers we show that they can be indicative of larger issues with the model. The SSE framework is meant as a complement to traditional evaluation that can be applied to any dataset, giving us more insights into a model's behavior.
We thank the reviewers for their time and effort.