PaperHub
Overall rating: 6.3/10 · Poster · 4 reviewers
Ratings: 5, 6, 7, 7 (min 5, max 7, std 0.8)
Average confidence: 3.8
COLM 2024

On Robustness-Accuracy Characterization of Language Models using Synthetic Datasets

Submitted: 2024-03-16 · Updated: 2024-08-26
TL;DR

A real-data-free evaluation framework for language models


Keywords
synthetic data; evaluation

Reviews and Discussion

Official Review

Rating: 5

This work proposes SynTextBench, a framework for evaluating the accuracy and robustness of LM sentence embeddings. It aims to generate steerable synthetic language datasets and proxy tasks, avoiding the risk of test-data leakage.

Reasons to Accept

  1. The proposed dataset is useful for evaluating the capabilities of LLMs and fills a gap in testing their text generation ability.
  2. The whole evaluation pipeline is reasonable and tests both accuracy and robustness.
  3. Extensive experiments have verified the effectiveness of the dataset for evaluation.

Reasons to Reject

  1. The motivation of this paper is weak, and it is not clear whether the studied topic (e.g., identifying words and understanding linguistic structures) is an important skill for evaluating LLMs. In fact, existing work has shown that LLMs can learn linguistic features of text well. Thus, the most popular LLM evaluation benchmarks mostly focus on reasoning and planning abilities, as these are more useful high-level capabilities for LLMs.
  2. It is not clear whether the synthetic dataset can faithfully reflect the capability of LLMs, as existing work [1] also notes differences between human and synthetic texts. This might lead to a misalignment between real-world human requirements and the synthetic test data.
  3. The evaluation results of existing LLMs should be reported in this paper, as a comparison across different LLMs would also reflect the value of the proposed testbed.

[1] Muñoz-Ortiz, A., Gómez-Rodríguez, C., and Vilares, D. Contrasting Linguistic Patterns in Human and LLM-Generated Text. arXiv preprint arXiv:2308.09067, 2023.

Author Response

We thank the reviewer for the time and effort in reviewing our paper. Our paper's main merits are laying a foundation for evaluating LMs with synthetic data and demonstrating how simple compositional synthetic tasks may inform real-data task performance. We are pleased the reviewer finds our dataset useful, pipeline reasonable, and experiments extensive. We are committed to addressing all questions and welcome further discussion if needed.

Answer to Q1: SynTextBench focuses on studying the sentence embeddings of LMs. As effective sentence embeddings are crucial for many NLP tasks [1-2], we indirectly measure other tasks through sentence-embedding quality. We further note that sentence classification is among the most important use cases for LLMs [ref1]. Our approach complements existing benchmarks by providing a controlled environment to assess accuracy and robustness, which might in turn be essential for reasoning and planning tasks. We will expand the motivation section to clarify the relationship between the LM capabilities SynTextBench tests and other high-level capabilities, and how they contribute to the overall evaluation of LMs.

Answer to Q2: The misalignment between human and LLM-generated texts (e.g., more aggressive emotions in human texts) in fact supports our setup, because we do NOT use LLMs to generate synthetic datasets. We also reveal a problem with LLM-generated texts in App. A.2: they do not allow accurate control over task difficulty. Instead, we utilize labeled lexicons, together with some preliminary linguistic structure, to construct the synthetic data, ensuring straightforward emotion expression and a controlled testbed that isolates specific linguistic capabilities without the confounding factors of real-world data. Moreover, despite potential pattern differences, the high correlation with real-life tasks suggests that a better understanding of synthetic sentences correlates well with better performance on real tasks.
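
For illustration only, here is a minimal, self-contained sketch of this kind of lexicon-driven construction. The toy word lists, the function name make_sentence, and the parameter p_neutral are hypothetical stand-ins, not the paper's actual implementation, which draws from a labeled lexicon such as SentiWordNet:

```python
import random

# Toy stand-ins for a labeled lexicon (hypothetical; SynTextBench uses SentiWordNet).
POSITIVE = ["love", "fantastic", "great", "delightful"]
NEGATIVE = ["awful", "terrible", "bland", "dreadful"]
NEUTRAL = ["table", "yesterday", "walk", "paper", "often"]

def make_sentence(label: int, length: int = 8, p_neutral: float = 0.5, rng=random) -> tuple[str, int]:
    """Build one synthetic sentence (a bag of words) for a binary sentiment label.

    p_neutral controls difficulty: more neutral tokens leave fewer
    sentiment-bearing cues for the model to latch on to.
    """
    pool = POSITIVE if label == 1 else NEGATIVE
    words = [rng.choice(NEUTRAL) if rng.random() < p_neutral else rng.choice(pool)
             for _ in range(length)]
    return " ".join(words), label

# A harder split: 90% of tokens in each sentence are neutral.
hard_set = [make_sentence(label=i % 2, p_neutral=0.9) for i in range(1000)]
```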

Answer to Q3: We appreciate the suggestion. We currently defer evaluation results of existing LLMs on human and synthetic texts to the appendix. We agree that including these results in the main paper will make the paper more complete and better reflect the testbed's value. We will update the manuscript accordingly.

[1] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[2] DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings

Official Review

Rating: 6

This paper proposes a new framework (SynTextBench) to evaluate LMs’ performance. SynTextBench is a theoretically grounded framework which generates synthetic data for LM evaluation. It does not rely on LMs’ generated data, so it can alleviate the risk of data leakage problems. SynTextBench provides a configurable lightweight testbed that can evaluate LMs’ performance under various difficulties. The proposed quantifiable metric for evaluating the robustness and accuracy of LMs is very effective. Experiments demonstrate the effectiveness of the proposed framework.

Reasons to Accept

  • The proposed framework is very effective and can avoid potential bias introduced by LMs’ generated data.
  • The experiments show promising results of the proposed method.
  • The idea is clear, and the method is reasonable.

Reasons to Reject

  • My primary concern with this paper is the interpretability of the proposed method.
    • Although the authors address this topic in Section 2.2, the explanation provided does not convincingly demonstrate that non-grammatical test sets can effectively evaluate LMs’ performance. While sentiment analysis tasks may not heavily depend on grammatical integrity, other NLP tasks could be impacted differently. The readability of synthetic data is compromised, thereby reducing the interpretability of the generated test set.
    • The paper does not clearly establish how the language model’s performance on this test set correlates with its overall language processing capabilities. It is questionable whether producing more discriminative sentence representations equates to improved LM performance. The relationship between these elements remains elusive and warrants further exploration.
  • Upon a detailed review of the experimental results, I observed a notable finding: for auto-regressive LMs (such as GPT, LLaMA, and OPT), the Pearson correlation coefficients between the real-data-free evaluation metric and actual data accuracy are significantly higher than for other LMs, as indicated in Tables 3 and 6. Numerous studies have suggested that training loss is an effective indicator of LMs' performance. This paper could benefit from a more thorough comparison between the proposed method and the LMs' loss metrics to bolster this claim.
  • The paper would improve if it provided more motivation and explanation, particularly in Section 2.3, where numerous equations are presented without adequate elucidation. This makes it challenging for readers to grasp the underlying concepts or the significance of these mathematical models in the context of the discussed methodology.

Missing reference:

  • Don't Make Your LLM an Evaluation Benchmark Cheater, Zhou et al.
  • Benchmarking Benchmark Leakage in Large Language Models, Xu et al.

Questions to the Authors

  1. In Section 2.3, why should we transform the sentence embeddings into an isotropic Gaussian distribution? Is it necessary for the evaluation?
  2. Is there any discussion about the effectiveness of the validation loss?
Author Response

We thank the reviewer for their time and positive comments. We appreciate the recognition of our idea's clarity, method's reasonableness, and framework's effectiveness. We will address all questions and welcome further discussion if needed.

  1. Ungrammatical sentences & separability: We motivate ungrammatical sentences by sentiment analysis in a food review - "love love fantastic!" should be recognized as positive despite its ungrammatical nature. Related studies also utilize ungrammatical sentences, e.g. [1] does sentiment classification from nonsense documents (Fig 1) and [2] uses Gaussian logistic regression to improve LLM reasoning (App. A), which manifest the value of ungrammatical language in learning/testing basic skills.
    SynTextBench focuses on the sentence embeddings of LMs, which are commonly evaluated via sentence classification (e.g., SentEval). As effective sentence embeddings are crucial for many NLP tasks [3-4], we indirectly measure other tasks through sentence-embedding quality. For specific applications, specialized synthetic tests are needed, and we hope this work serves as a foundation for such extensions.

  2. Sec 2.3: We will elaborate on each equation to provide context and motivation, ensuring the concepts and significance of our mathematical models are clear.

  3. References: Thank you for pointing these out! We will include them in our revised manuscript.

  4. Transforming sentence embeddings: Transforming sentence embeddings into an isotropic Gaussian distribution mitigates anisotropy and helps design an evaluation metric independent of the robustness parameter ε: we proved in App. A.8 that, when μ̃ lies within a degenerate subspace of the covariance matrix's eigenspace, the ε-robust Bayes optimal classifiers overlap for all ε (a minimal whitening sketch follows this list).

  5. Effectiveness of validation loss: We compared SynTextBench with validation loss on our synthetic sentences (Tab 1-3, 6, 8, row "Val loss"). While validation loss shows lower (i.e., worse) correlation than SynTextBench, it remains a strong baseline. Evaluating validation loss on pre-training data may be infeasible, since different LMs are often trained on different data (an illustrative comparison sketch also follows this list).
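
Sketch for point 4, as referenced above: one standard way to whiten embeddings toward an isotropic Gaussian (ZCA-style whitening; this illustrates the general idea and is not necessarily the exact transformation used in the paper):

```python
import numpy as np

def whiten(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Map sentence embeddings to approximately zero-mean, identity-covariance coordinates."""
    mu = embeddings.mean(axis=0)
    X = embeddings - mu
    cov = np.cov(X, rowvar=False)
    # Eigendecomposition of the (symmetric) covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA whitening: rotate, rescale each direction to unit variance, rotate back.
    # eps guards against (near-)degenerate eigenvalues.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X @ W

# embeddings: (n_sentences, d) array produced by a frozen LM.
```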
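
Sketch for point 5, also referenced above: the kind of correlation comparison between a real-data-free metric, a validation-loss baseline, and real-task accuracy. All numbers below are placeholders for illustration only, not results from the paper:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder per-model values for illustration only (not the paper's numbers).
syntextbench_score = np.array([0.62, 0.71, 0.55, 0.80])  # real-data-free metric
real_task_accuracy = np.array([0.78, 0.84, 0.70, 0.90])  # accuracy on a real benchmark
val_loss           = np.array([2.9, 2.4, 3.3, 2.1])      # validation-loss baseline

r_score, _ = pearsonr(syntextbench_score, real_task_accuracy)
r_loss, _ = pearsonr(-val_loss, real_task_accuracy)  # negate: lower loss should mean higher accuracy
print(f"SynTextBench vs. real accuracy: r={r_score:.2f}; val loss vs. real accuracy: r={r_loss:.2f}")
```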

[1] Does Pretraining for Summarization Require Knowledge Transfer?

[2] TART: A plug-and-play Transformer module for task-agnostic reasoning

[3] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[4] DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings

Comment

Thanks for your response. A few of my concerns have been addressed, and I will maintain my score.

Official Review

Rating: 7

LLMs have shown remarkable capabilities on multiple NLP tasks, but there have been consistent concerns about the evaluation strategies used to assess their actual capabilities. On that front, a major issue is test-set contamination, which can primarily be categorized into the following headings:

(a) the evaluation dataset might actually have been a part of the pretraining dataset for LMs

(b) data leakage through querying LMs via API calls.

This paper tries to address the issue of test-set contamination in the evaluation of LLMs. To that end, it proposes a new testbed for LM evaluation via synthetic data. The creation of this testbed is inspired by the skills infants use during language understanding/acquisition: identifying words and understanding linguistic structures. The resulting test set, called SynTextBench, is constructed without reliance on any LMs. The focus is on benchmarking LM sentence embeddings in terms of accuracy as well as robustness.
The final measure of an LM's abilities is the SynTextBench score, which is theoretically well grounded.

Reasons to Accept

  1. The paper provides an easy-to-construct mechanism for building test datasets to evaluate LM abilities.

  2. Comprehensive comparison of multiple models, including encoder-only, decoder-only, and encoder-decoder frameworks.

  3. Theoretically well-grounded approach to the robustness measure, using optimal classifiers and token edits.

Reasons to Reject

  1. The ungrammatical sentences used for evaluation hinder interpretability for humans. While working with the separability of the encoded sentences may theoretically improve the evaluation, the pragmatic use for NLU is limited, despite the high correlation coefficients.

  2. Most of the comparisons with different baselines are moved to the appendix. The verbosity of the text should be worked on, and some comparison tables with their associated conclusions should be moved to the main paper.

  3. Sentence separability measures only one aspect of LMs, which might not reflect their main goal. It is not clear why a sentence-separability measure would serve as a better evaluation proxy for language understanding.

  4. It would help if the authors could provide a worked-out example in Section 2.2 to elucidate the algorithm for sentence (technically just a set of words) construction.

Questions to the Authors

Abstract:
“pertrained” -> pre-trained

“comparing their performance on a set of our synthetic datasets with varying difficulty levels”: How is this difficulty decided? Why should increasing neutral tokens increase the difficulty of a task, given that the sentence is not well-formed and the model should ideally latch on to any of the positive/negative words?

Author Response

We thank the reviewer for their time and positive comments. We are pleased to see the reviewer finds our test dataset easy to construct, our comparisons comprehensive, and our SynTextBench-score theoretically well-grounded. We are committed to addressing all questions and welcome further discussion if needed.

Answer to Q1: We motivate ungrammatical sentences by sentiment analysis in food reviews. For instance, "love love fantastic!" in a food review should still be recognized as positive despite its non-grammatical nature. Related studies also utilize non-grammatical sentences, as seen in [1] for sentiment classification from nonsense documents (Figure 1) and in [2] for Gaussian logistic regression problems to improve LLM reasoning (Appendix A), which manifest the value of non-grammatical language in learning/testing basic skills.

Answer to Q2: We thank the reviewer for making the suggestion and we will revise accordingly!

Answer to Q3: SynTextBench focuses on studying the sentence embedding space of an LM, which is commonly evaluated via sentence classification (e.g., SentEval). As effective sentence embeddings are crucial for many NLP tasks [3-4], we indirectly measure other tasks through sentence-embedding quality. For specific applications, specialized synthetic tests are needed, and we hope this work serves as a foundation for such extensions.
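
For illustration, a generic SentEval-style linear probe on frozen sentence embeddings can be sketched as follows. This is a sketch under the assumption of a standard train/test split with logistic regression; the function name probe_accuracy is hypothetical and this is not necessarily the paper's exact protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(embeddings: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """Fit a linear classifier on frozen sentence embeddings and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# embeddings: (n_sentences, d) array from the LM under test; labels: (n_sentences,) class ids.
```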

Answer to Q4: Thanks for the suggestions. We will include a worked-out example in our paper for better understanding!

Answer to Q5: We will correct the typos!

  6. Answer to Q6: The reviewer's intuition on the idealized LM is correct - ideally LMs should latch on to any positive/negative words, and have uniformly high accuracies regardless of the neutral word percentage. However, this is not always true in practice, highlighting the need for our metric. Difficulty levels challenge the LM's ability to distinguish sentiments amidst neutral token noise. Higher proportions of neutral tokens increase ambiguity, making it harder for the model to rely solely on sentiment-bearing words. In SynTextBench, difficulty levels are adjusted by varying the mixing ratio p from 0 to 1.0.

[1] Does Pretraining for Summarization Require Knowledge Transfer?

[2] TART: A plug-and-play Transformer module for task-agnostic reasoning

[3] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[4] Diffcse: Difference-based contrastive learning for sentence embeddings

Comment

I acknowledge that I have read the rebuttals to all the reviews and will keep the score unchanged.

Official Review

Rating: 7

Summary: This manuscript introduces "SynTextBench," a new framework for generating synthetic datasets to evaluate the robustness and accuracy of sentence embeddings in large language models (LLMs). By leveraging sentiment lexicons to construct sentences with varying levels of difficulty, the method aims to provide a safer, more transparent, and controllable way of evaluating LLMs. The framework seeks to create various difficulty levels through the manipulation of word mixing ratios and assesses LLMs based on their performance on these synthetic datasets.

Reasons to Accept

Strengths:

  1. Innovative Evaluation Approach: "SynTextBench" introduces a new method for generating synthetic datasets, offering a novel pathway for testing LLMs under controlled conditions.
  2. Focus on Data Leakage: The method provides a solution to the data leakage issues in the evaluation process, helping to protect data privacy.
  3. Dual Assessment Metrics: The paper demonstrates the potential of the framework to assess both the robustness and accuracy of LLMs.

Reasons to Reject

Weaknesses:

  1. Limited Application Scope: The paper primarily considers evaluating sentence embeddings from LLMs, which, while important, is only a small part of the overall evaluation landscape of LLMs. Moreover, the current evaluation focuses mainly on sentence classification tasks and has not yet addressed other NLP tasks such as question answering or summarization, which may limit its broader application.
  2. Insufficient Coverage of Linguistic Phenomena: Although a complex sentence generation method is proposed, the framework still lacks effective handling of more complex linguistic phenomena such as sarcasm and humor. Whether this metric can be leveraged for other tasks (especially for a different class of tasks) needs to be demonstrated, in my opinion.
  3. Dependency on Sentiment Lexicons: The reliance on sentiment lexicons for generating synthetic datasets raises concerns about the framework's adaptability to different languages or dialects, potentially limiting its effectiveness in multicultural or multilingual contexts.
  4. Scalability Issues: The paper does not clearly address how the framework scales with increasingly large models or datasets, which is vital for its application in real-world scenarios where data and model sizes are continuously growing.

Questions to the Authors

Questions:

  1. What is the reasoning behind selecting sentiment polarity as the foundational task for creating the evaluation benchmark? Specifically, why is it believed that capturing sentiment polarity leads to more practical sentence embeddings, and why would failing to do so (especially in non-grammatical sentence structures) result in poorer performance in real-world downstream tasks?
  2. Have the authors considered applying this framework to other NLP tasks beyond sentence classification? If so, what are the potential directions for expansion?
  3. Given the framework's reliance on sentiment lexicons, how do the authors ensure its applicability across different languages and cultural backgrounds?
  4. Are there any specific considerations or adjustments for low-resource languages or less mainstream linguistic structures?

This paper proposes a promising new framework for assessing the accuracy and robustness of large language models. Although it primarily focuses on sentence classification tasks, its correlation with real task performance indicates practical applicability. However, to enhance its universality, it is recommended that the authors explore applying this assessment framework to a broader range of NLP tasks. Discussing and experimentally validating these potential applications within the paper could significantly enhance its impact and applicability.

Author Response

We thank the reviewer for the time and positive comments. We appreciate the recognition of our innovative approach and focus on data leakage. We will address all questions and welcome further discussion if needed.

Answer to Q1: We chose sentiment polarity as it tests the ability to identify words with polarity. Capturing sentiment polarity ensures the LM understands nuanced semantic information. Although SynTextBench can use any labeled lexicon, SentiWordNet is practical since most words in real tasks are included in it, e.g., 81% of CR words are in SentiWordNet. SynTextBench also allows controlling synthetic task difficulty by tuning polarity levels.
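
For illustration, word polarity can be looked up from SentiWordNet via NLTK's reader roughly as follows; the averaging over synsets in word_polarity is a simplification for this sketch, not the paper's actual procedure:

```python
import nltk
from nltk.corpus import sentiwordnet as swn

# SentiWordNet is distributed as an NLTK corpus and also requires WordNet.
nltk.download("sentiwordnet", quiet=True)
nltk.download("wordnet", quiet=True)

def word_polarity(word: str) -> float:
    """Average (positive - negative) score over all SentiWordNet synsets of the word."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0  # treat out-of-lexicon words as neutral
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

for w in ["fantastic", "terrible", "table"]:
    print(w, round(word_polarity(w), 3))
```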

Answer to Q2: SynTextBench focuses on studying the sentence embedding space of an LM, which is commonly evaluated via sentence classification (e.g., SentEval). As effective sentence embeddings are crucial for many NLP tasks [1-2], we indirectly measure other tasks through sentence-embedding quality. For specific applications, specialized synthetic tests are needed, and we hope this work serves as a foundation for such extensions.

Answer to Q3: SynTextBench supports any labeled lexicon, ensuring applicability across languages and cultural backgrounds, provided labeled lexicons exist (e.g., a Spanish lexicon [3]). If they do not, synthetic sentences can be translated from popular languages such as English. To assess complex linguistic phenomena like sarcasm and humor, SynTextBench can be extended by adding templates that wrap the synthetic sentences, such as distinguishing between "'AAA' is positive" and "No way, you believe 'AAA' is positive?" This extension is future work.

Answer to Q4: The conventional approach for evaluating LMs involves various benchmarks in different languages, and SynTextBench eases reliance on popular languages with labeled lexicons, such as English, Spanish, and Mandarin. For low-resource languages, future strategies may involve creating synthetic datasets using transfer learning from high-resource languages and augmenting data through paraphrasing/translation.

Scalability clarification: SynTextBench only requires inferences on the synthetic samples and does not rely on real datasets. Our experiments have shown that 4k samples are sufficient.

[1] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

[2] DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings

[3] CSL: a combined Spanish Lexicon-resource for polarity classification and sentiment analysis

Final Decision

Pros:

  • The paper proposes a novel method to evaluate LLMs, especially to mitigate test set leakage in modern LLMs trained on web scale datasets.
  • They provide clear explanation of the method and run experiments on various models to show efficacy of the approach.
  • The paper assesses both the accuracy and robustness of LLMs using dual metrics.

Cons:

  • The proposed approach may have limited applicability due to its focus on evaluating sentence embeddings.
  • Reporting evaluation results of existing LLMs would significantly strengthen the paper.