Active Evaluation Acquisition for Efficient LLM Benchmarking
Abstract
Reviews and Discussion
Active Evaluation Acquisition (AEA) is a method that selects a subset of samples from various benchmarks to efficiently evaluate a model's performance on those benchmarks. The authors test various confidence- and density-based metrics for selecting subsets of benchmark data, and also create an RL-based dynamic selection policy. Given the currently selected subset and a candidate data point, the policy is used to estimate the model's performance on that point, which is then sampled multiple times to determine the confidence of the estimate. They test their method on popular datasets (including AlpacaEval, MMLU, and Chatbot Arena) with popular models (including GPT-4, LLaMA, and Mistral).
Strengths
The paper is well written and well structured. The selected baselines are appropriate and quite thorough, including metrics ranging from information theory to classic clustering-based approaches. The authors also show careful thought in the methodology by addressing the cold-start problem. The appendix outlines helpful information as well, such as Algorithm 2. Finally, the small detail of including standard deviations for each of their results showcases the stability of their algorithm.
Weaknesses
This is an interesting idea and an innovative approach, but I am having some trouble understanding why choosing samples with less confidence in evaluation will result in a rigorous enough evaluation. The baselines have a clear objective: choose representative samples. Your paper can be stronger if you (1) clarify the objective of your RL-based method, and (2) include more analysis on why RL achieves near 0% absolute error on AlpacaEval. Also, only AlpacaEval achieves such results - the other datasets have an error of ~2%. Including a comparative analysis on the performance of AlpacaEval versus the other datasets can help readers understand the discrepancy in performance. Finally (this is more of a nitpicky thing), consider calling your RL-based method RL-AEA to make clear your contribution. Or at least append "RL" with "(ours)" in Figures 1-2 and Tables 2-3, E.1.
Questions
I have two questions:
- Are there any recent "Active Testing" works with language models? Kossen et al is from 2021, and is tested on Fashion-MNIST and CIFAR-100.
- Could you motivate the use of an RL agent to estimate the value of a data point? Why are you using an RL agent to calculate the model uncertainty? While the formulation makes sense, a few more baselines would strengthen the motivation. Baselines can include: (1) self-consistency (sample the model multiple times for a response, if the response variety is large, classify the data sample as "not confident") or (2) model feedback (use the distribution over the tokens or perplexity to determine confidence).
Thank you for the review and insightful feedback. We appreciate the recognition of the well-written and structured nature of our paper, as well as the appropriate selection of baselines and the effort put into addressing the cold start problem. We will address the weaknesses and questions raised in this rebuttal.
Let us clarify the core problem and approach upfront:
The primary goal of our work is to improve the efficiency and reduce the substantial computational cost associated with evaluating LLMs on comprehensive benchmarks comprising thousands or millions of prompts.
Our approach tackles this challenge through two key components: 1) Subset selection, where we actively select the most informative subset of prompts for evaluation, and 2) Performance prediction, where we leverage dependencies across prompts to predict the LLM's performance on the remaining, unselected prompts without running actual inference on those prompts.
To give a concrete example, let's say a benchmark has N prompts. The subset selection policies (including random, clustering-based, uncertainty-based, and RL-based policies) will select K prompts based on their corresponding strategies. Then we run LLM inference on these K prompts to acquire the actual evaluation scores. Afterwards, the evaluation scores on the remaining N-K prompts are predicted by the neural process model.
By combining these two components, our method aims to provide reliable estimates of an LLM's overall benchmark performance while minimizing the number of prompts that require expensive LLM inference. This enables significant computational savings and reduces the associated costs.
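To make this workflow concrete, here is a minimal sketch in Python (our own illustration, not the implementation from the paper; `llm_eval_fn`, `np_model`, and its `uncertainty`/`predict` interface are hypothetical stand-ins for the expensive LLM scoring call and the trained neural process, and the uncertainty-greedy selection shown is just one possible acquisition policy):

```python
import numpy as np

def estimate_benchmark_score(llm_eval_fn, prompts, np_model, budget):
    """Estimate an LLM's benchmark score while running real inference on only
    `budget` prompts; the rest are filled in by the performance predictor.

    llm_eval_fn(prompt) -> float        # expensive: LLM inference + scoring
    np_model.uncertainty(observed)      # -> (N,) predictive uncertainty array
    np_model.predict(observed)          # -> {prompt_idx: predicted score}
    """
    observed = {}  # prompt index -> acquired (real) evaluation score
    for _ in range(budget):
        # Acquisition: e.g. pick the prompt the predictor is least certain about.
        unc = np_model.uncertainty(observed)
        unc[list(observed)] = -np.inf          # never re-select acquired prompts
        i = int(np.argmax(unc))
        observed[i] = llm_eval_fn(prompts[i])  # run the LLM only on this prompt

    # Prediction: estimate the scores of the unselected prompts without inference.
    predicted = np_model.predict(observed)
    scores = [observed[n] if n in observed else predicted[n]
              for n in range(len(prompts))]
    return float(np.mean(scores))
```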
Here are detailed responses to the questions you raised:
I am having some trouble understanding why choosing samples with less confidence in evaluation will result in a rigorous enough evaluation
The high uncertainty of the prediction model means the performance prediction on these prompts is not accurate. By selecting these prompts and acquiring their actual evaluation scores, the overall estimation error will be lower. This is the same principle as in the active learning literature, where examples that are uncertain under the current model are selected to further train it.
clarify the objective of your RL-based method
The objective of the RL-based policy is the same as other selection policies, that is, to select informative examples. The RL agent does so by maximizing the accumulated reward, which is defined as how accurately we can recover the benchmark performance using the selected prompts (see line 242). We will clarify the RL objective in the camera-ready version.
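As a simplified illustration of this reward signal (our own sketch with hypothetical names; the actual reward in the paper may include additional shaping or intermediate terms):

```python
def episode_reward(acquired_scores, predicted_scores, true_benchmark_score):
    """Reward the policy for how accurately the benchmark performance is
    recovered from the acquired scores plus the predicted scores."""
    all_scores = list(acquired_scores.values()) + list(predicted_scores.values())
    estimated_score = sum(all_scores) / len(all_scores)
    return -abs(estimated_score - true_benchmark_score)  # higher = more accurate
```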
include more analysis on why RL achieves near 0% absolute error on AlpacaEval. Also, only AlpacaEval achieves such results - the other datasets have an error of ~2%. Including a comparative analysis on the performance of AlpacaEval versus the other datasets can help readers understand the discrepancy in performance.
One potential reason for the exceptional performance on AlpacaEval could be the relatively smaller size of the benchmark compared to others. With fewer prompts, the dependencies across prompts may be easier to capture, leading to more accurate predictions.
Additionally, the scores on AlpacaEval 2.0 are computed by a logistic regression model that predicts the win rate from a GPT4 evaluator. These scores are essentially smoother than the typical accuracy scores in other leaderboards.
We will conduct a detailed analysis to investigate the factors that contribute to the varying performance across different benchmarks, such as the benchmark size, prompt diversity, and the complexity of the underlying tasks. This analysis will provide valuable insights and help us better understand the strengths and limitations of our approach in different evaluation contexts.
Could you motivate the use of an RL agent to estimate the value of a data point? Why are you using an RL agent to calculate the model uncertainty? While the formulation makes sense, a few more baselines would strengthen the motivation. Baselines can include: (1) self-consistency (sample the model multiple times for a response, if the response variety is large, classify the data sample as "not confident") or (2) model feedback (use the distribution over the tokens or perplexity to determine confidence).
The RL agent is used to select K prompts out of the total N prompts to acquire their actual evaluation scores. The evaluation scores on the remaining N-K prompts will be predicted by the neural process model. For the RL-based policy, uncertainty on the unselected prompts is implicitly handled by the policy. That is, the RL policy implicitly estimates the uncertainty of the available prompts and chooses the ones that maximize the long-term accumulated reward. This is in contrast to a myopic greedy policy that always selects the most uncertain one.
Thank you again for the review and feedback. We hope that with the clarifications and improvements outlined above, you would consider increasing the rating for our paper.
Thank you for the clarifications in the first comment - all those are resolved on my end.
Regarding your second comment:
- "AlpacaEval could be the relatively smaller size", "scores on AlpacaEval 2.0 are computed by a logistic regression" - understood, resolved.
- "We will conduct a detailed analysis to investigate" - could you please show some analysis for the review? It can be preliminary, but it will be hard to be convinced without some analysis.
- For my question "Why are you using an RL agent to calculate the model uncertainty?", what I meant was: why are you using an RL agent versus other common techniques to measure uncertainty? With your paper and above clarifications, I understand that using an RL agent saves inference computation and is cheaper. My main doubt is: does it perform better than inference-dependent methods? Hence, I suggested two baselines (self-consistency and model feedback) that can help solidify your motivation for using an RL agent.
- If you could also clarify whether there are any recent "Active Testing" works with language models (Question 1 in my review), that would be helpful too.
Related active testing works
Regarding your question about recent "Active Testing" works with language models, we have conducted a thorough literature review and found a few relevant works:
- "Active Testing of Large Language Model via Multi-Stage Sampling" (Huang et al., 2024) propose AcTracer, a multi-stage sampling technique. Given unlabeled test examples, AcTracer clusters them based on LLM internal embeddings, selects a cluster based on its variance estimation, and then selects a test example within that cluster based on the LLM's generation confidence. The sampled examples are labeled to estimate the LLM's performance.
- “Label-Efficient Model Selection for Text Generation” (Ashury-Tahan et al., 2024) propose DiffUse to select preference data for obtaining oracle preference labels. They first obtain the embeddings of each model's output and take their differences; the embedding differences are then clustered, and one representative data point is selected from each cluster.
- “Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective” (Zhuang et al., 2024) propose an adaptive testing framework that dynamically adjusts the difficulty of the test questions based on the LLM’s performance, enabling more accurate estimation of the model’s abilities using fewer questions. While conceptually similar to our AEA approach, their method produces a single scalar ability score for each model, which is hard to interpret. In contrast, our AEA produces comprehensive evaluation scores for all prompts, as in traditional benchmarks.
Beyond the “Active Testing” area, there are several other works exploring “active selection” ideas:
- “Active Prompting with Chain-of-Thought for Large Language Models” (Diao et al., 2024) propose to actively select in-context examples with CoT reasoning to adapt LLMs to different tasks.
- “Deep Bayesian Active Learning for Preference Modeling in Large Language Models” (Melo et al., 2024) utilize Bayesian active learning principles to select preference training data. They advocate for selecting data points with high epistemic uncertainty according to the preference model while simultaneously maximizing the entropy of the selected prompts in the feature space.
- “LESS: Selecting Influential Data for Targeted Instruction Tuning” (Xia et al., 2024) utilizes gradient information to select instruction tuning data that can help a target task.
We will incorporate these relevant works into our related work section and discuss how our AEA approach differs in objective and methodology.
I appreciate the extent of details in these additions. These were great experiments, and all my concerns are addressed - I've raised my score. Good luck!
Uncertainty estimation baselines
Regarding your question about comparison with typical LLM uncertainty estimation approaches: you raise a good point that comparing with these inference-dependent uncertainty estimates can further strengthen our paper.
As a preliminary comparison, we compared with two baselines: 1) directly using the generation log-likelihood as the uncertainty estimate, and 2) semantic entropy (Kuhn et al., 2023), where multiple generations are clustered based on a pretrained NLI model and the entropy is then calculated over the clusters to incorporate their semantic meanings.
Table R.4 presents the results of estimating the performance of three models on the AlpacaEval benchmark. For each model, we use either our RL policy or these uncertainty estimates to acquire the evaluation scores for a subset of prompts. The evaluation scores on the remaining prompts are estimated by the same neural process model. From the results, we can see that our RL-based acquisition policy consistently performs better than these two uncertainty estimates.
Table R.4: Comparison of our RL-based acquisition policy and uncertainty-based sampling (using generation perplexity and semantic entropy) in estimating AlpacaEval benchmark performance. We compare the absolute error with selection budgets of 10, 20, 50, and 100.
| Model | Method | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | Perplexity | 0.0162 | 0.0156 | 0.0103 | 0.0088 |
| Mistral-7B-Instruct-v0.2 | Semantic Uncertainty | 0.0122 | 0.0175 | 0.0126 | 0.0150 |
| Mistral-7B-Instruct-v0.2 | RL (ours) | 0.0012 | 0.0008 | 0.0006 | 0.0006 |
| Mixtral-8x7B-Instruct-v0.1 | Perplexity | 0.0025 | 0.0053 | 0.0058 | 0.0076 |
| Mixtral-8x7B-Instruct-v0.1 | Semantic Uncertainty | 0.0095 | 0.0145 | 0.0132 | 0.0202 |
| Mixtral-8x7B-Instruct-v0.1 | RL (ours) | 0.0019 | 0.0003 | 0.0003 | 0.0002 |
| gemma-7b-it | Perplexity | 0.0090 | 0.0190 | 0.0136 | 0.0147 |
| gemma-7b-it | Semantic Uncertainty | 0.0148 | 0.0139 | 0.0144 | 0.0169 |
| gemma-7b-it | RL (ours) | 0.0013 | 0.0007 | 0.0009 | 0.0004 |
Our hypothesis for why the RL approach outperforms these inference-dependent methods is that the RL policy is trained jointly with the neural process model to align its uncertainty estimates. This allows it to acquire prompts where the neural process predictions are most uncertain or inaccurate. In contrast, while the inference-dependent uncertainties may capture the model's own uncertainty well, they can be misaligned with the uncertainties of the neural process model used for prediction, resulting in overall higher benchmark estimation errors.
Furthermore, we want to emphasize that these inference-dependent uncertainty methods are not practical to deploy in reality, as they require running inference on all prompts and sometimes generating multiple responses to estimate uncertainty. This essentially increases the overall evaluation cost, defeating the purpose of efficient benchmarking.
We will add these comparisons to the appendix in the camera-ready revision.
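For reference, here is a rough sketch of how these two baseline uncertainty estimates could be computed (our own simplified illustration; `entails` is a hypothetical stand-in for a pretrained NLI entailment check, and this only captures the spirit of semantic entropy from Kuhn et al., 2023, not their exact formulation):

```python
import math

def perplexity_uncertainty(token_logprobs):
    """Length-normalized negative log-likelihood of a single generation."""
    return -sum(token_logprobs) / max(len(token_logprobs), 1)

def semantic_entropy(responses, entails):
    """Cluster sampled responses by bidirectional entailment, then take the
    entropy of the resulting cluster distribution."""
    clusters = []  # each cluster: list of semantically equivalent responses
    for r in responses:
        for c in clusters:
            if entails(r, c[0]) and entails(c[0], r):
                c.append(r)
                break
        else:
            clusters.append([r])  # no existing cluster matched
    probs = [len(c) / len(responses) for c in clusters]
    return -sum(p * math.log(p) for p in probs)
```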
To further investigate the impact of metric types, we analyzed the prediction errors for different metric types in Tables R.3 and R.4. The prediction errors for binary evaluation scores are significantly higher than those for real-valued scores like BLEU_4 and F1_score, aligning with our hypothesis.
Table R.3: Prediction error for different metric types.
| metric_type | prediction_error |
|---|---|
| binary | 0.105535 |
| real | 0.0333905 |
Table R.4: Prediction error for different metrics.
| metric_name | prediction_error |
|---|---|
| bleu_4 | 0.0302389 |
| exact_match | 0.0776866 |
| f1_score | 0.0386433 |
| final_number_exact_match | 0.113437 |
| math_equiv_chain_of_thought | 0.136722 |
| quasi_exact_match | 0.0956804 |
This preliminary analysis provides valuable insights into the factors that can influence the accuracy of predicting evaluation scores. We plan to conduct a more comprehensive investigation across multiple benchmarks and develop strategies to improve prediction accuracy, especially for challenging metric types and datasets with high variability or informativeness.
Thank you for the additional comments and questions. We appreciate the opportunity to further clarify our work and address your remaining concerns.
Preliminary analysis on performance prediction accuracy
To investigate potential factors that can impact the accuracy of predicting evaluation scores, we conducted a preliminary analysis on the HELM-Lite benchmark due to its diverse set of datasets covering different types of tasks and metrics. To eliminate any potential impact from the prompt selection policy, this analysis was performed using 50 randomly selected prompts as the conditioning set to predict the evaluation scores.
It's worth noting that prompts that are inherently difficult for language models, resulting in consistently low scores, do not necessarily translate to high prediction error for the evaluation scores. For example, if no model can solve a particular prompt, its evaluation score will always be zero, making it relatively easy to predict.
For a dataset with $N$ prompts evaluated on $M$ models from the training set, we denote the evaluation score of model $m$ on prompt $n$ as $y_{mn}$. We consider the following factors (a small sketch computing them from the score matrix follows the list):
- Metric Types: Our hypothesis is that discrete metrics, such as exact match accuracy, are harder to predict compared to smooth continuous metrics like BLEU.
- Prompt Diversity: A dataset with diverse prompts is potentially harder for the LLM. We estimate the diversity using the pairwise similarity of the prompt embeddings.
- Task Difficulty: We estimate the difficulty of a task as 1 minus the average evaluation score across all models, i.e., $1 - \frac{1}{MN}\sum_{m,n} y_{mn}$. A higher value indicates a more difficult task on average.
- Score Diversity: A dataset where the score distribution has a high variance can potentially lead to higher prediction error. We calculate score diversity as the variance over all scores, $\mathrm{Var}_{m,n}(y_{mn})$.
- Task Informativeness: We estimate the informativeness of each prompt as the variance of its evaluation scores across all models; the task informativeness is then averaged over prompts, i.e., $\frac{1}{N}\sum_{n}\mathrm{Var}_{m}(y_{mn})$.
- Evaluation Variability: Calculating the variance of the per-prompt mean scores, $\mathrm{Var}_{n}\big(\frac{1}{M}\sum_{m} y_{mn}\big)$, is another way to quantify the variability or diversity in the task.
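A small sketch of how these factors can be computed from the training score matrix (our own illustration; `Y` and `E` are assumed inputs, and the exact normalization in our analysis may differ):

```python
import numpy as np

def dataset_factors(Y, E):
    """Y: (M, N) evaluation scores for M training models on N prompts.
       E: (N, d) prompt embeddings, assumed L2-normalized."""
    sims = E @ E.T                                    # pairwise cosine similarities
    off_diag = sims[~np.eye(len(E), dtype=bool)]
    return {
        "prompt_diversity": float(1.0 - off_diag.mean()),    # low similarity = diverse
        "task_difficulty": float(1.0 - Y.mean()),             # 1 - average score
        "score_diversity": float(Y.var()),                    # variance of all scores
        "task_informativeness": float(Y.var(axis=0).mean()),  # per-prompt variance over models, averaged
        "eval_variability": float(Y.mean(axis=0).var()),      # variance of per-prompt mean scores
    }

# Correlation with per-dataset prediction error (as in Table R.2), e.g.:
# scipy.stats.spearmanr(factor_values, prediction_errors)
```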
Please see Table R.1 for detailed results on 28 subsets from HELM-Lite. In Table R.2, we calculate the correlation between each factor and the prediction error.
Table R.2: Correlation between dataset characteristics and prediction error.
| factor | spearman | pearson |
|---|---|---|
| metric_type | 0.78296 | 0.759027 |
| prompt_diversity | -0.722215 | -0.718767 |
| score_diversity | 0.215654 | 0.366019 |
| task_informativeness | 0.719759 | 0.780538 |
| eval_variability | 0.88451 | 0.841635 |
| task_difficulty | -0.195402 | -0.138457 |
We can observe that:
- Metric Type has a high correlation with prediction error, verifying our hypothesis. We further analyzed the differences across metric types in Table R.3 and Table R.4.
- Prompt Diversity has a high negative correlation with the prediction error. Although counterintuitive at first, this aligns with the observation that a harder task for the LLM does not necessarily mean harder evaluation-score prediction. This is further supported by the negative correlation between Task Difficulty and prediction error.
- Score Diversity has only a moderate correlation with prediction error.
- Task Informativeness has a high correlation with prediction error, as expected, since prompts with high variance in evaluation scores are inherently harder to predict accurately.
- Evaluation Variability is actually an estimation of the prediction error on test set using the scores from training set, so its high correlation with prediction error is unsurprising.
The paper proposes investigating strategies for the efficient evaluation of LLMs. Their approach models dependencies across test examples using Neural stochastic processes and then uses a novel RL-based sampling policy that leverages the captured dependencies between test examples.
Strengths
- The Neural process model is novel and interesting in this context;
- The application of RL is also novel in the context of efficient evaluation of LLMs. The idea seems promising;
- The authors are able to get strong empirical results.
Weaknesses
- Section 2.2 can be greatly improved; it is not even clear how fitting is possible without observing labels Y^u. The authors cite VAE, but in VAEs the outputs are known, which makes the decoder training possible. This suggests that some assumptions about the model are not included (or clear) in the text. A more detailed explanation in the text would be appreciated (possibly including an algorithm box for model training);
- I am not sure including Chatbot Arena makes sense here because it is a pairwise comparison dataset and will depend on other models you're comparing against. Your NP model does not seem to accommodate that.
- The paper does not comment on IRT adaptive testing on related work (see https://arxiv.org/pdf/2404.00712, for example). This class of methods is directly related to the paper and could be considered a competitor to the RL approach. It is important to mention why you do not compare against it;
- The paper does not comment on limitations. One major limitation is the need to fit a neural net for every LLM we want to make inferences about. Moreover, as noted by tinyBenchmarks, adaptive testing can reduce the scope of applications as it can make efficient evaluation more costly;
Minor:
- typo "Comparson";
Questions
- Does your model have any components trained using a training set? From the experiments section, it seems so, but this is not clear from the methods section.
- In your equation 3 RHS, shouldn't we take the expectation wrt the unobserved portion of Y (given that we do not observe it)?
- Also from eq 3, it is not clear what are the assumptions on p(Y^u|...) and how it can be modeled without observing Y^u;
- As far as I understand, you need to refit your Neural Process model (or at least the posterior of Z_m) for every LLM you want to evaluate. How expensive is this step? One of the main applications of efficient evaluation of LLMs is quick evaluation during pre-training and it seems that your approach would not be friendly in this case;
- Why do you compare against the Cluster-IRT approach and not IRT++ (from tinyBenchmarks)? IRT++ seems a better method;
- It seems that some curves are missing in Fig 1, for example. Why? If they are not included on purpose, the legend should not include them.
As far as I understand, you need to refit your Neural Process model (or at least the posterior of Z_m) for every LLM m you want to evaluate. How expensive is this step? One of the main applications of efficient evaluation of LLMs is quick evaluation during pre-training and it seems that your approach would not be friendly in this case;
We do not need to refit the neural process model for a new test model. As long as we acquire a subset of evaluation scores following the subset selection policy, the neural process can predict the unobserved scores following p(Y_m^u \mid Y_m^o, X).
Why do you compare against the Cluster-IRT approach and not IRT++ (from tinyBenchmarks)? IRT++ seems a better method;
For notation clarity, we used the name Clustering-IRT, but we indeed used the strongest model recommended by tinyBenchmarks, which is IRT++. We will clarify this in the camera-ready version.
It seems that some curves are missing in Fig 1, for example. Why? If they are not included on purpose, the legend should not include them.
The legend is shared by all 5 plots to avoid cluttering, but for some benchmarks, such as HELM-Lite and OpenLLM, where the number of total prompts is large, Combinatorial Optimization and Information Gain policies are too expensive to finish in a reasonable time frame. Therefore, we left them out for those benchmarks. We will clarify this in the figure caption or legend.
We hope that by addressing these weaknesses and questions, we have strengthened the clarity, rigor, and positioning of our work. Please let us know if you have any further comments or suggestions.
Regarding your specific points:
Section 2.2 can be greatly improved; it is not even clear how fitting is possible without observing labels Y^u. The authors cite VAE, but in VAEs the outputs are known, which makes the decoder training possible. This suggests that some assumptions about the model are not included (or clear) in the text. A more detailed explanation in the text would be appreciated (possibly including an algorithm box for model training);
During training, the evaluation scores are available for the models in the train split. We will include a training algorithm, as described above, in the camera-ready version to clarify this confusion.
I am not sure including Chatbot Arena makes sense here because it is a pairwise comparison dataset and will depend on other models you're comparing against. Your NP model does not seem to accommodate that.
It’s conceptually similar to the single-model case: instead of evaluation scores for a single model, we now have evaluation scores for a pair of models. For Chatbot Arena, the train split contains a set of model pairs and their win-rate scores on each prompt.
The paper does not comment on IRT adaptive testing on related work (see https://arxiv.org/pdf/2404.00712, for example). This class of methods is directly related to the paper and could be considered a competitor to the RL approach. It is important to mention why you do not compare against it;
You're absolutely right. We will include a discussion of IRT adaptive testing methods in the related work section and provide a comparison to our RL-based approach, highlighting the potential advantages and limitations of both approaches.
The paper does not comment on limitations. One major limitation is the need to fit a neural net for every LLM we want to make inferences about. Moreover, as noted by tinyBenchmarks, adaptive testing can reduce the scope of applications as it can make efficient evaluation more costly;
We appreciate you highlighting these potential limitations. We will include a dedicated section in the paper to discuss them, including the potential increase in evaluation cost for certain applications, as noted by tinyBenchmarks. We will also discuss the trade-offs and scenarios where the overhead of our approach could outweigh the benefits.
However, we want to emphasize that the bottleneck for LLM evaluation is on LLM inference, as it typically requires high-end hardware or calls to expensive APIs. Compared to the LLM itself, our neural process model and the subset selection policy are much smaller in size and computation (typically 2-3 linear layers in our experiments). Therefore, the overhead introduced by our method is relatively small compared to the computational savings achieved by reducing the number of prompts that require LLM inference.
While our approach does involve additional components and training steps, the primary computational benefit stems from the significant reduction in the number of prompts that require running the resource-intensive LLM inference. In our experiments, we observed that our method could achieve accurate performance estimation while acquiring evaluation scores for only a small subset (e.g., <1% for MMLU and OpenLLM, 1.5% for HELM-Lite) of the total prompts.
Does your model have any components trained using a training set? From the experiments section, it seems so, but this is not clear from the methods section.
Yes, our method does have components trained on a training set. As described above, the neural process model and the RL-based acquisition policy are trained on the train split of the leaderboard data.
In your equation 3 RHS, shouldn't we take the expectation wrt the unobserved portion of Y (given that we do not observe it)?
Also from eq 3, it is not clear what are the assumptions on p(Y^u|...) and how it can be modeled without observing Y^u;
The evaluation scores are assumed to be available during training for the models in the train split. They are predicted during testing for the models in the test split. We will clarify the assumptions and provide more detailed explanations in the revised manuscript.
Thank you for the review and insightful feedback. We appreciate the recognition of the novelty and potential of our approach, as well as the strong empirical results. We will address the weaknesses and questions raised in this rebuttal.
Let us clarify the approach and training procedure upfront:
Our approach consists of two key components: 1) Subset selection policy, where we actively select the most informative subset of prompts for evaluation, and 2) Performance prediction, where we leverage dependencies across prompts to predict the LLM's performance on the remaining, unselected prompts without running actual LLM inference on those prompts.
The training procedure first divides the leaderboard evaluation scores into train and test splits. The train split contains a set of models and their corresponding evaluation scores on the benchmark prompts. Specifically, for each model m in the train split, we have access to the full set of evaluation scores Y_m = {y_{mn}}, where n indexes over all N prompts in the benchmark. The test split contains another, non-overlapping set of models and their evaluation scores.
During training, we utilize this train split to learn the neural process model and the subset selection policies:
For the neural process model:
- For each model m, we randomly sample two non-overlapping subsets from Y_m, treating one as the observed set Y_m^o and the other as the unobserved set Y_m^u.
- The neural process is trained to optimize the ELBO objective, learning to predict the unobserved scores Y_m^u based on the observed scores Y_m^o, i.e., p(Y_m^u \mid Y_m^o, X). The ELBO is feasible to compute since the train split has all scores available.
- After training, the neural process model can predict the unobserved scores given an arbitrary subset of observed scores. It can generalize even to unseen models. Therefore, it does not require retraining on the test split.
For the subset selection policies:
- The random policy does not have trainable parameters, as prompts are selected at random.
- The Clustering-Embed policy does not have trainable parameters either. We simply use the prompt embeddings to group the prompts.
- The Clustering-Score policy directly uses the evaluation scores of each prompt from the train split as its feature representation.
- The Clustering-IRT policy trains an IRT model on the train split; we then cluster the prompts based on the prompt embeddings learned by the IRT model.
- The Combinatorial Optimization, Uncertainty Sampling, and Information Gain policies do not have trainable parameters either. They rely on the neural process model to predict the evaluation scores and their uncertainties.
- The RL-based policy trains a selection policy on the train split by interactively selecting which prompt's evaluation score to acquire.
During testing, the subset selection policy selects a subset of prompts whose actual evaluation scores are acquired and used directly, while the evaluation scores for the remaining prompts are predicted by the neural process model. Since we have the gold evaluation scores for all prompts, we can compute the metrics by comparing the predicted scores to the gold scores.
When the system is actually deployed, for a new model to be evaluated, the subset selection policy will select a subset of prompts, and we run the LLM inference on these prompts to get the real evaluation scores. The evaluation scores on the remaining prompts will be predicted by the neural process model.
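To summarize the procedure, here is a minimal sketch of how a training episode for the neural process can be constructed (our own illustration; the encoder/decoder architecture and the ELBO optimization itself are omitted):

```python
import numpy as np

def make_training_episode(Y_m, obs_frac=0.3, rng=None):
    """Split one training model's scores Y_m (length-N array) into an observed
    context set and an unobserved target set for the ELBO objective."""
    rng = rng or np.random.default_rng()
    N = len(Y_m)
    perm = rng.permutation(N)
    n_obs = max(1, int(obs_frac * N))
    obs_idx, unobs_idx = perm[:n_obs], perm[n_obs:]
    observed = {int(i): Y_m[i] for i in obs_idx}
    unobserved = {int(i): Y_m[i] for i in unobs_idx}
    return observed, unobserved

# Training (schematic): for each model m in the train split, sample an episode
# and maximize the ELBO of p(Y_m^u | Y_m^o, X).
# Deployment: for a new model, the acquisition policy selects K prompts, their
# real scores form Y^o, and the trained neural process predicts the remaining
# N-K scores -- no retraining of the neural process is required.
```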
Thank you for all of your comments. You have a good paper, but it needs more polishing, especially in the writing. I will increase my score, but please implement all the promised changes before the paper is published.
Thank you for your kind words. We will definitely implement all the promised changes before publication.
The authors propose a method to speed up benchmark evaluation for large language models. The method fits a neural process to existing benchmark results and predicts a larger set of evaluation results from a smaller set of results. In doing so, the authors hope to leverage the correlations between the results of a language model on different prompts in a benchmark. What works best in the empirical evaluation is a method that chooses prompts based on a reinforcement learning heuristic.
Strengths
(1)
The general research direction is well motivated. Benchmark evaluation for LLMs is indeed a costly enterprise. Methods to compare models from fewer examples could potentially be very useful.
(2)
The authors tried out several different model fitting methods. The RL method required some clever reward engineering. The amount of work displayed here is adequate for an ICLR paper.
Weaknesses
(1)
There is a major methodological weakness in the evaluation.
The authors only test how well their method can fit existing benchmark results. To do so they split the benchmark results into training and test, fit the method on train, and evaluate the approximation errors on test. You might call this the missing value interpolation error on the benchmark. This quantity, however, does not shed light on what would happen if a new model entered the benchmark. The new model might be out-of-distribution with respect to existing benchmark results. This can easily happen if the model answers prompts in a manner that's significantly different from prior models, as is the case with major model improvements. In this case, the interpolation error on existing benchmarks does not and cannot give an indication of how well the proposed method would do on the new model.
To strengthen the evaluation part of this paper, it would be helpful to train the method on a subset of older models and then test what would happen if newer models entered the benchmark. Does the method correctly identify the relative ordering of the new models? This is a more meaningful criterion than the missing value interpolation error on the entire leaderboard.
(2)
The method is highly complex with a lot of moving pieces and knobs to turn. In particular, there are several different confounding sources of error: the prompt embedding, the selection method, the neural approximation. This leads to several problems of which I'll mention two:
(a) It's somewhat unclear how a benchmark designer is supposed to implement this method without already having a large amount of benchmark results to draw on. In particular, it's not clear how to get this method off the ground for a new benchmark.
(b) As the set of models in the benchmark grows, it would seem that this method requires retraining on the growing set of data points. This is an issue as it would possibly reorder the existing model ranking.
(c) The primary purpose of a benchmark is to standardize an evaluation protocol. With so many moving pieces and design choices, a benchmark based on this method will likely be harder to standardize. Moreover, the many design choices will make the benchmark appear arbitrary and reduce trust.
(3)
Given that the method necessarily introduces bias, it would be good to combine it with existing bias mitigation methods such as prediction powered inference. See, e.g., this reference and prior work: https://arxiv.org/abs/2403.07008
Questions
It's not clear to me how to apply this method:
(a) At the design stage, when no benchmark data is known.
(b) When new models are introduced to an existing benchmark.
Could you please explain how this method could be incorporated into the design of benchmarks from the get-go? What do you do right at the beginning?
Moreover, as the set of models in a benchmark grows, how does the method keep up with the new data? Do you recommend retraining after each model is introduced?
Finally, how does the computational complexity of implementing your method (possibly with retraining after each new model) compare with simply evaluating on more prompts?
3. Apply the Method at the Design Stage
You're correct that it is unclear how a benchmark designer would implement our method without already having a large amount of benchmark results to draw on. However, this is not the primary setting we target with our approach.
Our method is designed to enable efficient evaluation of large language models on existing, well-established benchmarks that have already accumulated a substantial amount of evaluation data from various models. In such scenarios, our approach can leverage this prior evaluation data to learn the dependencies across prompts and develop effective acquisition policies, thereby reducing the computational cost of evaluating new models on the benchmark.
That being said, if one were to apply our method to a completely new benchmark with no prior evaluation data, we acknowledge that it would be challenging to get the method "off the ground." In such cases, we propose a two-stage approach:
Stage 1 (Cold Start): During the initial design phase, when no evaluation data is available, we can rely on simpler acquisition policies like random sampling or clustering based on prompt embeddings. These policies do not require any prior evaluation data and can be used to acquire an initial set of evaluation scores for a small number of seed models. As for the benchmark performance estimation, one can directly estimate it based on the scores on the selected prompts (corresponds to the "w/o pred" setting in Table 2). This will result in higher estimation error, but depending on the budget one can allocate, it provides a reasonable starting point.
Stage 2 (Warm Start): Once we have acquired some initial evaluation data from Stage 1, we can then train our neural process model and RL-based acquisition policy on this seed data. From this point onwards, our method can be used to actively acquire evaluation scores for new models more efficiently, leveraging the learned dependencies and acquisition policies. Note that both our neural process model and acquisition policy can be trained with partial available scores, thus the collected scores in Stage 1 can be effectively used.
We will include a detailed discussion of this two-stage approach in the paper. While our method may not be immediately applicable to brand-new benchmarks without any prior data, the two-stage approach offers a practical solution to bootstrap the process and subsequently benefit from the computational efficiency of our method as more evaluation data becomes available.
4. Handling New Models in an Existing Benchmark
As our results indicate, our approach can reasonably generalize to new models, even those that may exhibit different behavior compared to the models used for training. This generalization capability is demonstrated in our experiments where we split the models into older and newer groups (based on submission dates), train our method on the older group, and evaluate its performance on the newer, unseen group of models. We also tested generalization capability by dividing the models based on organizations (See Fig 2).
However, we recognize that if the new models introduced to the benchmark exhibit behavior that is significantly different from all previously seen models, it is possible that our method may give higher estimation errors. Dealing with such out-of-distribution scenarios is an interesting and important direction for future work, not only for our method but also for the broader machine learning community.
If significant distribution shifts are observed as new models are added to the benchmark, one potential approach could be to periodically retrain our method, incorporating the evaluation data from the new models. This would help our method adapt and capture the evolving dependencies across prompts and model behaviors.
Alternatively, we could explore techniques from continual learning or lifelong learning, which aim to enable models to adapt to new data while preserving the knowledge acquired from previous data. By incorporating such techniques, our method may be able to update its understanding of prompt dependencies and acquisition policies in an incremental manner, without the need for complete retraining from scratch.
While addressing out-of-distribution scenarios and distribution shifts is an important area for future research, we believe that our current work provides a solid foundation and demonstrates the potential of our approach in enabling efficient LLM evaluation within the scope of the existing benchmark data. We will include a discussion of these limitations and future directions in the paper, highlighting the need for further research in this area.
We deeply appreciate you taking the time to provide such detailed and insightful feedback. We believe that by addressing these concerns, we can significantly strengthen the contributions and practical impact of our work. We hope that with these improvements, you would reconsider your rating and advocate for the publication of our paper.
1. Evaluation on New Models (Out-of-Distribution)
You are correct that evaluating the interpolation error on existing benchmark results may not fully capture the performance on new models that could potentially be out-of-distribution compared to the models used for training. This is a valid concern, and we have already considered it in our evaluation methodology.
Note that our data splitting strategy (lines 360-371) is based on models, not prompts. Our train and test sets contain non-overlapping sets of models to assess how our methods generalize to new models. Due to space limitations, the data split details are moved to Appendix E. We will move them to the main text for the camera-ready version.
For the OpenLLM Leaderboard and MMLU benchmarks, we leveraged the natural split based on the submission dates of the models, dividing them into older and newer groups accordingly. We evaluate our method's performance in accurately estimating the benchmark scores for the newer group of models after training on the older group. This evaluation setup directly assesses our approach's ability to generalize to unseen, potentially out-of-distribution models.
For benchmarks like HELM-Lite and AlpacaEval, where submission dates were not available, we conduct ablation studies by dividing the models based on their organizations. For HELM-Lite, we use proprietary models, such as GPT-4 and Claude, for training and test on open-source models, such as LLaMA and Mistral. For AlpacaEval, we do the opposite, using open-source models for training and proprietary models for testing. Please see details in lines 447-464 and results in Figure 2.
Overall, our RL-based acquisition policy generalizes better to the out-of-distribution (OOD) models compared to other baselines. We appreciate you highlighting this important aspect of our work, and we will ensure that our evaluation methodology and results are clearly described in the paper to address your concern.
2. Ablation Studies on Moving Pieces
You're correct that our method involves several components, including prompt embeddings, subset selection policies, and neural process approximation, which could potentially introduce confounding sources of error. To disentangle these, we have performed the following ablation studies:
- Prompt Embeddings: We have evaluated the impact of using different prompt embedding models, ranging from large pre-trained models like SFR and E5 to smaller models like BGE-large and BGE-small. Our results, presented in Table 3, demonstrate that our method's performance generally improves with more powerful embedding models that can better distinguish text inputs. Based on these findings, we recommend using the most powerful embedding model available for the best performance.
- Subset Selection Policies: We have extensively studied and compared various subset selection policies, including random policies (uniform sampling and stratified random sampling), static policies (clustering based on embeddings, scores, and IRT features), and our proposed dynamic RL-based policy. Our empirical results, shown in Figure 1, clearly indicate that the RL-based policy outperforms other baselines across multiple benchmarks, demonstrating its effectiveness in selecting informative prompts.
- Neural Process Approximation: To assess the impact of the neural process approximation component, we have compared our proposed method, which leverages the neural process to predict evaluation scores for unselected prompts, with a baseline approach that directly aggregates the acquired evaluation scores without any prediction. The results, presented in Table 2, show that the neural process generally provides better benchmark performance estimation, highlighting the importance of this component in our method. Furthermore, in Table 1, we compare our neural process model against the IRT model proposed in tinyBenchmarks, using both the prompt selections from our RL-based policy and those from tinyBenchmarks. The results demonstrate that for both prompt selections, using our neural process produces better benchmark performance estimates, indicating that our model better captures the dependencies and predicts the missing evaluation scores more accurately.
5. Potential Impact on Benchmark Standardization and Trust
You raise a fair point regarding the potential challenges in standardizing a benchmark based on our method, given the multiple design choices and moving pieces involved. However, we would like to emphasize that the primary goal of our work is to improve the efficiency and reduce the computational cost of LLM evaluation, while maintaining reasonable accuracy in performance estimation.
As our results demonstrate, our method can achieve benchmark performance estimation within a reasonable error range (e.g., within 2% for most benchmarks), while significantly reducing the number of prompts required for evaluation. This computational efficiency is crucial for enabling frequent evaluations during model development and extensive hyperparameter tuning during inference, which are often hindered by the substantial costs associated with comprehensive LLM benchmarks.
We believe that our method's potential for significant computational savings and efficient LLM evaluation outweighs the challenges associated with standardization. One potential mitigation is to use our efficient methods to select a subset of models of interest and run full evaluation only on those selected models.
By leveraging our approach to identify the most promising models based on the efficient evaluations, we can then allocate resources to conduct comprehensive evaluations on this curated subset. This two-stage process could strike a balance between computational efficiency and standardization, as the final comprehensive evaluations would adhere to the existing standardized protocols.
6. Computational Complexity Compared to Full Evaluation
Regarding the computational complexity of our method compared to full evaluation, there are two main factors to consider:
- The cost of acquiring evaluation scores for the selected subset of prompts, which is a fraction of the full evaluation cost.
- The cost of running the neural process model and acquisition policies.
While the second factor introduces some additional overhead, the primary computational savings come from the reduced number of prompts that need to be evaluated. In our experiments, we observed that our method could achieve accurate performance estimation while acquiring evaluation scores for only a small subset (e.g., <1% for MMLU and OpenLLM, 1.5% for HELM-Lite) of the total prompts.
The bottleneck of evaluation cost is on LLM inference, which typically requires high-end computation resources, while our neural process model and acquisition policy are much smaller (2-3 linear layers in our experiments), which can be executed efficiently.
By significantly reducing the number of prompts that require LLM inference, our method offers substantial computational savings compared to full evaluation, even when accounting for the overhead of running the neural process model and acquisition policies. Additionally, the potential need for periodic retraining (which is only needed when new models are significantly different from existing ones) as new models are added incurs a relatively minor cost, as our models are lightweight and can be efficiently trained on the available evaluation data.
7. Combination with Bias Mitigation Methods
You make an excellent suggestion regarding combining our method with existing bias mitigation techniques, such as prediction-powered inference (Puri et al., 2023). We agree that this could be a valuable direction to explore, as our method may introduce biases due to the selective evaluation process. We will investigate the integration of our approach with bias mitigation methods for future works.
Thank you again for raising these important points and concerns. If any concerns remain unaddressed, please do not hesitate to provide further feedback, as we are committed to ensuring the robustness and reliability of our method.
Thank you for your extensive responses, which I read with interest. I'm appreciative of the effort you put into this work and also your response.
I re-read your paper in light of your comments. Here is where I landed:
I paid closer attention to the performance of random sampling in the experimental evaluation this time around. The reason is that uniform sampling solves all the issues I pointed out: cold start, design stage, complexity, trust, uncertainty quantification, etc.
It looks to me that uniform sampling is surprisingly competitive compared with much more complex policies. It achieves small error across the board while permitting statistical analysis. It's also easily interpretable by benchmark users and does not raise any trust issues. By definition, it's quite efficient, too. And it applies at design stage.
So, the first rhetorical lift that any paper on this topic would have to do is to argue why random sampling isn't good enough. But the crux is I just wasn't persuaded that this is so.
Even if on some benchmarks more complex methods have slightly smaller error, it is not clear to me that this gain justifies the cost. We're going from 0.052 to 0.034 (Chatbot Arena) or 0.018 to 0.013 (MMLU). But this improvement of 0.005 comes at the cost of a complete loss of uncertainty quantification, loss of interpretability, a massive increase in computational complexity, the cold start problem, etc.
It is not even clear to me that this 0.005 improvement is real in the sense that the error might be below the inherent noise level of the benchmark. We simply may not want to distinguish between models that differ by 0.005 on MMLU. We're probably better off calling it even at that point.
I still see merit in your paper by demonstrating how even highly complex methods only provide modest gains over random sampling. But then the story shouldn't be: Use this complicated method. But rather: np.random.choice is all you need.
This paper proposes the Active Evaluation Acquisition (AEA) approach, which is essentially an adaptive subset selection of prompts for a model so that the generalization error on the remaining prompts is low. Note that this is different from active learning, as the goal here is to perform better at evaluation time rather than to train an accurate model. Here, I also want to point out that even though the authors have noted their setting is different from active learning, their prompt selection strategy is basically an uncertainty sampling (max-entropy) approach (see eq (5)). Their key idea is to capture the dependencies between the prompts by modeling the conditional distribution of unobserved labels given observed labels as a stochastic process, as they outline in eq (2). Since the integration is over a high-dimensional latent space, following a standard procedure they optimize the evidence lower bound (ELBO) as in a variational autoencoder (VAE). Once the underlying modeling as a stochastic process is fixed, they then use an active (or static) acquisition policy to choose the prompts in Algorithm 1. They propose several acquisition strategies for selecting the prompts, ranging from a static combinatorial optimization approach to max entropy. They evaluate this approach on several LLM benchmark datasets.
Strengths
- The paper is well written as it clearly lays down the motivation of the paper, prompt subset selection for evaluation. However, how significant the motivation is remains unclear to me.
- The key idea is to capture the dependencies between the prompts by modeling the conditional distribution between unobserved label and observed labels using a stochastic process makes sense to me. This is standard VAE stuff.
- They clearly explain the acquisition function used to select the prompts. However, these acquisition functions are well-known in active learning. So I cannot consider them a novel contribution of this paper.
Weaknesses
- The writing needs more improvement. For example, in lines 86-87, the subset has cardinality K but is indexed from 1 to N. Is K < N? Do you mean to say that there are some repetitive prompts?
- The approach seems very similar to few-shot learning with LLMs; however, they do not compare with these works. See my question 1.
- The significance of the motivation is not clear to me. Also, it is not clear to me how this subset of prompts on which the LLM is evaluated captures the hardness of the task. See my questions 2 and 3.
- The difference from active learning is not very clear. This needs more explanation. See my question 4.
Questions
- I understand the paper is doing subset selection of prompts for evaluation so that it finds the most informative prompts so that if the model performs well in these prompts, then the model can be said to perform well on all prompts from the test set. This is related to the few-shot learning with LLMs [a, b, c, d] and the relevant papers that cite them. Note that these works use subset selection during training time whereas this paper does subset selection of prompts during evaluation time. However, under the condition that training and testing prompts come from the same underlying distribution, how can the two approaches be different (or can be reconciled)?
- The significance of this setting is not clear to me. Most importantly, a more detailed discussion on related work will bring this out. What are the works that are closest to your work that do some sort of active testing? The only reference I can find is the Kossen et al. 2021 (line 318).
- I understand how the different acquisition functions capture prompt diversity and informativeness using embeddings. This is standard in AL/RL/bandit literature. However, an additional component in natural language generation/text generation is the difficulty of the task (such as chain-of-thought prompting or mathematical reasoning). How does your acquisition function take these into account?
- Again, my final question is more in relation to active learning. The approaches that you mention such as clustering, uncertainty sampling, and information gain are already well known. The combinatorial approach can be mapped to combinatorial semi-bandits (see [e]). In light of this, can the authors shed more light on the key technical novelty/approach of this paper? Can I simply say that this work is an extension of active training (read active learning where you may or may not train the model and simply do few-shot learning) to active testing for LLMs?
[a] Yiming Zhang, Shi Feng, and Chenhao Tan. Active example selection for in-context learning [b] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learning? [c] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. [d] Subhojyoti Mukherjee, Ge Liu, Aniket Deshmukh, Anusha Lalitha, Yifei Ma, Branislav Kveton Experimental Design for Active Transductive Inference in Large Language Models [e] Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari, Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits
4. Technical novelty and comparison with active learning:
While our work shares similarities with active learning in the sense that we aim to select the most informative examples, there are key distinctions in terms of the objectives and settings.
In active learning, the goal is to select a subset of examples to label and then use them to train a better model. The objective is to improve the model's performance on the task at hand by acquiring useful training examples. In contrast, our work focuses on selecting a subset of prompts for efficient evaluation of an already-trained LLM, without any further fine-tuning or adaptation. The objective is to accurately estimate the model's overall performance on a comprehensive benchmark while minimizing the number of prompts required for evaluation.
Additionally, in active learning, the labeled examples are directly used for training, and the model's performance is evaluated on a separate test set. In our setting, the acquired evaluation scores are used to predict the scores for the remaining prompts, and the final performance estimate is derived from both the acquired and predicted scores.
While we do leverage some techniques from active learning, such as uncertainty sampling and information gain, our primary contribution lies in adapting these methods to the specific context of LLM evaluation and developing novel acquisition policies tailored to this setting. For instance, our RL-based acquisition policy (see equation 7) incorporates auxiliary information from the generative model and intermediate rewards to guide the exploration of the prompt space more effectively. Furthermore, we introduce the cold-start problem, where new prompts are added to the benchmark without any prior evaluation scores, and propose techniques to handle this scenario.
We will clarify the distinctions between our work and active learning in the paper and provide a more detailed comparison to highlight our specific technical contributions in the context of efficient LLM evaluation.
5. Others:
The writing needs more improvement. For example, in lines 86-87, the stated cardinality of the subset does not match the index range used for its elements. Do you mean to say that there are some repetitive prompts?
The optimal subset is a subset of the benchmark's prompt set, with cardinality at most the total number of prompts. There are no repetitive prompts; instead, we select unique prompts from the full set of prompts in the benchmark.
Thank you once again for the insightful review and constructive feedback. We appreciate the opportunity to clarify and address the raised issues, which will undoubtedly strengthen the paper's clarity and positioning. If you have any further suggestions or comments, please let us know.
I am highly unconvinced by the authors' answers. Here is my take:
- Yes, traditional AL deals with training a model on an informative and diverse subset of examples (read: prompts). However, as I have shown, there are plenty of examples where the model is not trained at all and few-shot learning with in-context examples is used instead. Here is one more paper as a pointer to the associated work [f].
- Motivation somewhat makes sense.
- The answer on task hardness is unsatisfactory. A sentence like "Our acquisition functions, by leveraging the dependencies among these evaluation scores, indirectly account for the differences in reasoning requirements across prompts" is very vague. This needs to be shown concretely through experiments and evaluations.
- Again, you need not retrain the model, yet you can use AL techniques to balance uncertainty and diversity sampling. [f] is an example where they do not retrain the model. There are similar papers as well; do a literature search.
[f] Optimal Design for Adaptive In-Context Prompt Tuning in Large Language Models
Thank you for the detailed review and insightful questions. We will do our best to address the issues and questions raised. We believe this is an important setting for the LLM evaluation community, as it can significantly reduce the evaluation cost, which is a critical challenge given the rapidly increasing scale and complexity of LLM benchmarks. We would greatly appreciate it if you could advocate for publication if we have adequately addressed your questions and concerns.
1. Comparison with few-shot learning:
You raise a valid point about the connection between our work and few-shot learning with LLMs. While there are similarities in the sense that both involve selecting a subset of examples, the key difference lies in the objective. Few-shot learning aims to select examples that can effectively adapt a pre-trained LLM to a new task or domain, with the goal of improving the model's performance on that specific task. In contrast, our work focuses on selecting a subset of prompts for efficiently evaluating an already-trained LLM across a diverse set of tasks or capabilities, without any further fine-tuning or adaptation. The objective is to accurately estimate the model's overall performance on the entire benchmark while minimizing the number of prompts required for evaluation.
Despite this difference in objectives, we acknowledge that there could be potential synergies between the two lines of work, and a more detailed discussion comparing and contrasting them would be beneficial. We will incorporate a dedicated section in the related work to establish this connection and highlight the key distinctions.
2. Significance of the motivation:
The motivation for our work stems from the increasing complexity and comprehensiveness of LLM evaluation benchmarks, which often comprise thousands or even millions of prompts spanning diverse tasks and capabilities. Evaluating an LLM on such large-scale benchmarks can incur substantial computational costs, both in terms of resources and time. For instance, as mentioned in the paper, evaluating on the HELM benchmark requires 4,200 GPU hours for a 176B model and over $9,000 in API costs. These substantial costs can hinder frequent evaluations during model development and limit extensive hyperparameter tuning during inference.
Our work aims to alleviate these challenges by reducing the number of prompts required for accurate LLM evaluation, thereby lowering the computational burden and associated costs. This is particularly important in scenarios where frequent evaluations are necessary, such as during model development or when exploring various decoding and prompting strategies.
We will strengthen the motivation section in the paper to better highlight the practical significance and potential impact of our work in enabling more efficient and cost-effective LLM evaluation.
3. Capturing task hardness in the acquisition function:
You make a good point. The current LLM evaluation benchmarks primarily focus on assessing the final answer or output quality, rather than explicitly evaluating the underlying reasoning capabilities or the solution process followed by the model.
While the reasoning requirements for different prompts may vary in difficulty, such as those involving mathematical problem-solving or multi-step logical reasoning, these complexities are effectively integrated into the final evaluation score assigned to each prompt. In other words, a model with stronger reasoning capabilities is likely to perform better and achieve higher scores on prompts that demand complex reasoning skills, as compared to models with more limited reasoning abilities. Consequently, the evaluation scores themselves implicitly capture the model's proficiency in handling prompts with varying levels of reasoning difficulty.
Our acquisition functions, by leveraging the dependencies among these evaluation scores, indirectly account for the differences in reasoning requirements across prompts. Models that excel at complex reasoning will have higher scores on the corresponding prompts, and our methods will prioritize the selection of such informative prompts during the acquisition process.
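As a purely illustrative example of how dependencies among evaluation scores can prioritize prompts (a simplified stand-in, not the acquisition policy used in the paper), one could fit a joint Gaussian over per-prompt scores from previously evaluated models and greedily select the prompt whose score is most uncertain given the scores acquired so far:

```python
import numpy as np

def pick_next_prompt(score_history, acquired):
    """Greedy uncertainty sampling over prompt-score dependencies.

    score_history -- (num_models, num_prompts) matrix of past evaluation scores
    acquired      -- dict {prompt index: observed score} for the current model
    Returns the index of the unselected prompt with the highest conditional
    variance under a joint Gaussian fit to the historical scores.
    """
    num_prompts = score_history.shape[1]
    cov = np.cov(score_history, rowvar=False) + 1e-6 * np.eye(num_prompts)

    sel = sorted(acquired)
    rest = [i for i in range(num_prompts) if i not in acquired]
    if not sel:  # cold start: fall back to the marginal variance
        return int(max(rest, key=lambda i: cov[i, i]))

    # Conditional covariance of unselected scores given the acquired ones:
    # Sigma_rr - Sigma_rs Sigma_ss^{-1} Sigma_sr
    k_ss = cov[np.ix_(sel, sel)]
    k_rs = cov[np.ix_(rest, sel)]
    cond_cov = cov[np.ix_(rest, rest)] - k_rs @ np.linalg.solve(k_ss, k_rs.T)

    # Prompts whose scores are hardest to predict from what is already known
    # (e.g., prompts probing an uncovered reasoning skill) are picked first.
    return int(rest[int(np.argmax(np.diag(cond_cov)))])
```

Under such a criterion, prompts whose scores are poorly explained by the already-acquired ones, including harder reasoning prompts, are naturally selected earlier.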
While our current approach does not explicitly model or quantify the specific reasoning types or difficulty levels, it effectively captures these nuances through the evaluation scores, which serve as a holistic measure of the model's performance on each prompt, integrating all aspects of the required reasoning capabilities. However, we acknowledge the value of your suggestion and the potential benefits of explicitly modeling and considering reasoning complexity during the acquisition process. As future work, we will explore techniques to incorporate more fine-grained representations of reasoning requirements. This could enable a more nuanced analysis and targeted evaluation of specific reasoning skills, in addition to the overall performance assessment provided by the existing benchmarks.
The paper studies strategies to improve evaluation efficiency by actively selecting a subset of prompts to evaluate and then predicting the model's performance on the remaining prompts. The paper empirically demonstrates that the proposed approach can reduce the number of evaluation prompts.
The weaknesses pointed out by the reviewers include the limited comparison to related work on few-shot learning with LLMs and to existing techniques in active learning and active testing, the complexity of the proposed approach, and its unclear advantages over the very simple baseline of random sampling.
Additional Comments from the Reviewer Discussion
While the reviewers acknowledged the authors' effort during the rebuttal period, some concerns remain. In particular, the reviewers are still unsatisfied with the discussion of comparisons to related work on active learning and active testing. Another reviewer pointed out that the comparison to uniform sampling is not convincing: while the proposed approach achieves slightly better performance in the experiments, the uniform sampling strategy is also competitive. The authors are therefore encouraged to explain why their approach should be preferred over uniform sampling, especially given the simplicity of the uniform sampling strategy.
Reject