1. Why evaluate LLMs?

  1. LLMs are evolving rapidly, and evaluation helps us understand their strengths and weaknesses.
  2. LLMs are widely deployed and highly capable, so before putting them into production we need to understand their safety and reliability.
  3. As parameter counts grow, LLMs exhibit emergent abilities, and existing evaluation frameworks may no longer fully apply.
 

 

2. Examine LLMs, but how?

 
  • what to evaluate
  • where to evaluate
  • how to evaluate.
 
 
 
 
detail
Natural language inference (NLI) is the task of determining whether a given "hypothesis" logically follows from a given "premise".
Reasoning requires the model not only to understand the given information, but also to reason and infer from the existing context in the absence of direct answers.
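For concreteness, here is a minimal NLI illustration (the texts are made up, and the three-way entailment / neutral / contradiction label scheme is an assumption borrowed from common NLI datasets):

```python
# Illustrative NLI instances (texts are made up; the three-way label scheme
# entailment / neutral / contradiction follows common NLI datasets).
nli_examples = [
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "Someone is performing music.",
        "label": "entailment",      # hypothesis follows from the premise
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The stage is empty.",
        "label": "contradiction",   # hypothesis conflicts with the premise
    },
]

for ex in nli_examples:
    print(ex["label"], "-", ex["hypothesis"])
```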
 
 
 
 

 

3. Evaluation Benchmarks

 

3.1 C-Eval

 
 

3.1.1 eval prompt design
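C-Eval uses a standard few-shot multiple-choice format (options A–D, exemplars drawn from the dev split as described in 3.1.2). Below is a minimal sketch of how such a 5-shot prompt could be assembled; the field names mirror C-Eval's CSV layout, but the instruction wording in the sketch is an assumption, not the official template.

```python
# Hypothetical sketch of building a 5-shot C-Eval style multiple-choice prompt.
# Assumes each row has question, A-D, and answer fields (as in C-Eval's CSVs);
# the instruction wording below is an assumption, not the official template.
def format_example(row: dict, include_answer: bool = True) -> str:
    text = (
        f"{row['question']}\n"
        f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
        "答案:"
    )
    if include_answer:
        text += f"{row['answer']}\n\n"
    return text

def build_5shot_prompt(dev_rows: list[dict], test_row: dict, subject: str) -> str:
    prompt = f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n"
    for row in dev_rows[:5]:                                   # dev split supplies the exemplars
        prompt += format_example(row)
    prompt += format_example(test_row, include_answer=False)   # model completes the answer letter
    return prompt
```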

 
 

3.1.2 result

 
C-Eval is a comprehensive Chinese foundation-model evaluation suite covering 52 subjects across four difficulty levels. Baichuan uses its dev split as the source of few-shot exemplars and runs a 5-shot evaluation on the test split.
 
| Model | 5-shot Average |
| --- | --- |
| GPT-4 | 68.7 |
| ChatGPT | 54.4 |
| Claude-v1.3 | 54.2 |
| Claude-instant-v1.0 | 45.9 |
| BLOOMZ-7B | 35.7 |
| ChatGLM-6B | 34.5 |
| Ziya-LLaMA-13B-pretrain | 30.2 |
| moss-moon-003-base (16B) | 27.4 |
| LLaMA-7B-hf | 27.1 |
| Falcon-7B | 25.8 |
| TigerBot-7B-base | 25.7 |
| Aquila-7B* | 25.5 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 |
| BLOOM-7B | 22.8 |
| Baichuan-7B | 42.8 |
| Vicuna-7B | 31.7 |
| Chinese-Vicuna-7B | 27.0 |
 
 

3.2 AlpacaEval

 
An automatic evaluation pipeline aimed at instruction-following models (ChatGPT, etc.). Its dataset has been validated against 20K human annotations (labeled by 16 crowd workers).
 
  • data: 20K human annotations
  • auto evaluator: model quality is measured by how often a strong, well-regarded LLM (GPT-4, Claude, ChatGPT) prefers the model's output over that of a reference model (see the win-rate sketch after the example below).
    • reference model: davinci-003
 
 
pairwise evaluation:
[{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.', 'input': '', 'output_1': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]", 'output_2': "Hey everyone! \n\nI'm hosting a dinner party this Friday night and I'd love for all of you to come over. We'll have a delicious spread of food and some great conversations. \n\nLet me know if you can make it - I'd love to see you all there!\n\nCheers,\n[Your Name]", 'annotator': 'chatgpt_2', 'preference': 2}]
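Preference records like the one above are aggregated into the reported win rate. The sketch below assumes the convention visible in the record (preference == 2 means the evaluated model's output_2 was preferred over the reference's output_1); it is an illustration, not the alpaca_eval library API.

```python
# Sketch: aggregate pairwise preference records into a win rate vs. the
# reference model. Assumes preference == 2 means the evaluated model's output
# (output_2) was preferred; ties, if any, are ignored here for simplicity.
def win_rate(records: list[dict]) -> float:
    wins = sum(1 for r in records if r["preference"] == 2)
    return wins / len(records)

records = [{"preference": 2}, {"preference": 1}, {"preference": 2}]
print(f"win rate vs. reference: {win_rate(records):.2%}")  # 66.67%
```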
 
 
limitations:
  1. The instructions may not represent real-world usage.
  2. Dataset bias: the annotations from 16 crowd workers show certain preferences for answer position and length.
  3. AlpacaEval does not validate whether win rate against a reference model is actually a good evaluation strategy.
  4. The preferences of 16 crowd workers do not represent everyone's.
 

3.3 PandaLM

 
🎃
Evaluating large language models with PandaLM
 
 
 

3.4 PromptBench

 
Similar in spirit to RobustBench (a standardized adversarial robustness benchmark).
 
PromptBench provides a convenient infrastructure to simulate black-box adversarial prompt attacks on models and evaluate their performance.
 
Underlying assumption: existing LLMs are sensitive to adversarial prompts.
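To make the idea concrete, here is a toy character-level perturbation of a task prompt. It only sketches the kind of black-box attack PromptBench simulates; it is not one of its actual attack implementations.

```python
# Toy character-level prompt attack: swap a few adjacent characters and check
# whether the model's answers (and hence accuracy) change. This is only an
# illustration of the idea; PromptBench implements several stronger attacks.
import random

def char_swap_attack(prompt: str, n_swaps: int = 2, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]   # swap adjacent characters
    return "".join(chars)

clean_prompt = "Classify the sentiment of the following review as positive or negative:"
adversarial_prompt = char_swap_attack(clean_prompt)
print(adversarial_prompt)
# A robust model should answer (nearly) the same under both prompts.
```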
 
 
ideas
 

3.5 AGIEval

 
  • cloze tasks
  • multi-choice question answering tasks
 
 
 

3.6 OpenCompass

 
 

3.7 Chatbot Arena

 
 

3.8 LLM as judge

 

 
Key elements of evaluating an LLM:
  1. data
  2. evaluator
  3. metric
 

 
  • type:
    • Pairwise comparison: An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie.
      • prompt eg.
        [System] Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie. [User Question] {question} [The Start of Assistant A’s Answer] {answer_a} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {answer_b} [The End of Assistant B’s Answer]
    • Single answer grading: an LLM judge is asked to directly assign a score to a single answer.
      • prompt eg.
        [System] Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, please rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]". [Question] {question} [The Start of Assistant’s Answer] {answer} [The End of Assistant’s Answer]
    • Reference-guided grading: In certain cases, it may be beneficial to provide a reference solution
      • prompt eg.
        [System] Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer, assistant A’s answer, and assistant B’s answer. Your job is to evaluate which assistant’s answer is better. Begin your evaluation by comparing both assistants’ answers with the reference answer. Identify and correct any mistakes. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie. [User Question] {question} [The Start of Reference Answer] {answer_ref} [The End of Reference Answer] [The Start of Assistant A’s Answer] {answer_a} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {answer_b} [The End of Assistant B’s Answer]
  • limitations:
    • Position bias: LLM judges tend to favor answers in certain positions (e.g., the first answer presented), so swapping the order of the two responses can flip the verdict (a mitigation sketch follows this list).
    • Verbosity bias: LLM judges favor longer, more detailed answers; long but redundant responses are often preferred over short, concise ones.
      • example
        Apart from two rephrased items (highlighted in red), Assistant A's answer is identical to Assistant B's. Both GPT-3.5 and Claude-v1 show a verbosity bias toward the longer, repetitive answer; only GPT-4 detects this attack.
    • Self-enhancement bias: LLMs may prefer the answers they themselves generated.
    • Limited grading capability: for grading tasks such as math and reasoning, where LLMs (including GPT-4) are themselves weaker, their grading ability is correspondingly limited.
      • example 1
        GPT-4 has trouble grading even basic math problems:
        5 x 20 = 100 (sci-fi novels)
        3 x 30 = 90 (history book)
        2 x 45 = 90 (philosophy book)
        total 100 + 90 + 90 = 280 (instead of 295)
      • example 2
        GPT-4's limitations in judging chain-of-thought answers.
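As a concrete mitigation for position bias, the judge can be called twice with the answer order swapped and a winner accepted only when both orderings agree. The sketch below assumes a user-supplied call_judge function that sends a prompt to whichever judge model is used and returns its text; the template abbreviates the pairwise prompt quoted above.

```python
# Sketch: pairwise LLM judging with position-swap to mitigate position bias.
# `call_judge` is a placeholder for your own chat-completion call; the template
# abbreviates the pairwise prompt quoted above.
import re

JUDGE_TEMPLATE = (
    "[System] Please act as an impartial judge ... output your final verdict by strictly "
    'following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is '
    'better, and "[[C]]" for a tie.\n'
    "[User Question] {question}\n"
    "[The Start of Assistant A's Answer] {answer_a} [The End of Assistant A's Answer]\n"
    "[The Start of Assistant B's Answer] {answer_b} [The End of Assistant B's Answer]"
)

def parse_verdict(judgment: str) -> str:
    match = re.search(r"\[\[([ABC])\]\]", judgment)
    return match.group(1) if match else "C"          # unparseable output counts as a tie

def pairwise_judge(call_judge, question: str, answer_1: str, answer_2: str) -> str:
    # Round 1: model 1 in position A; round 2: positions swapped.
    v1 = parse_verdict(call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2)))
    v2 = parse_verdict(call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1)))
    if v1 == "A" and v2 == "B":
        return "model_1"
    if v1 == "B" and v2 == "A":
        return "model_2"
    return "tie"                                     # disagreement across orderings -> tie
```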
         

4. How to Evaluate?

 

4.1 Automatic Evaluation

 
  • GPT-4 as evaluator
  • LLM-EVAL: a unified multi-dimensional evaluation for open-domain conversations with LLMs
  • PandaLM: LLM as judge with reference answers
 
Fast and large-scale; the cost is that of running the automatic evaluator.
 

4.2 Human Evaluation

 
Human evaluation usually yields higher-quality results, but because annotators differ from one another, it can also be quite unstable.
 
Manual and high-quality; the cost is human labor.
 

Reference

 
  1. A Survey on Evaluation of Large Language Models
  2. C-Eval (GitHub: hkust-nlp/ceval)
  3. AlpacaEval (GitHub: tatsu-lab/alpaca_eval)
  4. Self-Instruct: Aligning Language Models with Self-Generated Instructions
  5. OpenCompass
  6. PandaLM (GitHub: WeOpenML/PandaLM)
  7. PromptBench (related benchmark: RobustBench, GitHub: RobustBench/robustbench)
  8. LLM-EVAL
 
