1. Why evaluate LLMs?

  1. LLMs are evolving rapidly, and evaluation helps us understand their strengths and weaknesses.
  2. LLMs are widely deployed and highly capable, so before putting them into production we need to understand their safety and reliability.
  3. As parameter counts grow, LLMs exhibit emergent abilities, and existing evaluation frameworks may no longer fully apply.
 

 

2. Examine LLMs, but how?

 
  • what to evaluate
  • where to evaluate
  • how to evaluate.
 
 
 
 
detail
Natural language inference (NLI) is the task of determining whether a given "hypothesis" logically follows from a given "premise".
Reasoning requires the model not only to understand the given information, but also to reason and infer from the existing context in the absence of direct answers.
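For concreteness, here is a minimal NLI illustration (the texts are made up, and the three-way entailment / neutral / contradiction label scheme is an assumption borrowed from common NLI datasets):

```python
# Illustrative NLI instances (texts are made up; the three-way label scheme
# entailment / neutral / contradiction follows common NLI datasets).
nli_examples = [
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "Someone is performing music.",
        "label": "entailment",      # hypothesis follows from the premise
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The stage is empty.",
        "label": "contradiction",   # hypothesis conflicts with the premise
    },
]

for ex in nli_examples:
    print(ex["label"], "-", ex["hypothesis"])
```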
 
 
 
 

 

3. Evaluation Benchmarks

 

3.1 C-Eval

 
 

3.1.1 eval prompt design
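C-Eval uses a standard few-shot multiple-choice format (options A–D, exemplars drawn from the dev split as described in 3.1.2). Below is a minimal sketch of how such a 5-shot prompt could be assembled; the field names mirror C-Eval's CSV layout, but the instruction wording in the sketch is an assumption, not the official template.

```python
# Hypothetical sketch of building a 5-shot C-Eval style multiple-choice prompt.
# Assumes each row has question, A-D, and answer fields (as in C-Eval's CSVs);
# the instruction wording below is an assumption, not the official template.
def format_example(row: dict, include_answer: bool = True) -> str:
    text = (
        f"{row['question']}\n"
        f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
        "答案:"
    )
    if include_answer:
        text += f"{row['answer']}\n\n"
    return text

def build_5shot_prompt(dev_rows: list[dict], test_row: dict, subject: str) -> str:
    prompt = f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n"
    for row in dev_rows[:5]:                                   # dev split supplies the exemplars
        prompt += format_example(row)
    prompt += format_example(test_row, include_answer=False)   # model completes the answer letter
    return prompt
```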

 
 

3.1.2 result

 
C-Eval is a comprehensive Chinese foundation-model evaluation suite covering 52 subjects across four difficulty levels. Baichuan uses its dev split as the source of few-shot exemplars and runs a 5-shot evaluation on the test split.
 
| Model | 5-shot Average |
| --- | --- |
| GPT-4 | 68.7 |
| ChatGPT | 54.4 |
| Claude-v1.3 | 54.2 |
| Claude-instant-v1.0 | 45.9 |
| BLOOMZ-7B | 35.7 |
| ChatGLM-6B | 34.5 |
| Ziya-LLaMA-13B-pretrain | 30.2 |
| moss-moon-003-base (16B) | 27.4 |
| LLaMA-7B-hf | 27.1 |
| Falcon-7B | 25.8 |
| TigerBot-7B-base | 25.7 |
| Aquila-7B* | 25.5 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 |
| BLOOM-7B | 22.8 |
| Baichuan-7B | 42.8 |
| Vicuna-7B | 31.7 |
| Chinese-Vicuna-7B | 27.0 |
 
 

3.2 AlpacaEval

 
An automatic evaluation pipeline aimed at instruction-following models (ChatGPT, etc.). Its dataset has been validated against 20K human annotations (labeled by 16 crowd workers).
 
  • data: 20K human annotations
  • auto evaluator: model quality is measured by how often a strong, well-regarded LLM (GPT-4, Claude, ChatGPT) prefers the model's output over that of a reference model (see the win-rate sketch after the example below).
    • reference model: davinci-003
 
 
pairwise evaluation:
[{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.', 'input': '', 'output_1': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]", 'output_2': "Hey everyone! \n\nI'm hosting a dinner party this Friday night and I'd love for all of you to come over. We'll have a delicious spread of food and some great conversations. \n\nLet me know if you can make it - I'd love to see you all there!\n\nCheers,\n[Your Name]", 'annotator': 'chatgpt_2', 'preference': 2}]
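Preference records like the one above are aggregated into the reported win rate. The sketch below assumes the convention visible in the record (preference == 2 means the evaluated model's output_2 was preferred over the reference's output_1); it is an illustration, not the alpaca_eval library API.

```python
# Sketch: aggregate pairwise preference records into a win rate vs. the
# reference model. Assumes preference == 2 means the evaluated model's output
# (output_2) was preferred; ties, if any, are ignored here for simplicity.
def win_rate(records: list[dict]) -> float:
    wins = sum(1 for r in records if r["preference"] == 2)
    return wins / len(records)

records = [{"preference": 2}, {"preference": 1}, {"preference": 2}]
print(f"win rate vs. reference: {win_rate(records):.2%}")  # 66.67%
```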
 
 
limitations:
  1. The instructions may not represent real-world usage.
  2. Dataset bias: the annotations from 16 crowd workers show certain preferences for answer position and length.
  3. AlpacaEval does not validate whether win rate against a reference model is actually a good evaluation strategy.
  4. The preferences of 16 crowd workers do not represent everyone's.
 

3.3 PandaLM

 
🎃
Evaluating large language models with PandaLM
 
 
 

3.4 PromptBench

 
Similar in spirit to RobustBench (a standardized adversarial robustness benchmark).
 
PromptBench provides a convenient infrastructure to simulate black-box adversarial prompt attacks on models and evaluate their performance.
 
Underlying assumption: existing LLMs are sensitive to adversarial prompts.
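To make the idea concrete, here is a toy character-level perturbation of a task prompt. It only sketches the kind of black-box attack PromptBench simulates; it is not one of its actual attack implementations.

```python
# Toy character-level prompt attack: swap a few adjacent characters and check
# whether the model's answers (and hence accuracy) change. This is only an
# illustration of the idea; PromptBench implements several stronger attacks.
import random

def char_swap_attack(prompt: str, n_swaps: int = 2, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]   # swap adjacent characters
    return "".join(chars)

clean_prompt = "Classify the sentiment of the following review as positive or negative:"
adversarial_prompt = char_swap_attack(clean_prompt)
print(adversarial_prompt)
# A robust model should answer (nearly) the same under both prompts.
```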
 
 
ideas
 

3.5 AGIEval

 
  • cloze tasks
  • multi-choice question answering tasks
 
 
 

3.6 OpenCompass

 
 

3.7 Chatbot Arena

 
 

3.8 LLM as judge

 

 
Key elements of evaluating an LLM:
  1. data
  2. evaluator
  3. metric
 

 
  • type:
    • Pairwise comparison: An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie.
      • prompt eg.
        [System] Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie. [User Question] {question} [The Start of Assistant A’s Answer] {answer_a} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {answer_b} [The End of Assistant B’s Answer]
    • Single answer grading: an LLM judge is asked to directly assign a score to a single answer.
      • prompt eg.
        [System] Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, please rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]". [Question] {question} [The Start of Assistant’s Answer] {answer} [The End of Assistant’s Answer]
    • Reference-guided grading: In certain cases, it may be beneficial to provide a reference solution
      • prompt eg.
        [System] Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer, assistant A’s answer, and assistant B’s answer. Your job is to evaluate which assistant’s answer is better. Begin your evaluation by comparing both assistants’ answers with the reference answer. Identify and correct any mistakes. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie. [User Question] {question} [The Start of Reference Answer] {answer_ref} [The End of Reference Answer] [The Start of Assistant A’s Answer] {answer_a} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {answer_b} [The End of Assistant B’s Answer]
  • limitations:
    • Position bias: LLM judges tend to favor answers in certain positions (e.g., the first answer presented), so swapping the order of the two responses can flip the verdict (a mitigation sketch follows this list).
    • Verbosity bias: LLM judges favor longer, more detailed answers; long but redundant responses are often preferred over short, concise ones.
      • example
        Apart from two rephrased items (highlighted in red), Assistant A's answer is identical to Assistant B's. Both GPT-3.5 and Claude-v1 show a verbosity bias toward the longer, repetitive answer; only GPT-4 detects this attack.
    • Self-enhancement bias: LLMs may prefer the answers they themselves generated.
    • Limited grading capability: for grading tasks such as math and reasoning, where LLMs (including GPT-4) are themselves weaker, their grading ability is correspondingly limited.
      • example 1
        GPT-4 has trouble grading even basic math problems:
        5 x 20 = 100 (sci-fi novels)
        3 x 30 = 90 (history book)
        2 x 45 = 90 (philosophy book)
        total 100 + 90 + 90 = 280 (instead of 295)
      • example 2
        GPT-4's limitations in judging chain-of-thought answers.
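As a concrete mitigation for position bias, the judge can be called twice with the answer order swapped and a winner accepted only when both orderings agree. The sketch below assumes a user-supplied call_judge function that sends a prompt to whichever judge model is used and returns its text; the template abbreviates the pairwise prompt quoted above.

```python
# Sketch: pairwise LLM judging with position-swap to mitigate position bias.
# `call_judge` is a placeholder for your own chat-completion call; the template
# abbreviates the pairwise prompt quoted above.
import re

JUDGE_TEMPLATE = (
    "[System] Please act as an impartial judge ... output your final verdict by strictly "
    'following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is '
    'better, and "[[C]]" for a tie.\n'
    "[User Question] {question}\n"
    "[The Start of Assistant A's Answer] {answer_a} [The End of Assistant A's Answer]\n"
    "[The Start of Assistant B's Answer] {answer_b} [The End of Assistant B's Answer]"
)

def parse_verdict(judgment: str) -> str:
    match = re.search(r"\[\[([ABC])\]\]", judgment)
    return match.group(1) if match else "C"          # unparseable output counts as a tie

def pairwise_judge(call_judge, question: str, answer_1: str, answer_2: str) -> str:
    # Round 1: model 1 in position A; round 2: positions swapped.
    v1 = parse_verdict(call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2)))
    v2 = parse_verdict(call_judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1)))
    if v1 == "A" and v2 == "B":
        return "model_1"
    if v1 == "B" and v2 == "A":
        return "model_2"
    return "tie"                                     # disagreement across orderings -> tie
```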
         

4. How to Evaluate?

 

4.1 Automatic Evaluation

 
  • GPT-4 as evaluator
  • LLM-EVAL: a unified multi-dimensional evaluation for open-domain conversations with LLMs
  • PandaLM: LLM as judge with reference answers
 
Fast and large-scale; the cost is that of running the automatic evaluator.
 

4.2 Human Evaluation

 
Human evaluation usually yields higher-quality results, but because annotators differ from one another, it can also be quite unstable.
 
Manual and high-quality; the cost is human labor.
 

Reference

 
  1. A Survey on Evaluation of Large Language Models
  2. C-Eval (GitHub: hkust-nlp/ceval)
  3. AlpacaEval (GitHub: tatsu-lab/alpaca_eval)
  4. Self-Instruct: Aligning Language Models with Self-Generated Instructions
  5. OpenCompass
  6. PandaLM (GitHub: WeOpenML/PandaLM)
  7. PromptBench (related benchmark: RobustBench, GitHub: RobustBench/robustbench)
  8. LLM-EVAL
 
