type: Post
status: Published
date: Jul 21, 2023
category: Knowledge

1. Why evaluate LLMs?

  1. LLMs are developing rapidly; evaluation helps us understand their strengths and weaknesses.
  2. LLMs are widely deployed and highly capable; before putting them into production, we need to understand their safety and reliability.
  3. LLM parameter counts keep growing and emergent abilities appear, so existing evaluation frameworks may no longer fully apply.
 

 

2. Examining LLMs, but how?

 
  • what to evaluate
  • where to evaluate
  • how to evaluate
 
 
 
 
detail
Natural language inference (NLI) is the task of determining whether a given "hypothesis" logically follows from a "premise".
Reasoning requires the model not only to understand the given information, but also to infer from the existing context when no direct answer is available.
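To make the NLI task concrete, here is a minimal sketch of how an NLI evaluation item can be formatted as a prompt and how a free-form completion can be mapped back to a label. The function names and prompt wording are illustrative, not taken from any specific benchmark.

```python
# Minimal sketch of an NLI-style evaluation item (names and prompt
# wording are illustrative, not from any specific benchmark).
LABELS = ("entailment", "contradiction", "neutral")

def format_nli_prompt(premise: str, hypothesis: str) -> str:
    """Build a prompt asking whether the hypothesis follows from the premise."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Question: Does the hypothesis logically follow from the premise? "
        "Answer with entailment, contradiction, or neutral.\nAnswer:"
    )

def parse_nli_answer(completion: str) -> str:
    """Map a free-form model completion onto one of the three NLI labels."""
    text = completion.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return "neutral"  # fall back when the model answers off-format

prompt = format_nli_prompt("A man is playing a guitar on stage.",
                           "Someone is performing music.")
print(parse_nli_answer("Entailment, because performing implies music."))
# -> entailment
```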
 
 
 
 

 

3. Evaluation Benchmarks

 

3.1 C-Eval

 
 

3.1.1 eval prompt design

 
 

3.1.2 result

 
C-Eval is a comprehensive Chinese foundation-model evaluation dataset covering 52 subjects across four difficulty levels. Baichuan used the dataset's dev split as the source of few-shot examples and ran 5-shot evaluation on the test split.
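The 5-shot setup described above can be sketched as follows: dev-split items supply the in-context examples, and the test item is appended without its answer. The field names below are an assumption for illustration, not C-Eval's actual schema.

```python
# Hedged sketch of 5-shot prompt construction in the C-Eval style.
# The dict field names ('question', 'A'..'D', 'answer') are illustrative.
def build_few_shot_prompt(dev_items, test_item, k=5):
    """Concatenate k solved dev examples, then the unanswered test item."""
    parts = []
    for ex in dev_items[:k]:
        parts.append(
            f"{ex['question']}\nA. {ex['A']}\nB. {ex['B']}\n"
            f"C. {ex['C']}\nD. {ex['D']}\nAnswer: {ex['answer']}\n"
        )
    parts.append(
        f"{test_item['question']}\nA. {test_item['A']}\nB. {test_item['B']}\n"
        f"C. {test_item['C']}\nD. {test_item['D']}\nAnswer:"
    )
    return "\n".join(parts)

dev = [{"question": f"Q{i}?", "A": "a", "B": "b", "C": "c", "D": "d",
        "answer": "A"} for i in range(5)]
test = {"question": "Which option is correct?",
        "A": "x", "B": "y", "C": "z", "D": "w"}
prompt = build_few_shot_prompt(dev, test)
print(prompt.count("Answer:"))  # -> 6: five solved shots plus the test stub
```

The model's next token after the final "Answer:" is then scored against the gold option letter.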
 
| Model | 5-shot Average |
| --- | --- |
| GPT-4 | 68.7 |
| ChatGPT | 54.4 |
| Claude-v1.3 | 54.2 |
| Claude-instant-v1.0 | 45.9 |
| BLOOMZ-7B | 35.7 |
| ChatGLM-6B | 34.5 |
| Ziya-LLaMA-13B-pretrain | 30.2 |
| moss-moon-003-base (16B) | 27.4 |
| LLaMA-7B-hf | 27.1 |
| Falcon-7B | 25.8 |
| TigerBot-7B-base | 25.7 |
| Aquila-7B* | 25.5 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 |
| BLOOM-7B | 22.8 |
| Baichuan-7B | 42.8 |
| Vicuna-7B | 31.7 |
| Chinese-Vicuna-7B | 27.0 |
 
 

3.2 AlpacaEval

 
An automatic evaluation pipeline aimed at instruction-following models (ChatGPT etc.); its dataset has been validated against 20K human annotations (data labeled by 16 crowd workers).
 
  • data: 20k human annotations
  • auto evaluator: model quality is measured by how often a recognized strong LLM (GPT-4, Claude, ChatGPT) prefers the candidate model's output over that of a reference model.
    • reference model: davinci-003
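The win-rate computation behind this auto evaluator can be sketched as follows. The `win_rate` function and tie handling here are illustrative stand-ins; the real pipeline queries GPT-4/Claude as the judge.

```python
# Sketch of AlpacaEval-style win-rate computation: an LLM judge compares
# each candidate output against the reference model's output, and we count
# how often the candidate is preferred. Tie handling (0.5) is an assumption.
def win_rate(judgments):
    """judgments: list of 'model', 'reference', or 'tie' verdicts."""
    score = sum(1.0 if j == "model" else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)

# Example: the judge preferred the candidate in 3 of 5 comparisons,
# tied once, and preferred the reference model once.
print(win_rate(["model", "model", "tie", "reference", "model"]))  # -> 0.7
```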
 
 
pairwise evaluation: for each instruction, the candidate model's output and the reference model's output are compared head-to-head.
 
 
limitations:
  1. The instructions may not be representative of real-world usage.
  2. Dataset bias: data annotated by 16 crowd workers carries positional and length preferences.
  3. AlpacaEval does not validate whether win rate against a reference model is a good evaluation strategy.
  4. The preferences of 16 crowd workers do not represent everyone's.
 

3.3 PandaLM

 
🎃
PandaLM: evaluating large language models
 
 
 

3.4 PromptBench

 
Similar to RobustBench (a standardized adversarial robustness benchmark).
 
PromptBench provides a convenient infrastructure to simulate black-box adversarial prompt attacks on models and evaluate their performance.

Underlying assumption: existing LLMs are sensitive to adversarial prompts.
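One simple family of such attacks is character-level perturbation of the task instruction; accuracy is then compared before and after the attack. The sketch below is an illustrative typo-style attack, not PromptBench's actual implementation.

```python
# Illustrative sketch of a black-box prompt attack: randomly swap adjacent
# characters in the instruction (a typo-style perturbation), then measure
# how much the model's task accuracy drops on the perturbed prompt.
import random

def char_swap_attack(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap each letter with its right neighbor with probability `rate`."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

clean = "Classify the sentiment of the following review as positive or negative."
attacked = char_swap_attack(clean, rate=0.5)
print(attacked)  # same characters, perturbed order
```

The attack preserves the character multiset, so any accuracy drop is attributable to the model's sensitivity rather than lost information.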
 
 
 

3.5 AGIEval

 
  • cloze tasks
  • multi-choice question answering tasks
 
 
 

3.6 OpenCompass

 
 

3.7 Chatbot Arena

 
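Chatbot Arena ranks models from crowdsourced pairwise battles using Elo-style ratings. A minimal sketch of the standard Elo update (starting ratings and K-factor here are illustrative):

```python
# Minimal sketch of the Elo rating update used for pairwise model battles.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at 1000; model A wins one battle.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(round(a), round(b))  # -> 1016 984
```

Aggregating many such updates over user votes yields the leaderboard ranking.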
 

3.8 LLM as judge

 

 
Key elements in evaluating an LLM:
  1. data
  2. evaluator
  3. metric
 

 
  • type:
    • Pairwise comparison: An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie.
      • prompt eg.
    • Single answer grading: an LLM judge is asked to directly assign a score to a single answer.
      • prompt eg.
    • Reference-guided grading: in certain cases, it may be beneficial to provide a reference solution to the LLM judge.
      • prompt eg.
  • limitation:
    • Position bias: LLM judges tend to favor answers in specific positions (e.g. the first answer presented), regardless of content.
      • example
    • Verbosity bias: LLM judges favor longer, more detailed answers; a long, redundant answer often beats a short, concise one.
      • example
        Assistant A's answer is identical to Assistant B's except for two rephrased items (highlighted in red). Both GPT-3.5 and Claude-v1 show a verbosity bias toward the longer, repetitive answer; only GPT-4 successfully detects this attack.
    • Self-enhancement bias: LLMs may favor answers they themselves generated.
    • Limited grading capability: for specialized grading tasks such as math and reasoning, where LLMs (including GPT-4) are themselves weaker, their grading ability is correspondingly limited.
      • example 1
        GPT-4 misjudges even basic math problems:
        5 x 20 = 100 (sci-fi novels)
        3 x 30 = 90 (history books)
        2 x 45 = 90 (philosophy books)
        total: 100 + 90 + 90 = 280 (instead of 295)
      • example 2
        GPT-4's limitations in judging chain-of-thought answers.
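The pairwise-comparison protocol above, together with a common mitigation for position bias, can be sketched as follows: judge the same pair twice with the answer order swapped, and keep the verdict only when it is consistent. The `judge` callable is a stand-in for a real LLM API call.

```python
# Sketch of pairwise LLM-as-judge evaluation with a position-swap
# consistency check (a common mitigation for position bias).
def judge_pair(judge, question, answer_a, answer_b):
    """Return 'A', 'B', or 'tie'; inconsistent verdicts are treated as a tie."""
    first = judge(question, answer_a, answer_b)   # 'A'/'B' as presented
    second = judge(question, answer_b, answer_a)  # same pair, order swapped
    if first == "A" and second == "B":
        return "A"   # the same underlying answer won in both orders
    if first == "B" and second == "A":
        return "B"
    return "tie"     # verdicts disagree -> position bias suspected

# Toy judge that always prefers the first-listed answer (pure position bias):
biased = lambda q, a, b: "A"
print(judge_pair(biased, "Q?", "short answer", "long answer"))  # -> tie
```

A consistent judge survives the swap: one that genuinely prefers a specific answer returns the same winner in both orders, and `judge_pair` reports it.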
         

4. How to Evaluate?

 

4.1 Automatic Evaluate

 
  • GPT-4 as evaluator
  • LLM-EVAL: 对于LLM open domain 对话的一种统一多维度评估
  • PandaLM:LLM as judge with reference
 
迅速, 大批量, 自动评估成本
 

4.2 Human Evaluate

 
Human evaluation usually yields higher-quality judgments, but because annotators differ, it can also be highly unstable.
 
Manual and high quality; the cost is human labor.
 

Reference

 
  1. A Survey on Evaluation of Large Language Models
  2. C-Eval (hkust-nlp/ceval)
  3. AlpacaEval (tatsu-lab/alpaca_eval)
  4. Self-Instruct: Aligning Language Models with Self-Generated Instructions
  5. OpenCompass
  6. PandaLM (WeOpenML/PandaLM)
  7. PromptBench (RobustBench/robustbench)
  8. LLM-EVAL