type: Post
status: Published
date: Jul 21, 2023
category: Knowledge

1. Why evaluate LLMs?

  1. LLMs are developing rapidly; evaluation helps us understand their strengths and weaknesses.
  2. LLMs are widely deployed and highly capable; before putting them into production, we need to understand their safety and reliability.
  3. LLM parameter counts keep growing and emergent abilities appear, so existing evaluation frameworks may no longer fully apply.
 

 

2. Examining LLMs, but how?

 
  • what to evaluate
  • where to evaluate
  • how to evaluate
 
 
 
 
detail
Natural language inference (NLI) is the task of determining whether a given "hypothesis" logically follows from a "premise".
Reasoning requires the model not only to understand the given information, but also to infer from the existing context when no direct answer is available.
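To make the NLI task concrete, here is a minimal sketch of how an NLI evaluation item can be formatted as a prompt and how a free-form completion can be mapped back to a label. The function names and prompt wording are illustrative, not taken from any specific benchmark.

```python
# Minimal sketch of an NLI-style evaluation item (names and prompt
# wording are illustrative, not from any specific benchmark).
LABELS = ("entailment", "contradiction", "neutral")

def format_nli_prompt(premise: str, hypothesis: str) -> str:
    """Build a prompt asking whether the hypothesis follows from the premise."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Question: Does the hypothesis logically follow from the premise? "
        "Answer with entailment, contradiction, or neutral.\nAnswer:"
    )

def parse_nli_answer(completion: str) -> str:
    """Map a free-form model completion onto one of the three NLI labels."""
    text = completion.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return "neutral"  # fall back when the model answers off-format

prompt = format_nli_prompt("A man is playing a guitar on stage.",
                           "Someone is performing music.")
print(parse_nli_answer("Entailment, because performing implies music."))
# -> entailment
```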
 
 
 
 

 

3. Evaluation Benchmarks

 

3.1 C-Eval

 
 

3.1.1 eval prompt design

 
 

3.1.2 result

 
C-Eval is a comprehensive Chinese foundation-model evaluation dataset covering 52 subjects across four difficulty levels. Baichuan used the dataset's dev split as the source of few-shot examples and ran 5-shot evaluation on the test split.
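The 5-shot setup described above can be sketched as follows: dev-split items supply the in-context examples, and the test item is appended without its answer. The field names below are an assumption for illustration, not C-Eval's actual schema.

```python
# Hedged sketch of 5-shot prompt construction in the C-Eval style.
# The dict field names ('question', 'A'..'D', 'answer') are illustrative.
def build_few_shot_prompt(dev_items, test_item, k=5):
    """Concatenate k solved dev examples, then the unanswered test item."""
    parts = []
    for ex in dev_items[:k]:
        parts.append(
            f"{ex['question']}\nA. {ex['A']}\nB. {ex['B']}\n"
            f"C. {ex['C']}\nD. {ex['D']}\nAnswer: {ex['answer']}\n"
        )
    parts.append(
        f"{test_item['question']}\nA. {test_item['A']}\nB. {test_item['B']}\n"
        f"C. {test_item['C']}\nD. {test_item['D']}\nAnswer:"
    )
    return "\n".join(parts)

dev = [{"question": f"Q{i}?", "A": "a", "B": "b", "C": "c", "D": "d",
        "answer": "A"} for i in range(5)]
test = {"question": "Which option is correct?",
        "A": "x", "B": "y", "C": "z", "D": "w"}
prompt = build_few_shot_prompt(dev, test)
print(prompt.count("Answer:"))  # -> 6: five solved shots plus the test stub
```

The model's next token after the final "Answer:" is then scored against the gold option letter.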
 
| Model | 5-shot Average |
| --- | --- |
| GPT-4 | 68.7 |
| ChatGPT | 54.4 |
| Claude-v1.3 | 54.2 |
| Claude-instant-v1.0 | 45.9 |
| BLOOMZ-7B | 35.7 |
| ChatGLM-6B | 34.5 |
| Ziya-LLaMA-13B-pretrain | 30.2 |
| moss-moon-003-base (16B) | 27.4 |
| LLaMA-7B-hf | 27.1 |
| Falcon-7B | 25.8 |
| TigerBot-7B-base | 25.7 |
| Aquila-7B* | 25.5 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 |
| BLOOM-7B | 22.8 |
| Baichuan-7B | 42.8 |
| Vicuna-7B | 31.7 |
| Chinese-Vicuna-7B | 27.0 |
 
 

3.2 AlpacaEval

 
An automatic evaluation pipeline aimed at instruction-following models (ChatGPT etc.); its dataset has been validated against 20K human annotations (data labeled by 16 crowd workers).
 
  • data: 20k human annotations
  • auto evaluator: model quality is measured by how often a recognized strong LLM (GPT-4, Claude, ChatGPT) prefers the candidate model's output over that of a reference model.
    • reference model: davinci-003
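The win-rate computation behind this auto evaluator can be sketched as follows. The `win_rate` function and tie handling here are illustrative stand-ins; the real pipeline queries GPT-4/Claude as the judge.

```python
# Sketch of AlpacaEval-style win-rate computation: an LLM judge compares
# each candidate output against the reference model's output, and we count
# how often the candidate is preferred. Tie handling (0.5) is an assumption.
def win_rate(judgments):
    """judgments: list of 'model', 'reference', or 'tie' verdicts."""
    score = sum(1.0 if j == "model" else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)

# Example: the judge preferred the candidate in 3 of 5 comparisons,
# tied once, and preferred the reference model once.
print(win_rate(["model", "model", "tie", "reference", "model"]))  # -> 0.7
```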
 
 
pairwise evaluation: for each instruction, the candidate model's output and the reference model's output are compared head-to-head.
 
 
limitations:
  1. The instructions may not be representative of real-world usage.
  2. Dataset bias: data annotated by 16 crowd workers carries positional and length preferences.
  3. AlpacaEval does not validate whether win rate against a reference model is a good evaluation strategy.
  4. The preferences of 16 crowd workers do not represent everyone's.
 

3.3 PandaLM

 
🎃
PandaLM: evaluating large language models
 
 
 

3.4 PromptBench

 
Similar to RobustBench (a standardized adversarial robustness benchmark).
 
PromptBench provides a convenient infrastructure to simulate black-box adversarial prompt attacks on models and evaluate their performance.

Underlying assumption: existing LLMs are sensitive to adversarial prompts.
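One simple family of such attacks is character-level perturbation of the task instruction; accuracy is then compared before and after the attack. The sketch below is an illustrative typo-style attack, not PromptBench's actual implementation.

```python
# Illustrative sketch of a black-box prompt attack: randomly swap adjacent
# characters in the instruction (a typo-style perturbation), then measure
# how much the model's task accuracy drops on the perturbed prompt.
import random

def char_swap_attack(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap each letter with its right neighbor with probability `rate`."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

clean = "Classify the sentiment of the following review as positive or negative."
attacked = char_swap_attack(clean, rate=0.5)
print(attacked)  # same characters, perturbed order
```

The attack preserves the character multiset, so any accuracy drop is attributable to the model's sensitivity rather than lost information.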
 
 
 

3.5 AGIEval

 
  • cloze tasks
  • multi-choice question answering tasks
 
 
 

3.6 OpenCompass

 
 

3.7 Chatbot Arena

 
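Chatbot Arena ranks models from crowdsourced pairwise battles using Elo-style ratings. A minimal sketch of the standard Elo update (starting ratings and K-factor here are illustrative):

```python
# Minimal sketch of the Elo rating update used for pairwise model battles.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at 1000; model A wins one battle.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(round(a), round(b))  # -> 1016 984
```

Aggregating many such updates over user votes yields the leaderboard ranking.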
 

3.8 LLM as judge

 

 
Key elements in evaluating an LLM:
  1. data
  2. evaluator
  3. metric
 

 
  • type:
    • Pairwise comparison: An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie.
      • prompt eg.
    • Single answer grading: an LLM judge is asked to directly assign a score to a single answer.
      • prompt eg.
    • Reference-guided grading: in certain cases, it may be beneficial to provide a reference solution to the LLM judge.
      • prompt eg.
  • limitation:
    • Position bias: LLM judges tend to favor answers in specific positions (e.g. the first answer presented), regardless of content.
      • example
    • Verbosity bias: LLM judges favor longer, more detailed answers; a long, redundant answer often beats a short, concise one.
      • example
        Assistant A's answer is identical to Assistant B's except for two rephrased items (highlighted in red). Both GPT-3.5 and Claude-v1 show a verbosity bias toward the longer, repetitive answer; only GPT-4 successfully detects this attack.
    • Self-enhancement bias: LLMs may favor answers they themselves generated.
    • Limited grading capability: for specialized grading tasks such as math and reasoning, where LLMs (including GPT-4) are themselves weaker, their grading ability is correspondingly limited.
      • example 1
        GPT-4 misjudges even basic math problems:
        5 x 20 = 100 (sci-fi novels)
        3 x 30 = 90 (history books)
        2 x 45 = 90 (philosophy books)
        total: 100 + 90 + 90 = 280 (instead of 295)
      • example 2
        GPT-4's limitations in judging chain-of-thought answers.
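The pairwise-comparison protocol above, together with a common mitigation for position bias, can be sketched as follows: judge the same pair twice with the answer order swapped, and keep the verdict only when it is consistent. The `judge` callable is a stand-in for a real LLM API call.

```python
# Sketch of pairwise LLM-as-judge evaluation with a position-swap
# consistency check (a common mitigation for position bias).
def judge_pair(judge, question, answer_a, answer_b):
    """Return 'A', 'B', or 'tie'; inconsistent verdicts are treated as a tie."""
    first = judge(question, answer_a, answer_b)   # 'A'/'B' as presented
    second = judge(question, answer_b, answer_a)  # same pair, order swapped
    if first == "A" and second == "B":
        return "A"   # the same underlying answer won in both orders
    if first == "B" and second == "A":
        return "B"
    return "tie"     # verdicts disagree -> position bias suspected

# Toy judge that always prefers the first-listed answer (pure position bias):
biased = lambda q, a, b: "A"
print(judge_pair(biased, "Q?", "short answer", "long answer"))  # -> tie
```

A consistent judge survives the swap: one that genuinely prefers a specific answer returns the same winner in both orders, and `judge_pair` reports it.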
         

4. How to Evaluate?

 

4.1 Automatic Evaluate

 
  • GPT-4 as evaluator
  • LLM-EVAL: 对于LLM open domain 对话的一种统一多维度评估
  • PandaLM:LLM as judge with reference
 
迅速, 大批量, 自动评估成本
 

4.2 Human Evaluate

 
Human evaluation usually yields higher-quality judgments, but because annotators differ, it can also be highly unstable.
 
Manual and high quality; the cost is human labor.
 

Reference

 
  1. A Survey on Evaluation of Large Language Models
  2. C-Eval (hkust-nlp/ceval)
  3. AlpacaEval (tatsu-lab/alpaca_eval)
  4. Self-Instruct: Aligning Language Models with Self-Generated Instructions
  5. OpenCompass
  6. PandaLM (WeOpenML/PandaLM)
  7. PromptBench (RobustBench/robustbench)
  8. LLM-EVAL