Evaluating Large Language Models with PandaLM | Gongtianxiang Blog

PandaLM repository:

https://github.com/WeOpenML/PandaLM

Contents: What is PandaLM · Advantages of PandaLM · PandaLM Training Data · Where PandaLM can improve · PandaLM performance · How PandaLM works

What is PandaLM

Currently, there are two main ways to evaluate large AI models:

Scoring by GPT-4

Human evaluation

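GPT-4 is undeniably strong, but its high price and hard-to-obtain API make it difficult to use as an evaluator. Human evaluation, on the other hand, is hard to make fair and comprehensive. A broadly applicable framework for evaluating large AI models is therefore badly needed, and PandaLM was created to meet that need.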

In one sentence: PandaLM is a judge LLM used to evaluate other large language models.

Advantages of PandaLM

Grounded judgments

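PandaLM compares the answers that several large language models give to the same set of questions, picks the better answer in each pair, and presents the basis for its judgment in the form of a reference. The pairwise better/worse relations between the models are then assembled into a DAG, from which one can read off which model performs better.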

Note: the verdicts may depend on the choice of question dataset and on PandaLM's own discriminative ability.

Relatively comprehensive evaluation

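Compared with human evaluation, PandaLM ships with a fairly comprehensive question dataset. You can also build your own question dataset, or use a dataset of LLM answers scored by GPT-4, to fine-tune PandaLM further.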

Accessibility

Unlike OpenAI's closed GPT-4, PandaLM's model and evaluation code can be obtained easily from GitHub.

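In testing, PandaLM needs at most 23 GB of GPU memory while comparing model responses. The rest of the GPU memory cost of an evaluation comes from running inference on the models being evaluated: for example, vicuna-7b-v1.1 uses about 19 GB during inference, and LLaMA-7b also uses about 19 GB.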

PandaLM Training Data

300k sample responses generated by LLMs (LLaMA-7B, Bloom-7B, Cerebras-GPT-6.7B, OPT-7B, and Pythia-6.9B) and annotated by crowd-workers

Where PandaLM can improve

Multilingual support

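At present PandaLM can only evaluate a small number of languages [English + French]; evaluation in other languages, such as Chinese, is not yet supported.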

Chinese evaluation could be enabled by generating a dataset with GPT-4 and fine-tuning PandaLM on it.

Evaluation dimensions

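PandaLM offers one way to evaluate large models, but the dataset it provides is small and does not cover every dimension of LLM evaluation, for example mathematical ability (included, but sparsely), reasoning ability, and coding ability. Users can, however, build evaluation data for a specific need and fine-tune PandaLM on it to give it the corresponding ability.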

PandaLM performance

Source: PandaLM GitHub

Model	Open Source	Reproducibility	Security	Access
ChatGPT	❌	❌	❌	Limited
PandaLM	✅	✅	✅	Unlimited

Model	Accuracy	Precision	Recall	F1-score
gpt-3.5-turbo	71.07	58.79	57.36	57.55
PandaLM-7B	66.77	57.38	57.50	57.43
PandaLM-13B	-	-	-	-
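
For readers who want to see how the four columns relate, here is a minimal sketch that computes accuracy and macro-averaged precision, recall, and F1 for three-way win/lose/tie judgments against reference labels. The use of macro averaging, the label names, and the macro_metrics helper are assumptions for illustration; the table above does not state how its figures were aggregated.

```python
from typing import List, Sequence, Tuple


def macro_metrics(
    gold: List[str],
    pred: List[str],
    labels: Sequence[str] = ("win", "lose", "tie"),
) -> Tuple[float, float, float, float]:
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels, made up for illustration (not PandaLM data).
gold = ["win", "win", "lose", "tie"]
pred = ["win", "lose", "lose", "tie"]
accuracy, precision, recall, f1 = macro_metrics(gold, pred)
print(f"{accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}")
```

Macro averaging weights the three classes equally, which is one reason accuracy and F1 can diverge when ties are rare.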

How PandaLM works

PandaLM first needs an evaluation dataset to be specified, for example pipeline-sanity-check.json:


[
    {   
        "motivation_app": "Google Meet",
        "instruction": "Summarize a meeting from the given list of bullet points. Be sure to convert shorthand into a first-hand account.",
        "input": "Rose: Analyze data and presents findings\nJohn: propose new idea\nJane: appointed to head project\nTom: need more time to fix software bug"
    },  
    {   
        "motivation_app": "Amazon",
        "instruction": "Make a list of adjectives that can be used to describe the given brand.",
        "input": "a creative tech startup"
    },  
    {   
        "motivation_app": "Wolfram alpha",
        "instruction": "Verify the correctness of the given statement.",
        "input": "\"For all integers j and k, if j and k are odd, then jk is odd.\""
    },  
    {   
        "motivation_app": "Amazon",
        "instruction": "What other Amazon products might interest someone who visited the given product?",
        "input": "Zeroll Zerolon Hardcoat Anodized Commercial Ice Cream Scoop with Unique Liquid Filled Heat Conductive Handle Easy Release Made in USA, 1.5-Ounce, Black"
    },  
    {   
        "motivation_app": "Coursera",
        "instruction": "Come up with the courses that one is supposed to take in order to be an expert in a given field.",
        "input": "Graphic Design"
    },  
    {   
        "motivation_app": "Google Search",
        "instruction": "Based on the given query, suggest some related search queries.",
        "input": "learning french"
    },  
    {   
        "motivation_app": "Jira",
        "instruction": "A user story is an informal, general explanation of a software feature written from the perspective of the end user or customer. Write a user story for a given software.",
        "input": "Gmail"
    },
    ...
]

The instruction and input fields are used as the input to each LLM, which produces a response (or output) for every entry.
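
As a rough illustration of this step, the sketch below loads records in the same shape as pipeline-sanity-check.json and turns each instruction/input pair into a prompt string. The build_prompt helper and the Alpaca-style template are assumptions made for illustration; they are not taken from the PandaLM code, which may format prompts differently.

```python
import json

# Two records copied from the dataset excerpt above, inlined so the sketch
# runs without the file on disk.
DATASET_JSON = """
[
    {
        "motivation_app": "Amazon",
        "instruction": "Make a list of adjectives that can be used to describe the given brand.",
        "input": "a creative tech startup"
    },
    {
        "motivation_app": "Google Search",
        "instruction": "Based on the given query, suggest some related search queries.",
        "input": "learning french"
    }
]
"""


def build_prompt(item: dict) -> str:
    """Hypothetical helper: join instruction and input into one prompt."""
    instruction = item["instruction"]
    user_input = item.get("input", "")
    if user_input:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{user_input}\n\n### Response:\n"
        )
    # Records without an input only carry the instruction.
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


items = json.loads(DATASET_JSON)
prompts = [build_prompt(item) for item in items]
print(prompts[0])
```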

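The LLMs to be evaluated can be seen as a set { vicuna, llama, alpaca, ... }

During evaluation, PandaLM runs inference on the evaluation dataset with every LLM to obtain the corresponding responses, and it runs the LLMs serially, one after another.

Once every LLM has produced its responses, we have a list of response sets: [ vicuna_responses, llama_responses, alpaca_responses ]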
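
The serial collection step can be sketched as follows. The model names come from the text, but the collect_responses helper and the lambda "models" are hypothetical stand-ins; a real run would load each LLM and call its generation routine instead.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real LLM: maps a prompt to a response.
Generate = Callable[[str], str]


def collect_responses(
    prompts: List[str], models: Dict[str, Generate]
) -> Dict[str, List[str]]:
    """Run every model over every prompt, one model at a time."""
    responses: Dict[str, List[str]] = {}
    for name, generate in models.items():
        # Serial execution: the next model starts only after this one
        # has answered all prompts.
        responses[name] = [generate(prompt) for prompt in prompts]
    return responses


# Toy models that only echo the prompt, used so the sketch is runnable.
toy_models: Dict[str, Generate] = {
    "vicuna": lambda prompt: f"vicuna answer to: {prompt}",
    "llama": lambda prompt: f"llama answer to: {prompt}",
    "alpaca": lambda prompt: f"alpaca answer to: {prompt}",
}

prompts = ["Suggest related search queries for: learning french"]
responses = collect_responses(prompts, toy_models)
for name, outputs in responses.items():
    print(name, outputs)
```

A likely benefit of running the models one at a time is that only one evaluated model has to occupy GPU memory at once.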

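PandaLM evaluates every pair of response sets: vicuna against llama, vicuna against alpaca, and llama against alpaca. For each pair of responses to the same input, the verdict is one of win, lose, or tie. Every pair of LLMs therefore ends up with a triple (num_win, num_lose, num_tie), and these triples can be used to build a DAG (a partial order) over the LLMs, which serves as the model evaluation.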
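
A minimal sketch of the pairwise tally is shown here. The tally_pairwise helper and toy_judge are hypothetical: toy_judge simply prefers the longer response so that the code runs, whereas the real judge is the PandaLM model, which also sees the instruction and input.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical judge signature: returns "win", "lose" or "tie" for the
# first response relative to the second.
Judge = Callable[[str, str], str]


def tally_pairwise(
    responses: Dict[str, List[str]], judge: Judge
) -> Dict[Tuple[str, str], Tuple[int, int, int]]:
    """Return (num_win, num_lose, num_tie) for every pair of models."""
    results = {}
    for model_a, model_b in combinations(responses, 2):
        verdicts = Counter(
            judge(resp_a, resp_b)
            for resp_a, resp_b in zip(responses[model_a], responses[model_b])
        )
        results[(model_a, model_b)] = (
            verdicts["win"],
            verdicts["lose"],
            verdicts["tie"],
        )
    return results


def toy_judge(resp_a: str, resp_b: str) -> str:
    """Placeholder rule (longer response wins); NOT how PandaLM judges."""
    if len(resp_a) > len(resp_b):
        return "win"
    if len(resp_a) < len(resp_b):
        return "lose"
    return "tie"


# Made-up responses to three inputs, one list per model.
responses = {
    "vicuna": ["aaa", "bb", "c"],
    "llama": ["aa", "bb", "cc"],
    "alpaca": ["a", "b", "c"],
}
results = tally_pairwise(responses, toy_judge)
for pair, triple in results.items():
    print(pair, triple)
```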

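Pairwise results judged by PandaLM-7B. Each cell is (num_win, num_lose, num_tie) for the row model against the column model.

Model	LLaMA-7B	Bloom-7B	Cerebras-GPT-6.7B	OPT-7B	Pythia-6.9B
LLaMA-7B	-	(57,37,17)	(75,26,9)	(60,33,13)	(46,41,7)
Bloom-7B	-	-	(57,31,12)	(46,36,7)	(51,41,15)
Cerebras-GPT-6.7B	-	-	-	(37,45,9)	(33,52,6)
OPT-7B	-	-	-	-	(40,48,12)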
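
To make the DAG step concrete, the sketch below turns the triples in the table above into "beats" edges and prints a topological order. It assumes each triple is read as row model against column model, and that one model beats another when it has more wins than losses; that rule and the build_dag helper are assumptions for illustration, not necessarily PandaLM's exact procedure.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, Set, Tuple

# (row model, column model) -> (num_win, num_lose, num_tie), from the table.
results: Dict[Tuple[str, str], Tuple[int, int, int]] = {
    ("LLaMA-7B", "Bloom-7B"): (57, 37, 17),
    ("LLaMA-7B", "Cerebras-GPT-6.7B"): (75, 26, 9),
    ("LLaMA-7B", "OPT-7B"): (60, 33, 13),
    ("LLaMA-7B", "Pythia-6.9B"): (46, 41, 7),
    ("Bloom-7B", "Cerebras-GPT-6.7B"): (57, 31, 12),
    ("Bloom-7B", "OPT-7B"): (46, 36, 7),
    ("Bloom-7B", "Pythia-6.9B"): (51, 41, 15),
    ("Cerebras-GPT-6.7B", "OPT-7B"): (37, 45, 9),
    ("Cerebras-GPT-6.7B", "Pythia-6.9B"): (33, 52, 6),
    ("OPT-7B", "Pythia-6.9B"): (40, 48, 12),
}


def build_dag(
    results: Dict[Tuple[str, str], Tuple[int, int, int]]
) -> Dict[str, Set[str]]:
    """Map each model to the set of models that beat it."""
    beaten_by: Dict[str, Set[str]] = {}
    for (model_a, model_b), (win, lose, _tie) in results.items():
        beaten_by.setdefault(model_a, set())
        beaten_by.setdefault(model_b, set())
        if win > lose:
            beaten_by[model_b].add(model_a)
        elif lose > win:
            beaten_by[model_a].add(model_b)
        # Equal wins and losses add no edge: the pair stays unordered.
    return beaten_by


# static_order() yields a model only after every model that beats it.
ranking = list(TopologicalSorter(build_dag(results)).static_order())
print(" > ".join(ranking))
# LLaMA-7B > Bloom-7B > Pythia-6.9B > OPT-7B > Cerebras-GPT-6.7B
```

With this data every pair has a clear winner, so the order is unique. In general a majority rule like this can produce cycles (A beats B, B beats C, C beats A), in which case TopologicalSorter raises CycleError and the result is not a DAG.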