Review of WizardLM | Gongtianxiang Blog

1.Highlight

https://tatsu-lab.github.io/alpaca_eval/
As of 2023/8/26

2.Motivation & Background

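Training LLMs on open-domain instruction data is currently the mainstream approach for large models, and it has produced strong results.

Creating this instruction data takes a great deal of human labor and time, and even more so for instructions that are difficult or complex.

ShareGPT

User-generated open-domain instruction data, but most of its questions are of easy or medium difficulty. One possible reason is that experts make up a small share of the users; another is that writing harder instructions takes more time and effort.

Is there a way to use an LLM, rather than human annotators, to mass-produce instruction data at different levels of difficulty and complexity?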

3.Contribution

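Proposes Evol-Instruct: a method that uses an LLM to generate open-domain instruction data

Builds an Evol-Instruct test set: a test set of instructions covering many skills and every difficulty level

Proposes the WizardLM model: a stronger LLM based on LLaMA, fine-tuned on data produced by Evol-Instruct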

4.Method

4.1 Evol-Instruct(Instruction Data Evolution)

4.1.1 What is Data Evolution?

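Start from an initial instruction dataset.

In each evolution round:

An LLM instruction evolution prompt is used to evolve every instruction in the current dataset into a new instruction; the LLM then generates the corresponding response, which yields the evolved dataset for that round.

After M iterations, this gives a sequence of M evolved datasets.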
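
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `evolve_instruction` and `generate_response` are hypothetical callables standing in for the two LLM calls, and the toy stubs below are not the paper's prompts.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def evolve_dataset(
    initial: List[Pair],
    evolve_instruction: Callable[[str], str],  # hypothetical: wraps the LLM evolution prompt
    generate_response: Callable[[str], str],   # hypothetical: wraps the LLM answering call
    rounds: int,
) -> List[List[Pair]]:
    """Return [initial, round 1, ..., round `rounds`]; each round evolves the previous one."""
    generations = [list(initial)]
    for _ in range(rounds):
        evolved = []
        for instruction, _old_response in generations[-1]:
            new_instruction = evolve_instruction(instruction)
            evolved.append((new_instruction, generate_response(new_instruction)))
        generations.append(evolved)
    return generations


# Toy stand-ins for the LLM calls, for illustration only.
def toy_evolve(instruction: str) -> str:
    return instruction + " Explain each step."


def toy_respond(instruction: str) -> str:
    return f"[answer to: {instruction}]"


generations = evolve_dataset([("What is 1+1?", "2")], toy_evolve, toy_respond, rounds=2)
```

With `rounds=2` the result holds three datasets: the initial one plus one per evolution round.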

4.1.2 Automatic Instruction Data Evolution

Pipeline:

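Instruction evolution

Response generation

Elimination evolving (filter out instructions that fail to evolve)

Instruction evolver: an LLM that evolves instructions using specific prompts

Two kinds of evolving prompts:

In-depth Evolving Prompts: strengthen an instruction so that it becomes more difficult and more complex

Five prompt types: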

add constraints
deepening
concretizing
increased reasoning steps
complicating input

Difficulty-growth control: each evolution step adds only a little difficulty

add constraints example:

I want you act as a Prompt Rewriter.

Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.

But the rewritten prompt must be reasonable and must be understood and responded by humans.

Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. Also, please do not omit the input in #Given Prompt#.

You SHOULD complicate the given prompt using the following method:

Please add one more constraints/requirements into #Given Prompt#

You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.

‘#Given Prompt#’, ‘#Rewritten Prompt#’, ‘given prompt’ and ‘rewritten prompt’ are not allowed to appear in #Rewritten Prompt#

#Given Prompt#:

#Rewritten Prompt#:

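In-breadth Evolving Prompts: increase topic coverage, skill coverage, and overall dataset diversity. A single prompt is designed to generate a completely new instruction based on the given one.

In-breadth prompt example: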

I want you act as a Prompt Creator.

Your goal is to draw inspiration from the #Given Prompt# to create a brand new prompt.

This new prompt should belong to the same domain as the #Given Prompt# but be even more rare.

The LENGTH and difficulty level of the #Created Prompt# should be similar to that of the #Given Prompt#.

The #Created Prompt# must be reasonable and must be understood and responded by humans.

‘#Given Prompt#’, ‘#Created Prompt#’, ‘given prompt’ and ‘created prompt’ are not allowed to appear in #Created Prompt#.

#Given Prompt#:

#Created Prompt#:
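
As a rough sketch of how such a template might be filled before being sent to the LLM: the header text below is copied from the in-breadth example above, while the function name `fill_template` and the exact placement of the input (directly under `#Given Prompt#:`) are assumptions made for illustration.

```python
# Header copied from the in-breadth example; the layout below it is assumed.
IN_BREADTH_HEADER = "\n".join([
    "I want you act as a Prompt Creator.",
    "Your goal is to draw inspiration from the #Given Prompt# to create a brand new prompt.",
    "This new prompt should belong to the same domain as the #Given Prompt# but be even more rare.",
    "The LENGTH and difficulty level of the #Created Prompt# should be similar to that of the #Given Prompt#.",
    "The #Created Prompt# must be reasonable and must be understood and responded by humans.",
    "‘#Given Prompt#’, ‘#Created Prompt#’, ‘given prompt’ and ‘created prompt’ "
    "are not allowed to appear in #Created Prompt#.",
])


def fill_template(header: str, given_prompt: str, output_tag: str) -> str:
    """Hypothetical helper: put the instruction under '#Given Prompt#:' and end with the output tag."""
    return f"{header}\n#Given Prompt#:\n{given_prompt}\n{output_tag}:\n"


prompt = fill_template(IN_BREADTH_HEADER, "1+1=?", "#Created Prompt#")
```

The same helper would serve an in-depth template by passing its header and `#Rewritten Prompt#` as the output tag.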

A case from ‘1+1=?’

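Response generation:

The same LLM that generated the instruction is used to generate the corresponding response.

Elimination evolving: four conditions that mark an evolution as failed:

Compared with the original instruction, the evolved instruction provides no information gain (ChatGPT is used to decide this).

The evolved instruction makes it hard for the LLM to produce a response.

e.g.:

When the generated response contains "sorry" and is relatively short (fewer than 80 words), the LLM usually struggled to respond to the evolved instruction.

The response generated by the LLM contains only punctuation and stop words.

The evolved instruction obviously copies some words from the evolution prompt.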
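
The four failure conditions can be approximated with simple checks. The sketch below is a minimal illustration, not the authors' filter: the stop-word list is a tiny made-up sample, the leaked markers are taken from the phrases the example prompts forbid, and `no_information_gain` is a hypothetical callable standing in for the ChatGPT-based judgment.

```python
import string
from typing import Callable

# Assumption: a tiny illustrative stop-word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "it"}
# Phrases the evolution prompts forbid in their output.
LEAKED_MARKERS = ("given prompt", "rewritten prompt", "created prompt")


def is_refusal(response: str, max_words: int = 80) -> bool:
    """Condition 2: the response contains 'sorry' and has fewer than 80 words."""
    return "sorry" in response.lower() and len(response.split()) < max_words


def is_empty_like(response: str) -> bool:
    """Condition 3: the response holds only punctuation and stop words."""
    tokens = [t.strip(string.punctuation).lower() for t in response.split()]
    return all(t == "" or t in STOP_WORDS for t in tokens)


def leaks_prompt_words(instruction: str) -> bool:
    """Condition 4: the evolved instruction copies wording from the evolution prompt."""
    lowered = instruction.lower()
    return any(marker in lowered for marker in LEAKED_MARKERS)


def evolution_failed(
    original: str,
    evolved: str,
    response: str,
    no_information_gain: Callable[[str, str], bool],  # hypothetical LLM-based judge
) -> bool:
    return (
        no_information_gain(original, evolved)
        or is_refusal(response)
        or is_empty_like(response)
        or leaks_prompt_words(evolved)
    )
```

Instructions for which `evolution_failed` returns `True` would be dropped from the evolved dataset.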

4.2 Fine-tuning LLM on Evolved Instruction

4.2.1 Evol Instruction Dataset

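The initial dataset and the data from every evolution round are merged, and the samples are randomly shuffled to form the fine-tuning dataset.

This ensures that instructions of different difficulty levels are evenly distributed across the dataset, which maximizes the smoothness of fine-tuning.

Initial dataset: Alpaca 52K

Number of evolution rounds M: 4

250K instructions in total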
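
The merge-and-shuffle step is straightforward; a minimal sketch follows, with the function name and the fixed seed chosen only for illustration. As a sanity check on the numbers: if the initial 52K set and all four evolved rounds were kept in full, the merged set would hold 52K × 5 = 260K instructions, so a total of 250K would be consistent with roughly 10K evolved instructions being removed by the elimination step, although the post does not state this breakdown.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def build_finetuning_set(generations: List[List[Pair]], seed: int = 0) -> List[Pair]:
    """Flatten the initial and evolved datasets, then shuffle so difficulty levels are mixed."""
    merged = [pair for generation in generations for pair in generation]
    random.Random(seed).shuffle(merged)  # seeded only to make the sketch reproducible
    return merged


# Toy data: an initial set followed by two evolved rounds.
generations = [
    [("q0", "a0"), ("q1", "a1")],
    [("q0 evolved", "b0")],
    [("q0 evolved twice", "c0")],
]
train_set = build_finetuning_set(generations)
```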

4.2.2 WizardLM

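The Evol-Instruct data generated above is used to fine-tune LLaMA-7B, which yields WizardLM.

For the exact training parameters and further details, see Section 4.2 of the paper.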

4.3 Test

4.3.1 Testset

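The Evol-Instruct test set consists of real human instructions collected from a variety of sources, such as online open-source projects, platforms, and forums.

The test set covers 29 distinct skills that represent the main needs of human users, such as code generation and debugging, math, reasoning, and writing.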

4.3.2 Human Evaluation

On the Evol-Instruct test set, the Vicuna test set, and the high-difficulty subset of the Evol-Instruct test set, 10 well-educated annotators were hired to score (1-5) the responses from four models, Alpaca, Vicuna-7b, WizardLM, and ChatGPT, with the sources hidden.

- WizardLM performs clearly better than Alpaca and Vicuna-7b, which demonstrates the effectiveness of Evol-Instruct.
- On high-difficulty instructions (difficulty level >= 8), human annotators prefer WizardLM over ChatGPT.

4.3.3 GPT-4 Automatic Evaluation

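The GPT-4-based automatic evaluation framework proposed by Vicuna is used to assess LLM performance.

The GPT-4 hyperparameters, prompt settings, and evaluation procedure are the same as Vicuna's.

To reduce order bias, WizardLM and the other model alternate positions in the pairwise comparisons: WizardLM comes first for odd ids and second for even ids.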
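
The alternation rule amounts to a parity check on the question id. A minimal sketch, with the function name `order_pair` chosen only for illustration:

```python
from typing import Tuple


def order_pair(question_id: int, wizardlm_answer: str, other_answer: str) -> Tuple[str, str]:
    """Return (first, second) as shown to the judge: WizardLM first on odd ids, second on even ids."""
    if question_id % 2 == 1:
        return wizardlm_answer, other_answer
    return other_answer, wizardlm_answer


first, second = order_pair(7, "WizardLM answer", "Vicuna answer")
```

When reading the judge's scores back, the same parity check tells which position belonged to WizardLM.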

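WizardLM outperforms Vicuna at every difficulty level, outperforms Alpaca on both Easy and Hard skills, and reaches almost 88% of ChatGPT's performance on Hard skills. This suggests that WizardLM has the potential to handle complex problems and to reduce the human effort needed to collect complex data for LLM training.

The Evol-Instruct dataset is more difficult and more complex than ShareGPT and Alpaca, and its difficulty distribution contains a larger share of complex and hard examples.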

5.Conclusion

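The paper presents Evol-Instruct, an instruction-evolution algorithm for generating more diverse and more complex instructions. WizardLM, a model trained on data generated with this method, achieves SOTA results on high-complexity tasks and competitive results on the other metrics.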

6. Extension:

6.1 WizardMath:

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

A math LLM that combines reinforcement learning with Evol-Instruct

What is RLEIF (Reinforcement Learning from Evol-Instruct Feedback)?

6.2 WizardCoder

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Extends Evol-Instruct to the domain of code generation