InstructGPT in Practice: [DeepSpeed-Chat] Source Code Walkthrough (2/3): Supervised Finetuning, Reward Model Finetuning


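Preface

This is Part 2 of a three-part series and continues from Part 1. It walks through the more important parts of the source code for phase 1 and phase 2 of the three-phase training pipeline.
Although the official getting-started documentation demonstrates everything through sh scripts, and the repository does ship convenient sh scripts that can be run directly, I still recommend launching training through each phase's main.py (applications/DeepSpeed-Chat/training/step*/main.py), for two reasons. First, the official sh scripts call applications/DeepSpeed-Chat/train.py, which strictly restricts parameters such as the model choice. This makes the code safer, but it costs flexibility when the goal is to explore and learn (see the code block below). Passing arguments to main.py directly bypasses the restriction and lets you train with a lighter model. Second, the arguments of main.py sit closer to the underlying implementation, which helps in understanding the code in depth.
The walkthrough of each phase is therefore based on its main.py.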

# applications/DeepSpeed-Chat/train.py
parser.add_argument(
    "--actor-model",
    # The actor model can only be one of the "facebook/opt" series, from 1.3b to 66b.
    # Even 1.3b still needs a fair amount of GPU memory,
    # which is not friendly to users learning on weaker hardware.
    type=lambda x: x.replace("facebook/opt-", ""),
    default="1.3b",
    choices=("1.3b", "6.7b", "13b", "66b"),
    help="Which facebook/opt-* model to use for Actor (step 1)",
)
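
To make the restriction concrete, the sketch below rebuilds only the `--actor-model` option quoted above and checks which values argparse accepts. The helper `try_parse` is a hypothetical name introduced for illustration; it is not part of train.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild only the --actor-model option quoted above."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--actor-model",
        # strip the "facebook/opt-" prefix before the choices check
        type=lambda x: x.replace("facebook/opt-", ""),
        default="1.3b",
        choices=("1.3b", "6.7b", "13b", "66b"),
    )
    return parser


def try_parse(value: str):
    """Hypothetical helper: return the parsed size, or None if argparse rejects it."""
    try:
        return build_parser().parse_args(["--actor-model", value]).actor_model
    except SystemExit:
        # argparse reports the invalid choice on stderr and exits
        return None


if __name__ == "__main__":
    print(try_parse("facebook/opt-1.3b"))  # accepted
    print(try_parse("facebook/opt-125m"))  # rejected: "125m" is not in choices
```

A smaller model such as facebook/opt-125m is therefore rejected by this option, which is why the rest of this article goes through main.py instead.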

1 phase-1: Supervised Finetuning

1.1 Training data sample

Data format name	Description	Sample
chosen_sentence	The complete conversation preferred by humans, obtained by appending the preferred response chosen to the prompt; used in phase 1 and phase 2.	“Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a software company that develops, licenses, and supports software products,including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products.”

The model is trained on data of this form with an autoregressive language modeling task, which yields the SFT/ref (Supervised Fine-Tuning) model with better conversational ability.
More data formats are described in "1.2.1 Basic concepts of data formats" in Part 1.

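1.2 Training process

A quick note on the elements of the UML sequence diagram:
- arrows denote message passing: a solid line is a call, a dashed line is a return;
- alt denotes a conditional branch, and the content in the "[]" after it is the condition;
- loop denotes a loop;
- light-blue regions are the highlighted parts.

applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py

The overall phase 1 training process follows the UML sequence diagram (the numbers in parentheses correspond to the circled numbers in the diagram):

1. Load the tokenizer (1-2);
2. Load the base model (currently only some CausalLM models are supported) (3-4);
3. Decide whether to enable LoRA based on whether lora_dim (the low-rank dimension of LoRA) is set; if enabled, convert the base model structure to LoRA (detailed later) and return the converted model (5-6);
4. Check whether "only update LoRA parameters" is enabled; if so, freeze the parameters of the remaining structure and return the frozen model (7-8);
5. Get the Dataset (see Part 1 for the detailed flow) (9-10);
6. Instantiate the DataLoader (11);
7. Wrap the model and related objects with DeepSpeedEngine, DeepSpeed's optimization engine (12);
8. Before training starts, run an evaluation; the metric is perplexity (13-14);
9. Start training, looping over epochs:
   1. Loop over steps:
      1. Forward pass to get the loss (15-18); if LoRA is enabled, the forward pass also goes through the LoRA structure (16-17);
      2. Backward pass to compute gradients (19);
      3. Update the model parameters (gradient accumulation via gradient_accumulation_steps is managed automatically by DeepSpeedEngine and needs no special attention) (20);
   2. After each epoch, run an evaluation (21-22);
   3. Save the model (23).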

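1.3 Key code in detail

Several points in the process above deserve attention (the parts highlighted in the UML sequence diagram):

The basic structure of the base model: looking at the type of output head it uses is usually enough to tell what kind of model this phase trains;
The details of the LoRA structural conversion and its forward pass;
How phase 1 evaluates its metric.

The relevant source code is explained below.

1.3.1 Base model structure

The class used to load the base model already reveals the model structure (see the code block below). The model is built with transformers.AutoModelForCausalLM.from_pretrained(), so the phase 1 SFT (ref) model is a causal / autoregressive language model (CausalLM), and the task it is trained on is autoregressive language modeling, that is, maximizing the sequence likelihood
$\prod_{t=1}^{T} p(x_t|x_{<t})$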

# applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py
# The model is built by create_hf_model,
# with AutoModelForCausalLM passed as the model class.
model = create_hf_model(AutoModelForCausalLM, ···)

# applications/DeepSpeed-Chat/training/utils/model/model_utils.py
def create_hf_model(model_class, ···):
    ···
    # model_class=AutoModelForCausalLM
    model = model_class.from_pretrained(
        model_name_or_path,
        from_tf=bool(".ckpt" in model_name_or_path),
        config=model_config)
    ···

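1.3.2 LoRA structure and its forward pass

The general idea of LoRA (illustrated by a figure in the original post) is:

Add a bypass to the key parameter layers;
Keep the original parameters frozen and optimize the bypass parameters during training;
The sum of the original output $W x$ and the bypass output $B A x$ is the final output $h = W x + B A x$ .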
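
The formula $h = W x + B A x$ can be checked on a tiny example. The sketch below uses plain Python lists instead of tensors; the matrices and the helper names `matvec` and `lora_forward` are made up for illustration and are not taken from DeepSpeed-Chat.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scaling=1.0):
    """Return h = W x + scaling * B (A x)."""
    base = matvec(W, x)                  # frozen original path
    low_rank = matvec(B, matvec(A, x))   # trainable bypass: down-project, then up-project
    return [b + scaling * l for b, l in zip(base, low_rank)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen 2x3 weight
A = [[0.5, -0.5, 1.0]]                   # 1x3 down-projection (rank 1)
B_zero = [[0.0], [0.0]]                  # 2x1 up-projection at initialization
B_trained = [[1.0], [2.0]]               # 2x1 up-projection after some hypothetical training
x = [1.0, 2.0, 3.0]

h_init = lora_forward(W, A, B_zero, x)        # identical to W x
h_trained = lora_forward(W, A, B_trained, x)  # W x plus the bypass contribution
```

With the up-projection initialized to zero, the bypass contributes nothing at the start of training, so the converted model initially behaves exactly like the original one.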

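LoRA structure definition
The DeepSpeed-Chat implementation largely follows this idea. When the low-rank dimension lora_dim is set (for example lora_dim=128), LoRA training is considered enabled, and every linear layer whose name contains "decoder.layers." in the original model is converted into a LoRA layer. Specifically:

The weight parameter of the original structure is frozen;
Two new linear layers, lora_right_weight and lora_left_weight, are introduced; they first project down to lora_dim and then back up to the original dimension;
The LoRA layer implements two branches: one is the original structure with its frozen weight, the other is the newly introduced down-then-up pair of linear layers.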
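
To get a feel for how much the down-then-up pair saves, the following sketch counts trainable weights for one linear layer. The function name `lora_param_counts` is hypothetical, and the layer sizes are taken from the facebook/opt-125m printout shown later, combined with the example lora_dim=128.

```python
def lora_param_counts(in_features: int, out_features: int, lora_dim: int):
    """Return (weights in the frozen matrix, weights in the two LoRA matrices)."""
    full = in_features * out_features
    lora = in_features * lora_dim + lora_dim * out_features
    return full, lora


# attention projection of opt-125m: 768 -> 768
attn_full, attn_lora = lora_param_counts(768, 768, 128)
# feed-forward layer fc1 of opt-125m: 768 -> 3072
fc1_full, fc1_lora = lora_param_counts(768, 3072, 128)
```

For such a small model and a comparatively large lora_dim the saving is modest; the ratio improves as the hidden size grows or lora_dim shrinks.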

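# applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py
# Check whether LoRA mode is enabled
if args.lora_dim > 0:
    # If enabled, a LoRA bypass (two linear layers that project down and then up)
    # is added to every linear layer whose name contains "decoder.layers.".
    # These are mostly the attention and inner feed-forward linear layers.
    # Their weight parameters are frozen and the LoRA bypass parameters are optimized instead.
    args.lora_module_name = "decoder.layers."
    model = convert_linear_layer_to_lora(model, args.lora_module_name,
                                         args.lora_dim)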

# applications/DeepSpeed-Chat/training/utils/module/lora.py
def convert_linear_layer_to_lora(model,
                                 part_module_name,
                                 lora_dim=0,
                                 lora_scaling=1,
                                 lora_droppout=0):
    """Convert the linear layers whose name contains "decoder.layers." into LoRA layers."""
    # Collect the linear layers whose name contains part_module_name
    repalce_name = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and part_module_name in name:
            repalce_name.append(name)
    for name in repalce_name:
        # recursive_getattr fetches the original module from model by its attribute name
        module = recursive_getattr(model, name)
        # Instantiate the LoRA layer with the parameters of the original module
        tmp = LinearLayer_LoRA(
            module.weight, lora_dim, lora_scaling, lora_droppout,
            module.bias).to(module.weight.device).to(module.weight.dtype)
        # recursive_setattr replaces the corresponding attribute of model with the LoRA layer
        recursive_setattr(model, name, tmp)
    return model
    
# applications/DeepSpeed-Chat/training/utils/module/lora.py
class LinearLayer_LoRA(nn.Module):
    """The actual LoRA layer."""
    def __init__(...):
        ...
        # weight and bias are the parameters of the original module
        self.weight = weight
        self.bias = bias
        ···
        # Freeze the original weight
        self.weight.requires_grad = False
        ···
        self.lora_right_weight = nn.Parameter(torch.zeros(columns, lora_dim))
        self.lora_left_weight = nn.Parameter(torch.zeros(lora_dim, rows))
        ···
        # Initialize the parameters of the LoRA linear layers
        self.reset_parameters()

    def reset_parameters(self):
        """Initialize the parameters of the LoRA linear layers."""
        # The down-projection matrix uses Kaiming uniform initialization; with a=sqrt(5) this is
        # U(-sqrt(1/fan_in), +sqrt(1/fan_in)), with fan_in as PyTorch derives it from the tensor shape.
        # This differs from the N(0, sigma^2) normal initialization in the original LoRA definition.
        nn.init.kaiming_uniform_(self.lora_right_weight, a=math.sqrt(5))
        # The up-projection matrix is initialized to all zeros
        nn.init.zeros_(self.lora_left_weight)

    def forward(self, input):
        """LoRA forward pass."""
        ···
        else:
            return (F.linear(input, self.weight, self.bias)
                    + (self.lora_dropout(input) @ self.lora_right_weight
                       @ self.lora_left_weight) * self.lora_scaling)
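
The conversion above relies on two helpers that read and write a nested attribute by its dotted name. The sketch below shows one way such helpers could work. These are simplified stand-ins written for illustration, not the helpers the repository imports, and `Node` is a hypothetical replacement for a module with named children.

```python
from functools import reduce


def recursive_getattr(obj, dotted_name):
    """Follow a dotted path such as "decoder.attn.q_proj" attribute by attribute."""
    return reduce(getattr, dotted_name.split("."), obj)


def recursive_setattr(obj, dotted_name, value):
    """Replace the attribute at the end of a dotted path."""
    *parents, leaf = dotted_name.split(".")
    parent = reduce(getattr, parents, obj)  # walk down to the direct parent
    setattr(parent, leaf, value)


class Node:
    """Hypothetical stand-in for a module with named children."""

    def __init__(self, **children):
        for child_name, child in children.items():
            setattr(self, child_name, child)


model = Node(decoder=Node(attn=Node(q_proj="Linear")))
before = recursive_getattr(model, "decoder.attn.q_proj")
recursive_setattr(model, "decoder.attn.q_proj", "LinearLayer_LoRA")
after = recursive_getattr(model, "decoder.attn.q_proj")
```

Collecting the names first and replacing afterwards, as convert_linear_layer_to_lora does, avoids modifying the module tree while iterating over it.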

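After the LoRA conversion, the structure of the base model (here "facebook/opt-125m") looks as follows. Apart from the output head, essentially all linear layers have been replaced by LoRA layers, so the forward pass of the model now goes through the forward() method defined in LinearLayer_LoRA(nn.Module) (see the forward() part of the code block above).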

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): LinearLayer_LoRA(
              (lora_dropout): Identity()
            )
            (v_proj): LinearLayer_LoRA(
              (lora_dropout): Identity()
            )
            (q_proj): LinearLayer_LoRA(
              (lora_dropout): Identity()
            )
            (out_proj): LinearLayer_LoRA(
              (lora_dropout): Identity()
            )
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): LinearLayer_LoRA(
            (lora_dropout): Identity()
          )
          (fc2): LinearLayer_LoRA(
            (lora_dropout): Identity()
          )
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=768, out_features=50272, bias=False)
)

LoRA forward pass
The forward pass of the regular parts is defined by transformers, while the forward pass of the LoRA parts is defined by forward() of LinearLayer_LoRA(nn.Module) (see the code block below): the results of the two branches of the LoRA layer are summed. In code this is F.linear(input, self.weight, self.bias) + (self.lora_dropout(input) @ self.lora_right_weight @ self.lora_left_weight) * self.lora_scaling. The left side of the plus sign is the original branch and the right side is the added branch; self.lora_right_weight and self.lora_left_weight are the parameters of the two newly introduced linear layers.

# applications/DeepSpeed-Chat/training/utils/module/lora.py
class LinearLayer_LoRA(nn.Module):
    """The actual LoRA layer."""
    ···
    def forward(self, input):
        """LoRA forward pass."""
        ···
        else:
            return F.linear(
                input, self.weight,
                self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
                              @ self.lora_left_weight) * self.lora_scaling

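1.3.3 Phase 1 metric evaluation

DeepSpeed-Chat uses perplexity as the evaluation metric during phase 1 training. Note that perplexity is not an absolute criterion, and it may even disagree with actual behavior (perplexity is already low, yet the model's generation ability is still poor). The DeepSpeed-Chat team has also pointed this out:

Supervised fine-tuning (SFT) has indeed made significant progress in the field of large language models (LLMs). However, unexpected behaviors such as repeating content generation and inconsistency between perplexity (PPL) scores and generation capabilities can still occur.

Either way, the evaluation defined for phase 1 in the source code is based on perplexity, so it is still worth understanding how it is implemented.

Perplexity measures the performance of a language model, namely how well the trained model fits the test data. For each token of the output sentence the model gives a probability; perplexity is the reciprocal of the geometric mean of these probabilities. As a formula:
$\text{perplexity} = (\prod_{t=1}^{T} p_t)^{-\frac{1}{T}}$
where the output sentence has $T$ tokens and the probability of the $t$-th token is $p_t$ .

A CausalLM is usually trained with the negative log-likelihood loss, and the loss it outputs is:
$loss = -\frac{1}{T} \sum_{t=1}^{T}\log{p_t}$
where the output sentence has $T$ tokens and the probability of the $t$-th token is $p_t$ .

Perplexity and the CausalLM loss are therefore related by:
$\text{perplexity} = \exp(loss)$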
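
The relation can be verified numerically. The sketch below computes perplexity once from its definition and once as the exponential of the mean negative log-likelihood; the token probabilities are made up for illustration.

```python
import math


def perplexity_from_probs(probs):
    """Reciprocal of the geometric mean of the token probabilities."""
    return math.prod(probs) ** (-1.0 / len(probs))


def mean_nll(probs):
    """Mean negative log-likelihood, i.e. the CausalLM loss for one sentence."""
    return -sum(math.log(p) for p in probs) / len(probs)


token_probs = [0.5, 0.25, 0.8, 0.1]  # hypothetical p_t for a 4-token sentence
ppl_direct = perplexity_from_probs(token_probs)
ppl_from_loss = math.exp(mean_nll(token_probs))
```

Both routes give the same value up to floating-point error, which is why the source code only needs the loss to report perplexity.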

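The perplexity computation in the source code follows this formula: the validation data is fed to the model to obtain the loss, and perplexity is then computed from the exponential relation between perplexity and loss.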

    def evaluation(model, eval_dataloader):
        """Evaluate on the validation set with perplexity as the metric."""
        model.eval()
        losses = 0
        for step, batch in enumerate(eval_dataloader):
            # batch: a dict with input_ids, attention_mask and labels,
            # each of shape (bs, max_seq_len)
            batch = to_device(batch, device)
            with torch.no_grad():
                outputs = model(**batch)
            # The loss of a causal LM is the cross-entropy loss
            loss = outputs.loss
            losses += loss.float()
        losses = losses / (step + 1)
        try:
            # Perplexity can be computed as exp(cross-entropy loss)
            perplexity = torch.exp(losses)
        except OverflowError:
            perplexity = float("inf")
        try:
            # get_all_reduce_mean calls
            # torch.distributed.all_reduce(perplexity, op=torch.distributed.ReduceOp.SUM),
            # which sums perplexity over all processes (usually one process per GPU),
            # and then divides by the world size torch.distributed.get_world_size()
            # to get the average perplexity.
            perplexity = get_all_reduce_mean(perplexity).item()
        except:
            pass
        return perplexity
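
One side observation on the last step: averaging per-process perplexities is not the same as exponentiating the averaged loss. The sketch below, with made-up per-rank losses, illustrates the gap; it is an aside under the assumption that ranks see different losses, not a description of anything the repository does differently.

```python
import math


def mean_of_perplexities(rank_losses):
    """Average exp(loss) over ranks, as the all-reduce on perplexity does."""
    return sum(math.exp(loss) for loss in rank_losses) / len(rank_losses)


def perplexity_of_mean_loss(rank_losses):
    """Exponentiate the loss averaged over ranks."""
    return math.exp(sum(rank_losses) / len(rank_losses))


rank_losses = [2.0, 3.0]  # hypothetical mean losses on two ranks
ppl_a = mean_of_perplexities(rank_losses)
ppl_b = perplexity_of_mean_loss(rank_losses)
```

Because exp is convex, the first value is never smaller than the second; the two coincide only when all ranks report the same loss.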

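1.4 Instance testing

"Instance testing" and "metric evaluation" are not quite the same thing. In instance testing, specific data instances are fed to the model and the outputs are inspected by a human rather than scored with a metric. It exercises the whole path from the forward pass through decoding to the returned text. For example, if I feed the model a prompt, the instance test returns an answer text, and I judge the quality of that answer subjectively instead of using a metric.
To be completed…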

1.5 Extensions

1.5.1 Multi-turn conversation performance

To give the model better multi-turn conversation performance, its "potential" has to be considered first (with current techniques, the maximum sequence length supported by the model is that potential, although new long-context extension techniques may well appear). Beyond that, multi-turn performance depends mainly on the training data of this phase, so more multi-turn conversation data should be added here. In other words, the training data of this phase is not limited to single-turn conversations: multi-turn conversations can be used just as well, since a multi-turn conversation simply has a longer prompt. The table below shows single-turn and multi-turn samples.

Single-turn or multi-turn	Sample
Single-turn prompt	"Human: Please tell me about Microsoft in a few sentence? Assistant: "
Multi-turn prompt	“Human: I’m buying a new video game console for the first time since in a decade, but I forget their setups. What do I need in addition to a Playstation 5 in order to play the console? Assistant: You need to buy a Playstation 5 first. Then you’ll also need a TV with HDMI input. It’s possible that HDMI to HDMI cables will also work, but it’s more likely that you’ll need a physical HDMI cord. You also might want to buy an extra power adapter, since the ones that come with new Playstation 5’s are quite short. Are you looking to play on a PC or any other system? That might affect what other hardware you need to buy. Human: Playstation 5’s cables aren’t short, but that’s good information. Can any television with an HDMI input play PS5? Assistant:”
Single-turn chosen_sentence	“Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a software company that develops, licenses, and supports software products,including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products.”
Multi-turn chosen_sentence	“Human: I’m buying a new video game console for the first time since in a decade, but I forget their setups. What do I need in addition to a Playstation 5 in order to play the console? Assistant: You need to buy a Playstation 5 first. Then you’ll also need a TV with HDMI input. It’s possible that HDMI to HDMI cables will also work, but it’s more likely that you’ll need a physical HDMI cord. You also might want to buy an extra power adapter, since the ones that come with new Playstation 5’s are quite short. Are you looking to play on a PC or any other system? That might affect what other hardware you need to buy. Human: Playstation 5’s cables aren’t short, but that’s good information. Can any television with an HDMI input play PS5? Assistant: So you’ve got a Playstation 5 and a TV that you’re going to connect together with an HDMI cable, and you want to know if that’s going to work? It’s definitely possible for the two to work together, and you might need an additional power adapter if your TV only came with a shorter adapter. However, it may be difficult to determine if it will work for sure. This is one area where troubleshooting and making educated guesses may be necessary. You should still be able to easily use your console, but it may be necessary to troubleshoot first.”

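1.5.2 Training in this phase leans toward overfitting

According to the DeepSpeed-Chat team, and following the recommendation of InstructGPT, training in this phase should lean moderately toward overfitting (for example by using more epochs) to obtain better conversational ability. The team also found this to be particularly effective when finetuning smaller models such as OPT-1.3B.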

From InstructGPT work, it is recommended to train the model for overfitting (aka longer epochs) for better human-preferred answers. Through our exploration, we have found this to be particularly helpful for smaller model finetuning, such as OPT-1.3B.

1.6 Questions related to this section

None yet.

2 phase-2: Reward Model Finetuning

2.1 Training data sample

Data format name	Description	Sample
chosen_sentence	The complete conversation preferred by humans, obtained by appending the preferred response chosen to the prompt; used in phase 1 and phase 2.	“Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a software company that develops, licenses, and supports software products,including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products.”
reject_sentence	The complete conversation rejected by humans, obtained by appending the rejected response rejected to the prompt; used in phase 2.	“Human: Please tell me about Microsoft in a few sentence? Assistant: I’m not sure what you mean.”

The model is trained on pairs of this form with a ranking loss, which yields the RM (Reward Model) with human-like scoring ability.
More data formats are described in "1.2.1 Basic concepts of data formats" in Part 1.

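2.2 Training process

A quick note on the elements of the UML sequence diagram:
- arrows denote message passing: a solid line is a call, a dashed line is a return;
- alt denotes a conditional branch, and the content in the "[]" after it is the condition;
- loop denotes a loop;
- light-blue regions are the highlighted parts.

The overall phase 2 training process follows the UML sequence diagram (the numbers in parentheses correspond to the circled numbers in the diagram):

1. Load the tokenizer (1-2);
2. Load the model (rm_model), which involves some structural changes (3-8);
3. Decide whether to enable LoRA based on whether lora_dim (the low-rank dimension of LoRA) is set; if enabled, convert the base model structure to LoRA (detailed later) and return the converted model (9-10);
4. Check whether "only update LoRA parameters" is enabled; if so, freeze the parameters of the remaining structure and return the frozen model (11-12);
5. Get the Dataset (see Part 1 for the detailed flow) (13-14);
6. Instantiate the DataCollator, which further arranges the loaded data (15-16);
7. Instantiate the DataLoader (17);
8. Wrap rm_model and related objects with DeepSpeedEngine, DeepSpeed's optimization engine (18);
9. Before training starts, run an evaluation; the metric is the accuracy of the ranking result (19-20);
10. Start training, looping over epochs:
   1. Loop over steps:
      1. Forward pass to get the loss (21-26); if LoRA is enabled, the forward pass also goes through the LoRA structure (23-24);
      2. Backward pass to compute gradients (27);
      3. Update the model parameters (gradient accumulation via gradient_accumulation_steps is managed automatically by DeepSpeedEngine and needs no special attention) (28);
   2. After each epoch, run an evaluation (29-30);
   3. Save the model (31).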

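2.3 Key code in detail

Several points in the process above deserve attention (the parts highlighted in the UML sequence diagram):

The concrete structure of rm_model (RM);
What the phase 2 data collator DataCollatorReward does, which shows the input format rm_model expects;
How phase 2 evaluates its metric;
The forward pass of rm_model.

2.3.1 RM structure

The backbone of the specified model is first loaded with the AutoModel class of transformers (which does not define an output head). A linear layer that maps hidden_size down to 1 is then added as the output head of the backbone, producing one score for each position of the input sequence.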

# applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py
# rm_model is loaded through create_critic_model.
# By default dropout is disabled for rm_model.
rm_model = create_critic_model(···)

# applications/DeepSpeed-Chat/training/utils/model/model_utils.py
def create_critic_model(···):
    # The model is loaded with "AutoModel", so at this point critic_model is only the backbone
    critic_model = create_hf_model(AutoModel, ···)
    # Passing critic_model to RewardModel adds a linear output head,
    # so the structure of critic_model becomes "v_head + backbone"
    critic_model = RewardModel(critic_model, ···)
    ...
    return critic_model

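# applications/DeepSpeed-Chat/training/utils/model/reward_model.py
class RewardModel(nn.Module):
    """
    Adapts the loaded model to the form needed by the RewardModel.
    In short, the loaded backbone is used for feature extraction;
    the extracted features (the last-layer hidden_states at each position) are passed
    to a linear layer that outputs one value per position.
    That value is the score, so every position along max_seq_len gets one score.
    """
    def __init__(self, base_model, ...):
        super().__init__()
        ···
        if hasattr(self.config, "word_embed_proj_dim"):
            # For OPT models, word_embed_proj_dim is the output dimension of the embedding layer,
            # which in transformer models usually equals hidden_size.
            # v_head predicts scores from the backbone's output features hidden_state,
            # max_seq_len scores in total.
            self.v_head = nn.Linear(self.config.word_embed_proj_dim,
                                    1,
                                    bias=False)
        ···
        # base_model is the backbone, so the RM consists of one backbone and one linear layer
        self.rwtranrsformer = base_model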

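The RM structure is roughly as follows (the base model here is "facebook/opt-125m"). It consists of the backbone rwtranrsformer (spelled this way in the source code) and the output head v_head: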

RewardModel(
  (v_head): Linear(in_features=768, out_features=1, bias=False)
  (rwtranrsformer): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
)

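2.3.2 DataCollator and the input format required by the RM

Phase 2 uses DataCollatorReward() as its data_collator. A single example fetched in this phase is actually a chosen-rejected pair (see the code block below), so a batch of size batch_size contains batch_size pairs. The data_collator splits each pair into chosen_sentence and reject_sentence (one example becomes two), so the amount of data actually fed to the model per batch is "batch_size * 2".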

# applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py
# The data_collator used in phase 2 is DataCollatorReward()
data_collator = DataCollatorReward()

# applications/DeepSpeed-Chat/training/utils/data/data_utils.py
class DataCollatorReward:
    def __call__(self, data):
        """
        Further arrange the data fetched by the dataloader into a batch for the model.
        The layout of the argument data is described after this code block.
        """
        batch = {}
        # f is one tuple in data; elements 0 and 2 are the input_ids
        # of chosen_sentence and reject_sentence respectively
        batch["input_ids"] = torch.cat([f[0] for f in data] +
                                       [f[2] for f in data],
                                       dim=0)
        # f is one tuple in data; elements 1 and 3 are the attention_mask
        # of chosen_sentence and reject_sentence respectively
        batch["attention_mask"] = torch.cat([f[1] for f in data] +
                                            [f[3] for f in data],
                                            dim=0)
        # The layout of batch is described after this code block
        return batch

The input data is the list of items in one batch, where each element is one chosen-rejected pair:
	(
	 chosen_sentence_input_ids,
	 chosen_sentence_attention_mask,
	 reject_sentence_input_ids,
	 reject_sentence_attention_mask
	)

In each tuple, elements 0 and 2 are input_ids and elements 1 and 3 are attention_mask.

The output batch is a dict: {"input_ids": tensor([...]), "attention_mask": tensor([...])}
In each value, the chosen entries come first and the rejected entries come second:
	{
	"input_ids": [
				  chosen_sentence_1_input_ids,
				  chosen_sentence_2_input_ids,
				  ...,
				  reject_sentence_1_input_ids,
				  reject_sentence_2_input_ids,
				  ...
				 ],
	"attention_mask": [
					   chosen_sentence_1_attention_mask,
					   chosen_sentence_2_attention_mask,
					   ...,
					   reject_sentence_1_attention_mask,
					   reject_sentence_2_attention_mask,
					   ...
					  ]
	}
Once the batch is inside the model, splitting it into its first and second halves and lining them up recovers the corresponding chosen-rejected pairs.
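
The concatenate-then-split round trip can be sketched with plain Python lists in place of tensors. The functions `collate_reward` and `split_pairs` are hypothetical names for illustration; the real collator concatenates tensors with torch.cat along dimension 0.

```python
def collate_reward(data):
    """data: list of (chosen_ids, chosen_mask, reject_ids, reject_mask) tuples."""
    return {
        # all chosen entries first, then all rejected entries
        "input_ids": [f[0] for f in data] + [f[2] for f in data],
        "attention_mask": [f[1] for f in data] + [f[3] for f in data],
    }


def split_pairs(batch):
    """Recover (chosen, rejected) pairs by pairing the first half with the second half."""
    ids = batch["input_ids"]
    bs = len(ids) // 2  # number of pairs, half of the collated length
    return list(zip(ids[:bs], ids[bs:]))


data = [
    ([11, 22, 44], [1, 1, 1], [11, 22, 40], [1, 1, 1]),  # pair 1
    ([12, 23, 45], [1, 1, 1], [12, 23, 0], [1, 1, 0]),   # pair 2
]
batch = collate_reward(data)
pairs = split_pairs(batch)
```

Because the i-th chosen entry and the i-th rejected entry sit exactly bs positions apart, no extra bookkeeping is needed to match them up again.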

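2.3.3 RM forward pass and pairwise ranking loss

The forward pass of the RM is not complicated. In short:

The data goes through the backbone, giving the last-layer output features hidden_states of shape (bs*2, max_seq_len, hidden_size);
The features are then fed to the linear layer v_head, giving the scores rewards of shape (bs*2, max_seq_len).

The more involved parts are the computation of the pairwise ranking loss and the design of the score aggregation.

Pairwise Ranking Loss
$loss(\theta) = E_{(x, y_c, y_r) \sim{D}} [-\log(\sigma(r_\theta(x, y_c) - r_\theta(x, y_r)))]$
where $r_\theta$ is the RM, $x$ is the prompt, $y_c$ is chosen, $y_r$ is rejected, and $(x, y_c)$ and $(x, y_r)$ are chosen_sentence and reject_sentence respectively.
The loss aims to maximize the gap between the score of the "chosen / good / higher-ranked" sample and that of the "rejected / bad / lower-ranked" sample, which pushes $r_\theta$ to learn the corresponding ranking pattern.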
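
For a single pair of scalar scores the loss is easy to evaluate by hand. The sketch below uses the identity $-\log\sigma(d) = \log(1 + e^{-d})$; the function name `pairwise_ranking_loss` and the example scores are made up for illustration.

```python
import math


def pairwise_ranking_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(chosen - rejected)) for one pair of scalar rewards."""
    diff = chosen_reward - rejected_reward
    return math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))


loss_correct_order = pairwise_ranking_loss(2.0, 0.0)  # chosen scored higher
loss_tied = pairwise_ranking_loss(1.0, 1.0)           # no preference expressed
loss_wrong_order = pairwise_ranking_loss(0.0, 2.0)    # rejected scored higher
```

The loss shrinks toward zero as the chosen score pulls ahead, equals log 2 for a tie, and grows roughly linearly when the order is wrong.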
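In the DeepSpeed-Chat implementation, $r_\theta(x,y_c)$ and $r_\theta(x,y_r)$ are taken over the aligned part of the answers of chosen_sentence and reject_sentence. This is somewhat abstract in words, so the example below should help:

max_seq_len is 10, pad_token_id is 0,
and a chosen_sentence and a reject_sentence share the same prompt:
prompt: [11, 22, 33]
chosen_sentence: [11, 22, 33, 44, 55, 66, 0, 0, 0, 0]
reject_sentence: [11, 22, 33, 40, 50, 0, 0, 0, 0, 0]

"The aligned part of the two answers" means "neither prompt nor padding, but with lengths aligned":
chosen_truncated: [44, 55, 66]
reject_truncated: [40, 50, 0]

The answer of chosen_sentence is longer, so the corresponding part of reject_sentence is extended until it has the same length as the chosen part;
the same applies when the answer of reject_sentence is the longer one.

To obtain this aligned part, the code performs some rather obscure index operations. It is enough to understand that their purpose is to take the rewards of the aligned part of chosen_sentence and reject_sentence for the loss computation.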
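
The index logic can be reproduced on the toy example above. This is a simplified sketch that assumes padding appears only at the end and that the two sentences differ somewhere; the helper names `first_pad_index` and `aligned_span` are hypothetical and do not mirror the repository's tensor-based code line by line.

```python
def first_pad_index(ids, pad_id):
    """Index of the first pad token, or len(ids) if there is none."""
    for i, token in enumerate(ids):
        if token == pad_id:
            return i
    return len(ids)


def aligned_span(chosen_ids, rejected_ids, pad_id=0):
    """Return (divergence_ind, end_ind) delimiting the aligned answer part."""
    c_ind = first_pad_index(chosen_ids, pad_id)
    r_ind = first_pad_index(rejected_ids, pad_id)
    end_ind = max(c_ind, r_ind)  # cover the longer of the two answers
    # first position where the two sentences differ (assumes they do differ)
    divergence_ind = next(
        i for i, (c, r) in enumerate(zip(chosen_ids, rejected_ids)) if c != r)
    return divergence_ind, end_ind


chosen_sentence = [11, 22, 33, 44, 55, 66, 0, 0, 0, 0]
reject_sentence = [11, 22, 33, 40, 50, 0, 0, 0, 0, 0]
start, end = aligned_span(chosen_sentence, reject_sentence)
chosen_truncated = chosen_sentence[start:end]
reject_truncated = reject_sentence[start:end]
```

The same start and end indices are then applied to the per-position rewards, so the loss compares the two answers position by position over this span.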

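Conversation reward design
Although the pairwise ranking loss is computed from the rewards of the aligned part, the score the RM assigns to a conversation is the reward of the last valid token of the conversation text (usually the end-of-sequence token). The example below illustrates this.

pad_token_id = 0
conversation = [11, 22, 33, 44, 55, 66, 0, 0, 0, 0]
conversation_rewards = [2.01, 0.23, 2.89, 0.66, 0.33, 2.25, 0.36, 0.99, 1.32, 1.62]
The token with token_id 66 is the last valid token of the conversation,
and its reward "2.25" is used as the reward of the whole conversation.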
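
The selection rule in this example fits in a few lines. The sketch assumes padding only at the end of the sequence, and `end_token_reward` is a hypothetical helper name.

```python
def end_token_reward(ids, rewards, pad_id=0):
    """Reward at the position just before the first pad token."""
    first_pad = next((i for i, token in enumerate(ids) if token == pad_id), len(ids))
    return rewards[first_pad - 1]


conversation = [11, 22, 33, 44, 55, 66, 0, 0, 0, 0]
conversation_rewards = [2.01, 0.23, 2.89, 0.66, 0.33, 2.25, 0.36, 0.99, 1.32, 1.62]
score = end_token_reward(conversation, conversation_rewards)
```

Rewards at padding positions are still produced by v_head, but they are ignored when scoring the conversation.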

The overall code is as follows:

# applications/DeepSpeed-Chat/training/utils/model/reward_model.py
class RewardModel(nn.Module):
    def __init__(self, ···):
        ···
    ···
    def forward(self, input_ids=None, ···):
        # Get the output features of the backbone
        transformer_outputs = self.rwtranrsformer(···)
        # Take the output features of the last layer
        # hidden_states.shape: (bs*2, max_seq_len, hidden_size)
        hidden_states = transformer_outputs[0]
        # Feed the features to the linear layer to get the regressed scores
        # rewards.shape: (bs*2, max_seq_len)
        rewards = self.v_head(hidden_states).squeeze(-1)
        # As mentioned earlier, the real bs is half of the input batch dimension
        bs = input_ids.shape[0] // 2
        # Separate chosen and rejected
        chosen_ids = input_ids[:bs]
        rejected_ids = input_ids[bs:]
        chosen_rewards = rewards[:bs]
        rejected_rewards = rewards[bs:]

        chosen_mean_scores = []
        rejected_mean_scores = []
        loss = 0
        for i in range(bs):
            # Take the token ids and rewards of the chosen and rejected of the same pair
            # chosen_id.shape: (max_seq_len, )
            chosen_id = chosen_ids[i]
            rejected_id = rejected_ids[i]
            chosen_reward = chosen_rewards[i]
            rejected_reward = rejected_rewards[i]

            # The original code performs various index operations here.
            # For readability, and because they are only logical detours that are not
            # directly related to the underlying principle, they are omitted for now.

            # c_ind is the index of the first pad_token after the answer of chosen_sentence.
            # For example, with pad_token_id=0 and sentence [11,22,33,44,55,66,0,0,0,0],
            # c_ind is the index of the first pad_token, 6.
            c_ind = ···
            # r_ind is the same for reject_sentence
            r_ind = ···
            # end_ind is the larger of the two
            end_ind = max(c_ind, r_ind)
            # divergence_ind is the index of the first position where chosen and rejected differ,
            # which can be read as the first token where the two answers go their own way
            divergence_ind = ···

            # Take the rewards of both sentences from where they first differ up to where
            # generation ends. This is the "aligned part" mentioned in the earlier example.
            c_truncated_reward = chosen_reward[divergence_ind:end_ind]
            r_truncated_reward = rejected_reward[divergence_ind:end_ind]
            # (c_truncated_reward - r_truncated_reward).shape: (truncated_seq_len,)
            # The loss uses the rank loss form and is computed over the aligned part
            loss += -torch.log(
                torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()

            # Use the score at the position right before the first pad token
            # (the last valid token) as the reference score
            chosen_mean_scores.append(
                chosen_reward[c_ind - 1])  #use the end score for reference
            rejected_mean_scores.append(rejected_reward[r_ind - 1])

        loss = loss / bs
        chosen_mean_scores = torch.stack(chosen_mean_scores)
        rejected_mean_scores = torch.stack(rejected_mean_scores)

        # Return the loss and the reference scores
        return {
            "loss": loss,
            "chosen_mean_scores": chosen_mean_scores,
            "rejected_mean_scores": rejected_mean_scores,
        }
    ···

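2.3.4 Phase 2 metric evaluation

The evaluation metric DeepSpeed-Chat uses in phase 2 is the accuracy of the ranking. The main steps are:

Feed several chosen-rejected pairs (split into chosen_sentence and reject_sentence by the data_collator along the way) to the RM for inference and obtain the score of each sentence;
Compare the score of the chosen_sentence with that of the reject_sentence for the same prompt: if the chosen_sentence scores higher it counts as a "correct prediction", otherwise as a "wrong prediction";
Count the correct predictions and compute accuracy as the evaluation metric.
In addition, the evaluation also reports the average chosen_sentence score "scores" for reference.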
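
The accuracy rule reduces to counting strict wins of chosen over rejected. A minimal sketch with made-up scores follows; `ranking_accuracy` is a hypothetical helper name.

```python
def ranking_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen score is strictly higher."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


chosen_scores = [2.0, 0.5, 1.0, 3.0]     # hypothetical end-token scores
rejected_scores = [1.0, 0.7, 1.0, -1.0]
acc = ranking_accuracy(chosen_scores, rejected_scores)
mean_chosen = sum(chosen_scores) / len(chosen_scores)
```

Note that a tie counts as a wrong prediction under this rule, since the comparison is strict.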

def evaluation_reward(model, eval_dataloader):
    model.eval()
    # Number of correctly ranked pairs,
    # i.e. pairs with chosen_reward > rejected_reward
    correct_predictions = 0
    # Total number of predictions
    total_predictions = 0
    scores = 0
    for step, batch in enumerate(eval_dataloader):
        batch = to_device(batch, device)
        with torch.no_grad():
            # outputs: {'loss': tensor(),
            #           'chosen_mean_scores': tensor(bs,),
            #           'rejected_mean_scores': tensor(bs,)}
            outputs = model(**batch)

        # chosen.shape: (bs,)
        chosen = outputs["chosen_mean_scores"]
        # rejected.shape: (bs,)
        rejected = outputs["rejected_mean_scores"]
        # A pair is ranked correctly when the chosen score exceeds the rejected score
        correct_predictions += (chosen > rejected).sum()
        total_predictions += chosen.shape[0]
        # Accumulate the mean chosen score of each step
        scores += outputs["chosen_mean_scores"].mean().float()
        if step == 99:  # For faster evaluation and debugging
            break
    # Compute the accuracy
    acc = correct_predictions / total_predictions
    # Average the chosen score over the evaluated steps
    scores = scores / (step + 1)
    try:
        # Sum over processes and average
        acc = get_all_reduce_mean(acc).item()
        scores = get_all_reduce_mean(scores).item()
    except:
        pass
    return scores, acc

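2.4 Instance testing

"Instance testing" and "metric evaluation" are not quite the same thing. In instance testing, specific data instances are fed to the model and the outputs are inspected by a human rather than scored with a metric.
To be completed…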

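2.5 Extensions

2.5.1 Design of conversation reward aggregation

In the DeepSpeed-Chat implementation, the score the RM assigns to a conversation is the reward of the last token of the conversation text. This is not the only possible way to score a conversation. It is an open design choice that the DeepSpeed-Chat team happened to make, and users can define their own aggregation strategy, for example the mean reward over the answer part, or feeding the sequence of rewards through another fully connected layer to obtain an aggregated reward.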

In our implementation, we use either the end token of the sequence or the first padding token as the aggregated score and compare them. Others may also use the average score for the entire answer as an alternative.
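
As an illustration of the alternative mentioned in the quote, the sketch below averages the rewards over the answer span instead of taking the end-token reward. The span boundaries and the function name `mean_answer_reward` are assumptions made for this example (a prompt of 3 tokens followed by a 3-token answer), not something taken from the repository.

```python
def mean_answer_reward(rewards, answer_start, answer_end):
    """Average the per-token rewards over rewards[answer_start:answer_end]."""
    span = rewards[answer_start:answer_end]
    return sum(span) / len(span)


conversation_rewards = [2.01, 0.23, 2.89, 0.66, 0.33, 2.25, 0.36, 0.99, 1.32, 1.62]
# assumed layout: positions 0-2 are the prompt, 3-5 the answer, 6-9 padding
mean_score = mean_answer_reward(conversation_rewards, 3, 6)
end_score = conversation_rewards[5]
```

On this toy data the two strategies give clearly different scores, which shows that the aggregation choice can change how conversations are ranked.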

2.6 Questions related to this section

None yet.

Next

The training details of the RLHF phase are covered in Part 3.
