BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
1 Paper Walkthrough
1.1 Model Overview
- There are two steps in our framework: pre-training and fine-tuning.
BERT consists of a pre-training stage and a fine-tuning stage.
① During pre-training, the model is trained on unlabeled data over different pre-training tasks.
That is, pre-training requires only unlabeled text.
② For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.
The fine-tuned model is trained with supervision on the downstream task. Although every task starts from the same pre-trained model, each downstream task is fine-tuned separately, so each task ends up with its own fine-tuned BERT model.
In the figure below, the left side shows the pre-trained BERT model, and the right side shows the model fine-tuned for question-answer matching.
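As a rough illustration of the two-stage idea, here is a minimal sketch using the Hugging Face `transformers` library; the checkpoint name and the QA head are illustrative assumptions, not something prescribed by the paper:

```python
from transformers import BertModel, BertForQuestionAnswering

# Step 1 happens offline: "bert-base-uncased" is the result of pre-training
# on unlabeled text. Loading it initializes the encoder with those weights.
encoder = BertModel.from_pretrained("bert-base-uncased")

# Step 2: for each downstream task, a separate model is built on top of the
# same pre-trained weights, and then *all* parameters are fine-tuned with
# labeled data (here a QA-style head, matching the figure described above).
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
```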
1.2 Model Architecture
- BERT’s model architecture is a multi-layer bidirectional Transformer encoder
BERT's architecture is based on the encoder of the original Transformer.
Common notation:
L: number of Transformer blocks (layers)
H: hidden size (embedding dimension)
A: number of self-attention heads in the multi-head attention mechanism
BERT-Base (L=12, H=768, A=12, Total Parameters=110M) and BERT-Large (L=24, H=1024, A=16, Total Parameters=340M).
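These hyperparameters map directly onto configuration fields in common implementations. A small sketch, assuming the Hugging Face `transformers` `BertConfig` (the field names come from that library, not the paper):

```python
from transformers import BertConfig, BertModel

# BERT-Base: L=12, H=768, A=12
config = BertConfig(
    num_hidden_layers=12,    # L: number of Transformer blocks
    hidden_size=768,         # H: hidden (embedding) size
    num_attention_heads=12,  # A: number of self-attention heads
)
model = BertModel(config)  # randomly initialized, roughly 110M parameters
print(sum(p.numel() for p in model.parameters()))
```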
Model input:
- The two sentences are concatenated into a single sequence, with a [SEP] token between them.
- A [CLS] token is placed at the first position; its final hidden state serves as the aggregate representation of the whole sequence.
- Segment tokens are built separately: each token is tagged with 0 or 1 to indicate whether it belongs to sentence A or sentence B.
For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
In other words, the input embedding of each token is the sum of three parts: the token embedding, the segment (sentence) embedding, and the position embedding, as shown in the figure below.
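A minimal sketch of how the three embeddings are summed, in plain PyTorch; the sizes are BERT-Base-like examples, and LayerNorm/dropout from the real embedding layer are omitted:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768  # illustrative BERT-Base-like sizes

token_emb    = nn.Embedding(vocab_size, hidden)
segment_emb  = nn.Embedding(2, hidden)         # sentence A -> 0, sentence B -> 1
position_emb = nn.Embedding(max_len, hidden)   # learned absolute positions

input_ids      = torch.tensor([[101, 2023, 2003, 102, 2008, 102]])  # [CLS] ... [SEP] ... [SEP]
token_type_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])                 # segment ids
position_ids   = torch.arange(input_ids.size(1)).unsqueeze(0)

# Input representation = token + segment + position embeddings
x = token_emb(input_ids) + segment_emb(token_type_ids) + position_emb(position_ids)
print(x.shape)  # torch.Size([1, 6, 768])
```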
1.3 Pre-training BERT
BERT is pre-trained on two tasks:
① Task #1: Masked LM (MLM)
② Task #2: Next Sentence Prediction (NSP)
Task #1: Masked LM
- In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
To learn a deep bidirectional representation, a certain percentage of the input tokens are masked at random, and the model is trained to predict those masked tokens.
- In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random.
15% of the WordPiece tokens in each sequence are masked at random.
- Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.
Although this masking scheme makes bidirectional pre-training possible, the [MASK] token never appears during fine-tuning, which creates a mismatch between pre-training and fine-tuning.
- The following strategy is used to mitigate this mismatch (see the sketch after this list):
① 15% of the tokens in each sequence are selected at random as prediction targets.
② Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and the remaining 10% are kept unchanged.
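A minimal sketch of this masking strategy over a list of token ids, in plain Python; special-token handling and the exact sampling details of the original implementation are simplified here:

```python
import random

MASK_ID = 103  # [MASK] id in the standard BERT WordPiece vocab (assumed here)

def mask_tokens(token_ids, vocab_size=30522, mask_prob=0.15):
    """Return (masked inputs, labels); label is -100 for positions not predicted."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:      # select ~15% of tokens as targets
            labels[i] = tok                  # the original token is the prediction target
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                    # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```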
Task #2: Next Sentence Prediction (NSP)
Background: many downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on the relationship between two sentences.
In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task.
Task design (a sampling sketch follows the list):
- For each training example, choose a sentence A and a sentence B.
- 50% of the time, B is the actual sentence that follows A (label = IsNext).
- 50% of the time, B is a random sentence from the corpus (label = NotNext).
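A minimal sketch of how NSP training pairs could be sampled, assuming the corpus is represented as a list of documents, each a list of sentence strings; the function name and data layout are illustrative:

```python
import random

def make_nsp_example(corpus):
    """corpus: list of documents, each a list of >= 2 sentence strings."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"          # actual next sentence
    else:
        other = random.choice(corpus)                 # random document
        sent_b, label = random.choice(other), "NotNext"
    return sent_a, sent_b, label
```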
1.4 Fine-tuning BERT
For applications involving text pairs, a common pattern is to independently encode each text before applying bidirectional cross attention. BERT instead uses the self-attention mechanism to unify these two stages, since encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences.
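As a sketch of this idea, both sentences can be fed to BERT as one concatenated sequence ([CLS] A [SEP] B [SEP]) and all parameters fine-tuned end to end. The example below assumes the Hugging Face `transformers` library; the checkpoint name, sentences, and label are placeholders:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Both sentences go into a single sequence, so self-attention attends across them.
enc = tokenizer("The cat sat on the mat.", "A cat is on a mat.", return_tensors="pt")
labels = torch.tensor([1])  # e.g. "related" for a pair-classification task

loss = model(**enc, labels=labels).loss
loss.backward()  # all parameters, encoder included, receive gradients
```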