[Paper Reading Notes 75] P-Tuning v2


1. Basic Information

Title	Authors and affiliation	Source	Year
P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks	Xiao Liu et al., Tsinghua University	arXiv	2021


Paper link: https://arxiv.org/pdf/2110.07602.pdf

[1] Liu X, Ji K, Fu Y, et al. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks[J]. 2021.

Code: https://github.com/THUDM/P-tuning-v2

2. Key Points

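Research topic	Background	Core method	Highlights	Dataset	Conclusion	Paper type	Keywords
Fine-tuning large models	Prompt tuning has major limitations: it performs poorly on normal-sized pre-trained models, and existing prompt tuning cannot handle sequence labeling tasks, so it is not a general solution, especially on NLU tasks.	Proposes P-Tuning v2, an implementation of deep prompt tuning that is optimized and adapted for NLU.	Tuning only 0.1%-3% of the parameters reaches a level comparable to full fine-tuning.	SuperGLUE	Improves on P-tuning for models below 10B parameters; on NLU tasks such as named entity recognition, it reaches the level of fine-tuning with 0.1%-3% trainable parameters.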

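P-Tuning v2 has roughly ten times more tunable parameters than P-tuning and works well on ordinary-sized models. It lifts the performance of prompt tuning on smaller models to roughly the level of fine-tuning.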

[Figure: average accuracy on the RTE, BoolQ, and CB dev sets]

3. Model (Core Content)

3.1 Comparison of P-tuning and P-Tuning v2

[Figure: comparison of P-tuning and P-Tuning v2]

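Related work: Prefix-Tuning: Optimizing Continuous Prompts for Generation (Li and Liang, 2021).

Lester et al. (Google) proposed "prompt tuning".

P-tuning and the model of Lester et al. add prompts only at the embedding layer, whereas P-tuning v2 adds prompts at every layer.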

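Problems with the earlier methods:

a. The number of tunable parameters is limited.

b. The input embeddings have no direct connection to the model's output.

To address these problems, P-tuning v2 employs the idea of deep prompt tuning.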
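To make the capacity argument concrete, here is a back-of-the-envelope sketch comparing the number of trainable prompt parameters for shallow (embedding-only) and deep (every-layer) prompts. The BERT-large-like sizes (24 layers, hidden size 1024, roughly 335M backbone parameters) and the choice to store one key vector and one value vector per prompt token per layer are assumptions made for illustration, not details taken from these notes. The classification head is ignored.

```python
# Rough parameter-count comparison between shallow and deep prompt tuning.
# Assumptions (not from the notes): BERT-large-like sizes, and deep prompts
# stored as one key vector and one value vector per prompt token per layer.

NUM_LAYERS = 24
HIDDEN_SIZE = 1024
BACKBONE_PARAMS = 335_000_000  # approximate size of a BERT-large backbone


def shallow_prompt_params(prompt_len: int) -> int:
    """Prompts added only to the input embedding layer."""
    return prompt_len * HIDDEN_SIZE


def deep_prompt_params(prompt_len: int) -> int:
    """Prompts added to every layer (key + value vectors per layer)."""
    return NUM_LAYERS * 2 * prompt_len * HIDDEN_SIZE


for prompt_len in (20, 100):
    shallow = shallow_prompt_params(prompt_len)
    deep = deep_prompt_params(prompt_len)
    print(
        f"prompt_len={prompt_len:3d}  "
        f"shallow={shallow:>9,d} ({100 * shallow / BACKBONE_PARAMS:.3f}%)  "
        f"deep={deep:>9,d} ({100 * deep / BACKBONE_PARAMS:.3f}%)"
    )
```

Under these assumptions, the deep variant ends up at a few tenths of a percent to about 1.5% of the backbone size, while the shallow variant stays far below 0.1%.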

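Optimization and implementation details:

**Reparameterization:** Prior work often uses a reparameterization encoder such as an MLP to transform the trainable embeddings. For NLU tasks, however, whether this helps depends on the task and dataset.

**Prompt Length:** Prompt length plays a critical role in P-Tuning v2. In general, simple classification tasks prefer shorter prompts (fewer than 20 tokens), while hard sequence labeling tasks prefer longer prompts (around 100 tokens).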

[Figure: effect of prompt length]

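**Multi-task Learning:** Multi-task learning is optional for P-Tuning v2, but it can further improve performance by providing a better initialization.

**Classification Head:** Rather than using an LM head to predict verbalizers, P-tuning v2 applies a randomly-initialized classification head on top of the tokens, as in BERT.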
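For the reparameterization item above, the following is a minimal pure-Python sketch of the two options: using the trainable prompt embeddings directly, or passing them through a small MLP first. The two-layer MLP with a tanh activation, the tiny dimensions, and all names are illustrative assumptions, not the authors' implementation.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols, scale=0.1):
    """Small random matrix standing in for trainable weights."""
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


PROMPT_LEN, EMBED_DIM, MLP_HIDDEN, OUT_DIM = 4, 8, 16, 8

# Trainable in both variants: one raw embedding per prompt position.
raw_prompts = rand_matrix(PROMPT_LEN, EMBED_DIM)

# Trainable only in the reparameterized variant: a two-layer MLP.
w1 = rand_matrix(EMBED_DIM, MLP_HIDDEN)
w2 = rand_matrix(MLP_HIDDEN, OUT_DIM)


def reparameterize(prompts):
    """Map raw prompt embeddings through the MLP (tanh in between)."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(prompts, w1)]
    return matmul(hidden, w2)


direct_prompts = raw_prompts                 # no reparameterization
mlp_prompts = reparameterize(raw_prompts)    # with reparameterization

print(len(direct_prompts), len(direct_prompts[0]))
print(len(mlp_prompts), len(mlp_prompts[0]))
```

Both variants produce a prompt matrix of the same shape. The difference lies only in what is optimized during training, which is why the choice can be treated as a per-task hyperparameter.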

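4. Experiments and Analysis

4.1 Experimental Setup

NLU Tasks: SuperGLUE.

BoolQ: a yes/no question answering task;

CB (CommitmentBank): a textual entailment task;

COPA (Choice of Plausible Alternatives): a causal reasoning task with two candidate choices;

MultiRC (Multi-Sentence Reading Comprehension): a true/false question answering task;

ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset): a QA task whose answers are entities, similar to NER;

RTE (Recognizing Textual Entailment): a textual entailment task;

WiC (Words in Context): decide whether a target word has the same meaning in two sentences;

WSC (The Winograd Schema Challenge): a coreference resolution task;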

Pre-trained Models: BERT-large, RoBERTa-large, DeBERTa-xlarge, GLM-xlarge/xxlarge

Multi-task Learning: named entity recognition (NER), (extractive) question answering (QA), semantic role labeling (SRL)

**NER (IOB2 format):** CoNLL03, OntoNotes 5.0, CoNLL04. The multi-task setting combines the training sets of the three datasets;

**QA:** SQuAD 1.1, SQuAD 2.0. The multi-task setting combines the training sets of SQuAD 1.1 and 2.0;

**SRL:** CoNLL05, CoNLL12. The multi-task setting combines the training sets of CoNLL05 and CoNLL12;
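The IOB2 format mentioned for NER tags the first token of each entity with `B-<type>`, the following tokens of the same entity with `I-<type>`, and everything else with `O`. A small sketch of converting entity spans to IOB2 tags follows; the sentence, the spans, and the helper name `spans_to_iob2` are invented for illustration.

```python
def spans_to_iob2(num_tokens, spans):
    """Convert (start, end_exclusive, label) spans into IOB2 tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence and entity spans.
tokens = ["Alice", "Smith", "visited", "New", "York"]
spans = [(0, 2, "PER"), (3, 5, "LOC")]

tags = spans_to_iob2(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence labeling head then predicts one such tag per token, which is why a per-token classification head fits this task more naturally than a verbalizer.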

4.2 Results

Comparison across model scales:

[Figure: results across model scales]

Comparison across tasks:
[Figure: results across tasks]

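NER in the multi-task setting: pre-train on the combined training sets of the three datasets. The continuous prompts are shared, and each dataset uses its own linear classifier.

QA in the multi-task setting: pre-train on the combined training sets of SQuAD 1.1 and 2.0. During pre-training, every question, regardless of its source, is assumed to be possibly unanswerable.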

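Ablation studies:

Verbalizer with LM head vs. [CLS] label with linear head

[Figure: verbalizer with LM head vs. [CLS] label with linear head]

Prompt depth: adding prompts in descending order (starting from the layers closest to the output) works better than adding them in ascending order.

[Figure: prompt depth ablation]

Judging from this figure, which layers should receive prompts depends heavily on the dataset: for RTE, adding prompts to layers 17-24 is enough, whereas for BoolQ, the more layers the better.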
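The two orderings in the prompt-depth ablation can be read as two ways of choosing which layers receive prompts for a given budget. The sketch below assumes a 24-layer model (consistent with the "layers 17-24" reading above) and 1-based layer indices; the function name `prompt_layers` is hypothetical.

```python
def prompt_layers(num_layers, num_prompt_layers, order="descending"):
    """Return the 1-based indices of the layers that receive prompts."""
    if not 0 < num_prompt_layers <= num_layers:
        raise ValueError("num_prompt_layers must be in 1..num_layers")
    if order == "descending":
        # Start from the layers closest to the output.
        first = num_layers - num_prompt_layers + 1
        return list(range(first, num_layers + 1))
    if order == "ascending":
        # Start from the layers closest to the input.
        return list(range(1, num_prompt_layers + 1))
    raise ValueError("order must be 'descending' or 'ascending'")


print(prompt_layers(24, 8, "descending"))
print(prompt_layers(24, 8, "ascending"))
```

With a budget of 8 layers, the descending choice selects layers 17 to 24 and the ascending choice selects layers 1 to 8.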

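5. Code

6. Summary

The results are encouraging, especially on NLU tasks. One advantage is that the pre-trained model does not need to be very large; another is that there is no need to store an extra full copy of the model for each task. In addition, the method uses a [CLS] label with a linear head in place of the classic verbalizer.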

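7. Knowledge Notes (key concepts, papers to read, quoted passages)

A verbalizer is a label-word mapping: it converts the prediction over vocabulary words at the **[MASK]** position into a classification label, for example {POLITICS: "politics", SPORTS: "sports"}.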
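The contrast between a verbalizer with an LM head and a [CLS] label with a linear head can be sketched in a few lines. All logits, vectors, and weights below are made-up illustration values, and the function names are hypothetical; a real model would produce them from its encoder.

```python
# Two ways to turn encoder outputs into a class label.
# All numbers below are made-up illustration values, not model outputs.

def verbalizer_predict(mask_logits, verbalizer):
    """LM-head route: compare the logits of the label words at [MASK]."""
    scores = {label: mask_logits[word] for label, word in verbalizer.items()}
    return max(scores, key=scores.get)


def linear_head_predict(cls_vector, weights, bias, labels):
    """Linear-head route: one weight row and one bias per class."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return labels[logits.index(max(logits))]


verbalizer = {"POLITICS": "politics", "SPORTS": "sports"}
mask_logits = {"politics": 1.2, "sports": 3.4, "the": 0.5, "music": 2.0}
print(verbalizer_predict(mask_logits, verbalizer))

labels = ["POLITICS", "SPORTS"]
cls_vector = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3], [0.4, -0.5, 0.6]]
bias = [0.0, 0.1]
print(linear_head_predict(cls_vector, weights, bias, labels))
```

The verbalizer route needs a hand-chosen word per label, whereas the linear head only needs the number of classes, which also makes it usable for per-token labeling.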

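Prompt tuning is the idea of tuning only the continuous prompts. Specifically, Liu et al. (2021b) and Lester et al. (2021) propose adding trainable continuous embeddings to the original sequence of input word embeddings.

Quoted passage: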

Deep prompt tuning increases the capacity of continuous prompts and closes the gap to fine-tuning across various settings, especially for small models and hard tasks.


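About the SuperGLUE tasks:

BoolQ

BoolQ is a question answering dataset of yes/no questions containing 15,942 examples. The questions are naturally occurring: they were generated in unprompted and unconstrained settings. Each example is a (question, passage, answer) triplet, with the page title as optional additional context.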

{"question": "is windows movie maker part of windows essentials",
"passage": "Windows Movie Maker – Windows Movie Maker (formerly known as Windows Live Movie Maker in Windows 7) is a discontinued video editing software by Microsoft. It is a part of Windows Essentials software suite and offers the ability to create and edit videos as well as to publish them on OneDrive, Facebook, Vimeo, YouTube, and Flickr.",
"idx": 2,
"label": true}
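As a small illustration of how such a record might be consumed, the sketch below parses a BoolQ-style JSON record and turns it into a (question, passage, label id) triplet for a sentence-pair classifier. The passage is abridged here, and the helper name `to_example` and the 0/1 label encoding are assumptions.

```python
import json

# A BoolQ-style record (passage abridged for brevity).
record = json.loads(
    '{"question": "is windows movie maker part of windows essentials", '
    '"passage": "Windows Movie Maker is a discontinued video editing software '
    'by Microsoft. It is a part of Windows Essentials software suite.", '
    '"idx": 2, "label": true}'
)


def to_example(rec):
    """Turn a BoolQ-style record into (text_a, text_b, label_id)."""
    # Assumed encoding: true -> 1, false -> 0.
    return rec["question"], rec["passage"], int(rec["label"])


text_a, text_b, label_id = to_example(record)
print(text_a)
print(label_id)
```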

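CB: CommitmentBank
CB is a corpus of short texts in which at least one sentence contains an embedded clause. Each embedded clause is annotated with the degree to which it is expected to be true. The resulting task is framed as three-class textual entailment, with examples drawn from the Wall Street Journal, fiction from the British National Corpus, and Switchboard. Each example consists of a premise containing an embedded clause, and the corresponding hypothesis is the extraction of that clause. SuperGLUE uses a subset of the data in which inter-annotator agreement is above 0.85. The data is imbalanced (relatively few neutral examples), so the evaluation metrics are accuracy and F1, where the multi-class F1 is the unweighted average of the per-class F1 scores.
In practice, CB is a textual entailment task: after reading the premise, the model decides whether the hypothesis is entailed by it, contradicted by it, or neutral.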

{"premise": "The Susweca. It means ''dragonfly'' in Sioux, you know. Did I ever tell you that's where Paul and I met?",
"hypothesis": "Susweca is where she and Paul met,",
"label": "entailment", "idx": 77}
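The unweighted multi-class F1 described for CB can be computed by taking the F1 of each class separately and averaging the three values. The gold labels and predictions below are invented to show how the metric behaves on imbalanced data; they are not results from the paper.

```python
def per_class_f1(gold, pred, label):
    """F1 for a single class, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted average of the per-class F1 scores."""
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


LABELS = ["entailment", "contradiction", "neutral"]
# Hypothetical gold labels and predictions.
gold = ["entailment", "entailment", "contradiction",
        "contradiction", "neutral", "entailment"]
pred = ["entailment", "contradiction", "contradiction",
        "contradiction", "entailment", "entailment"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1(gold, pred, LABELS):.3f}")
```

In this toy case accuracy is 4/6, but macro F1 is lower (about 0.49) because the rare neutral class is never predicted correctly and contributes an F1 of 0 with the same weight as the other classes.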

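COPA: Choice of Plausible Alternatives
This dataset represents a causal reasoning task in which the system is given a premise sentence and two possible alternatives. The system must choose the alternative that has the more plausible causal relationship with the premise. The method used to construct the alternatives ensures that causal reasoning is required to solve the task. Each example asks for either a possible cause or a possible effect of the premise sentence, accompanied by a simple question that disambiguates between the two instance types for the model.

Premise: I knocked on my neighbor's door.
What happened as a result?

Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.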

MultiRC: Multi-Sentence Reading Comprehension
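MultiRC is a true/false question answering task. Each example consists of a context paragraph, a question about that paragraph, and a list of possible answers to the question, each of which must be labeled true or false. Question answering is a common problem with many datasets.
MultiRC was chosen for the following reasons:
(1) each question can have multiple correct answers, so each question-answer pair must be evaluated independently of the other pairs;
(2) the questions are designed so that answering each one requires drawing facts from multiple context sentences;
(3) compared with span-based extractive QA, the question-answer pair format of this dataset better matches the API of the other SuperGLUE tasks.
The paragraphs are drawn from seven domains, including news, fiction, and historical text. The evaluation metrics are the macro-average F1 over the set of correct answers for each question (F1m) and the binary F1 over all answer options (F1a). For example, given the text: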

"text": "The rally took place on October 17, the shooting on
February 29. Again, standard filmmaking techniques are interpreted as
smooth distortion: “Moore works by depriving you of context and
guiding your mind to fill the vacuum – with completely false ideas.
It is brilliantly, if unethically, done.” As noted above, the “from
my cold dead hands” part is simply Moore’s way to introduce Heston.
Did anyone but Moore’s critics view it as anything else? He certainly
does not “attribute it to a speech where it was not uttered” and, as
noted above, doing so twice would make no sense whatsoever if Moore
was the mastermind deceiver that his critics claim he is. Concerning
the Georgetown Hoya interview where Heston was asked about Rolland,
you write: “There is no indication that [Heston] recognized Kayla
Rolland’s case.” This is naive to the extreme – Heston would not be
president of the NRA if he was not kept up to date on the most
prominent cases of gun violence. Even if he did not respond to that
part of the interview, he certainly knew about the case at that point.
Regarding the NRA website excerpt about the case and the highlighting
of the phrase “48 hours after Kayla Rolland is pronounced dead”:
This is one valid criticism, but far from the deliberate distortion
you make it out to be; rather, it is an example for how the facts can
sometimes be easy to miss with Moore’s fast pace editing. The reason
the sentence is highlighted is not to deceive the viewer into
believing that Heston hurried to Flint to immediately hold a rally
there (as will become quite obvious), but simply to highlight the
first mention of the name “Kayla Rolland” in the text, which is in
this paragraph. "

and the questions with their candidate answers:

{"question": "When was Kayla Rolland shot?", "answers": [{"text": "February 17", "idx": 168, "label": 0}, {"text": "February 29", "idx": 169, "label": 1}, {"text": "October 29", "idx": 170, "label": 0}, {"text": "October 17", "idx": 171, "label": 0}, {"text": "February 17", "idx": 172, "label": 0}], "idx": 26}, {"question": "Who was president of the NRA on February 29?", "answers": [{"text": "Charleton Heston", "idx": 173, "label": 1}, {"text": "Moore", "idx": 174, "label": 0}, {"text": "George Hoya", "idx": 175, "label": 0}, {"text": "Rolland", "idx": 176, "label": 0}, {"text": "Hoya", "idx": 177, "label": 0}, {"text": "Kayla", "idx": 178, "label": 0}], "idx": 27}
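The two MultiRC metrics can be illustrated with the two questions above. This sketch reads F1m as the average of per-question F1 scores and F1a as a single binary F1 over all answer options pooled together. The gold labels are taken from the example, while the predictions are invented for illustration.

```python
def binary_f1(gold, pred):
    """Binary F1 with label 1 as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)


# Gold labels follow the example above; predictions are hypothetical.
questions = {
    26: {"gold": [0, 1, 0, 0, 0], "pred": [0, 1, 0, 1, 0]},
    27: {"gold": [1, 0, 0, 0, 0, 0], "pred": [1, 0, 0, 0, 0, 0]},
}

# F1m: compute F1 per question, then average over questions.
f1_m = sum(binary_f1(q["gold"], q["pred"]) for q in questions.values()) / len(questions)

# F1a: pool every answer option and compute one binary F1.
all_gold = [g for q in questions.values() for g in q["gold"]]
all_pred = [p for q in questions.values() for p in q["pred"]]
f1_a = binary_f1(all_gold, all_pred)

print(f"F1m = {f1_m:.3f}")
print(f"F1a = {f1_a:.3f}")
```

The single wrong extra answer on question 26 lowers that question's F1 to 2/3, which gives F1m of about 0.833, while the pooled F1a is 0.8.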