5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现-Toy模板网

这篇具有很好参考价值的文章主要介绍了5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

本文将介绍大语言模型中使用的不同令牌遮蔽技术，并比较它们的优点，以及使用Pytorch实现以了解它们的底层工作原理。

令牌掩码Token Masking是一种广泛应用于语言模型分类变体和生成模型训练的策略。BERT语言模型首先使用，并被用于许多变体(RoBERTa, ALBERT, DeBERTa…)。

而Text Corruption是一种更大的令牌遮蔽策略。在BART研究论文中，进行了大量实验来训练具有不同策略的编码器-解码器生成模型。

5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现,pytorch,人工智能,python,深度学习,大语言模型

在进入正题之前，我们先介绍大型语言模型(llm)中掩码策略的背景

从监督到自监督

语言模型的初始训练中使用了大量文本，其目标是使模型学会正确地表示语言，并将这种知识隐式地存储在其参数权重中。

大量的文本必须具有用于训练的标签，因为必须在处理模型输入数据并使用参考数据之后计算损失（交叉熵）。但是注释如此大量的数据是不可行的。所以智能将问题从监督学习变为自动生成标签的自监督问题。

在这种情况下，被破坏的文本序列作为模型的训练输入，而所有或部分原始序列作为训练数据的标签。这样通过自动生成的标签，模型学习与每个训练示例关联的标签，就不需要手动的注释数据。

在Text Corruption中（特别是在Token Masking、Token Deletion和Text Infilling中），每个单词可能会按照固定概率（通常约为15-20%）进行遮蔽。这个概率保持较低，以便模型即使在序列被损坏的情况下也能学习每个句子的上下文。

还有一些技术，如Sentence Permutation 或Document Rotation，不会专注于按照一定概率遮蔽单词，我们后面会介绍。

在训练语言模型时，标签会根据是分类模型（仅编码器）还是生成模型（编码器-解码器）而变化。在分类模型中，使用的标签只关注输入中被遮蔽的区域。因此如果一个词在整个句子中被屏蔽，标签只是单个单词。而对于生成模型，由于模型必须能够连续生成文本，输出标签是初始未损坏的序列，关注整个序列本身。

环境配置

我们已经简要介绍了使用Text Corruption训练语言模型的一些背景知识，下面我们开始使用示例代码来介绍不同的Text Corruption技术。

我们将使用Stanza，一个由斯坦福NLP开发的库，其中包含不同的NLP工具，这些工具对我们的预处理非常有用。

 importstanza
 stanza.download('en')
 
 # Text used in our examples
 text="Huntington's disease is a neurodegenerative autosomal disease 
 resultsduetoexpansionofpolymorphicCAGrepeatsinthehuntingtingene. 
 Phosphorylationofthetranslationinitiationfactor4E-BPresultsinthe
 alterationofthetranslationcontrolleadingtounwantedproteinsynthesis
 andneuronalfunction. Consequencesofmutanthuntington (mhtt) gene
 transcriptionarenotwellknown. Variabilityofageofonsetisan
 importantfactorofHuntington's disease separating adult and juvenile types. 
 Thefactorswhicharetakenintoaccountare-geneticmodifiers, maternal
 protectioni.eexcessivepaternaltransmission, superiorageinggenes
 andenvironmentalthreshold. Amajorfocushasbeengiventothemolecular
 pathogenesiswhichincludes-motordisturbance, cognitivedisturbanceand
 neuropsychiatricdisturbance. Thediagnosisparthasalsobeentakencareof. 
 Thisincludesgenetictestingandbothprimaryandsecondarysymptoms. 
 ThepresentreviewalsofocusesonthegeneticsandpathologyofHuntington's 
 disease."
 
 
 # We will use a stanza model for getting each different sentence 
 # as an element of the list
 nlp=stanza.Pipeline('en', use_gpu=False)
 doc=nlp(text)
 sentences= [sentence.textforsentenceindoc.sentences]

Token Masking

令牌掩码用替换文本中的随机单词

5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现,pytorch,人工智能,python,深度学习,大语言模型

这是从BERT引入的策略，它包括通过屏蔽随机单词来破坏输入序列，这些单词将在训练期间用作输出标签。

在分类模型中，我们可以直接使用Huggingface的DataCollatorForLanguageModeling类来生成必要的标签，这样就可以训练像BERT或RoBERTa这样的模型。

 fromtransformersimportAutoTokenizer, DataCollatorForLanguageModeling
 importtorch
 
 defload_dataset_mlm(sentences, tokenizer_class=AutoTokenizer, 
                      collator_class=DataCollatorForLanguageModeling, 
                      mlm=True, mlm_probability=0.20):
     tokenizer=tokenizer_class.from_pretrained('google-bert/bert-base-uncased')
     inputs=tokenizer(sentences, return_tensors='pt', padding=True, 
                        truncation=True)
     
     # Random masking configuration
     data_collator=collator_class(
         tokenizer=tokenizer, 
         mlm=mlm,  
         mlm_probability=mlm_probability
     )
 
     """The collator expects a tuple of tensors, so you have to split 
     the input tensors and then remove the first dimension and pass it 
     to a tuple. """
     tuple_ids=torch.split(inputs['input_ids'], 1, dim=0)
     tuple_ids=list(tuple_ids)
     fortensorinrange(len(tuple_ids)):
         tuple_ids[tensor] =tuple_ids[tensor].squeeze(0)
     tuple_ids=tuple(tuple_ids)
     
     # Get input_ids, attention_masks and labels for each sentence.
     batch=data_collator(tuple_ids)
     returnbatch['input_ids'], inputs['attention_mask'], batch['labels']
 
 
 input_ids, attention_mask, labels=load_dataset_mlm(sentences)
 
 """
 input_ids[0]:
 tensor([  101, 16364,  1005,  1055,   103,  2003,  1037,   103, 10976,  3207,
           103, 25284,   103, 25426, 16870,  4295,  3463,  2349,  2000,   103,
          1997, 26572, 18078,  6187,  2290, 17993,  1999,  1996,  5933,  7629,
           103,   103,   102,     0,     0])
 
 attention_mask[0]:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
 
 labels[0]:
 tensor([ -100,  -100,  -100,  -100,  4295,  -100,  -100, 11265,  -100,  -100,
          6914,  -100,  8285,  -100,  2389,  -100,  -100,  -100,  -100,  4935,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          4962,  1012,  -100,  -100,  -100]))
 
 """

生成的inputs_ids对原始文本的每个标记都是整数。一个特殊的标记表示被屏蔽的单词(在BERT中，这个标记是103)。这个特殊的标记根据所使用的语言模型而变化，因此不同的标记器将返回注意掩码的不同标识符。

Huggingface还在模型中使用不同的操作分配唯一的令牌，因此用“-100”表示的令牌表示模型应该忽略它们。

对于像BART这样的生成模型，我们可以使用DataCollatorForLanguageModeling类实现令牌屏蔽策略。但是需要一些小的更改，以使标记适应生成模型。

 fromtransformersimportBartTokenizer, DataCollatorForLanguageModeling
 importtorch
 
 defload_dataset_mlm(sentences, tokenizer_class=BartTokenizer, 
                      collator_class=DataCollatorForLanguageModeling, 
                      mlm=True, mlm_probability=0.20):
     tokenizer=tokenizer_class.from_pretrained('facebook/bart-base')
     inputs=tokenizer(sentences, return_tensors='pt', padding=True, 
                        truncation=True)
     
     # Random masking configuration
     data_collator=collator_class(
         tokenizer=tokenizer, 
         mlm=mlm,  # True for Masked Language Modelling
         mlm_probability=mlm_probability  # Chance for every token to get masked
     )
 
     """The collator expects a tuple of tensors, so you have to split 
     the input tensors and then remove the first dimension and pass it 
     to a tuple. """
     tuple_ids=torch.split(inputs['input_ids'], 1, dim=0)
     tuple_ids=list(tuple_ids)
     fortensorinrange(len(tuple_ids)):
         tuple_ids[tensor] =tuple_ids[tensor].squeeze(0)
     tuple_ids=tuple(tuple_ids)
     
     # Get input_ids, attention_masks and labels for each sentence.
     batch=data_collator(tuple_ids)
     batch['labels'] =inputs['input_ids']
     returnbatch['input_ids'], inputs['attention_mask'],  batch['labels']
 
 input_ids, attention_mask, labels=load_dataset_mlm(sentences)
 
 """
 input_ids[0]:
 tensor([    0, 38831,  2577,  1054,    18,  2199,    16,    10, 14913, 28904,
          5777,  3693, 32226, 38868,  2199,   775,   528,     7,  2919,     9,
         48052,   636,   230,  3450, 35315,    11,     5, 50264, 50264, 50264,
             4,     2])
 
 attention_mask[0]:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1])
 
 labels[0]:
 tensor([    0, 38831,  2577,  1054,    18,  2199,    16,    10, 14913, 28904,
          5777,  3693, 32226, 38868,  2199,   775,   528,     7,  2919,     9,
         48052,   636,   230,  3450, 35315,    11,     5,  8217, 24276, 10596,
             4,     2])
 """

每个输入标记都标记着与之对应的标记，无论它是否被屏蔽。这是因为与分类模型不同，模型必须能够基于给定给模型的序列生成文本序列。在BART的情况下，表示屏蔽的标记的ID是50264。

Token Deletion

使用标记删除 Token Deletion，模型必须学习确切的位置和缺失的词是什么，因此它必须比仅使用Token Masking学习更多的特征。

5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现,pytorch,人工智能,python,深度学习,大语言模型

这种策略使用了一种不同的屏蔽方法。以一定的概率一个词从原始文本序列中被移除，因此模型必须找到缺失的单词及其位置。标准的屏蔽方法不会学习位置，因为屏蔽已经在模型的输入中指示

 deftoken_deletion(sentences, tokenizer_class=BartTokenizer, collator_class=DataCollatorForLanguageModeling, 
                  mlm=True, mlm_probability=0.20):
     tokenizer=tokenizer_class.from_pretrained('facebook/bart-base')
     inputs=tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
     
     data_collator=collator_class(
         tokenizer=tokenizer, 
         mlm=mlm,
         mlm_probability=mlm_probability
     )
 
     tuple_ids=torch.split(inputs['input_ids'], 1, dim=0)
     tuple_ids=list(tuple_ids)
     fortensorinrange(len(tuple_ids)):
         tuple_ids[tensor] =tuple_ids[tensor].squeeze(0)
     tuple_ids=tuple(tuple_ids)
 
     batch=data_collator(tuple_ids)
 
     # We use the initial inputs as labels
     batch['labels'] =batch['input_ids'].clone()
     
     # We remove tokens with mask identifier and thus make token deletion
     # Change the value to the mask identifier of the specific token model
     # It is necessary to know the identifier of the mask token for 
     # that specific model
     mask=batch['input_ids'] !=50264
     initial_size=batch['input_ids'].size(1)
     total_sentences=batch['input_ids'].size(0)
 
     # When we remove the specific token, we must fill with the padding 
     # token otherwise the tensor size is not respected.
     foriinrange(total_sentences):
         new_tensor=batch['input_ids'][i][mask[i]]
         new_tensor=F.pad(new_tensor, (0, initial_size-new_tensor.size(0)), value=1)
         batch['input_ids'][i] =new_tensor
         attention_mask=batch['input_ids'][i] ==1
         inputs['attention_mask'][i][attention_mask] =0
         
     returnbatch['input_ids'], inputs['attention_mask'], batch['labels']
 
 input_ids, attention_mask, labels=token_deletion(sentences)
 
 """
 input_ids[0]:
 tensor([    0, 38831,  2577,  1054,  2199, 14913, 28904,  3693, 32226, 38868,
          2199,   775,   528,     7,  2919,     9, 23404,   636,   230, 35315,
            11,     5, 24276, 10596,     4,     2,     1,     1,     1,     1,
             1,     1])
 
 attention_mask[0]:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 0, 0, 0, 0, 0, 0])
 
 labels[0]:
 tensor([    0, 38831,  2577,  1054, 50264,  2199, 50264, 50264, 14913, 28904,
         50264,  3693, 32226, 38868,  2199,   775,   528,     7,  2919,     9,
         23404,   636,   230, 50264, 35315,    11,     5, 50264, 24276, 10596,
             4,     2])
 
 """

当使用Token Deletion训练BART时，长序列用于问答、摘要生成任务和会话任务会有一定的提高。

Text Infilling

文本填充 Text Infilling允许模型学习每个屏蔽位置可以有多少个单词。而先前的方法假设每个屏蔽位置只有一个单词。

5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现,pytorch,人工智能,python,深度学习,大语言模型

Text Infilling与Token Masking类似，因为我们会以一定的概率在原始文本上使用屏蔽。但是不同之处在于屏蔽可以覆盖多个单词。在BART中，屏蔽是用泊松分布 lambda = 3 进行的；这意味着平均而言，每次对句子中的文本进行屏蔽时，会有三个单词被包含在一个单个的标记中，但由于这是一个概率分布，可能会有更多或更少的屏蔽单词。

5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现,pytorch,人工智能,python,深度学习,大语言模型

我们将使用Numpy库和特定于我们的语言模型(在本例中是BART)的标记器来实现文本填充。

 importnumpyasnp
 fromtransformersimportBartTokenizer
 
 deftext_infilling(sentence, probability=0.2, poisson_lambda=3):
     # We'll use a binary mask to determine which words to replace
     mask=np.random.choice([0, 1], size=len(sentence), p=[1-probability, probability])
 
     # Now we'll replace the chosen words with a mask token
     # We'll also use a Poisson distribution to determine the length of the spans to mask
     foriinrange(len(mask)):
         ifmask[i] ==1:
             span_length=np.random.poisson(poisson_lambda)
             forjinrange(span_length):
                 ifi+j<len(sentence):
                     sentence[i+j] ="<mask>"
 
     infilled_sentence= []
     fortokeninrange(len(sentence)):
         ifsentence[token] =="<mask>":
             iftoken<len(sentence)-1:
                 ifsentence[token+1] =="<mask>":
                     continue
                 else:
                     infilled_sentence.append(sentence[token])
             else:
                 infilled_sentence.append(sentence[token])
         else:
             infilled_sentence.append(sentence[token])
     return" ".join(infilled_sentence)
 
 deftext_infilling_input(masked_sentences, sentences, tokenizer_class=BartTokenizer):
     tokenizer=tokenizer_class.from_pretrained('facebook/bart-base')
     inputs=tokenizer(masked_sentences, return_tensors='pt', padding=True, truncation=True)
     labels=tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
     returninputs['input_ids'], inputs['attention_mask'], labels['input_ids']
 
 input_ids, attention_mask, labels=text_infilling_input(masked_sentences, sentences)
 
 """
 input_ids[0]:
 tensor([    0, 50264,    16, 50264,  2199,   775,   528, 50264, 48052,   636,
         50264,  8217, 24276, 10596,     4,     2,     1,     1,     1,     1,
             1,     1,     1])
 
 attention_mask[0]:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
 
 labels[0]:
 tensor([    0, 38831,  2577,  1054,    18,  2199,    16,    10, 14913, 28904,
          5777,  3693, 32226, 38868,  2199,   775,   528,     7,  2919,     9,
         48052,   636,   230,  3450, 35315,    11,     5,  8217, 24276, 10596,
             4,     2])
 
 """

Text Infilling比Token Deletion更能改善BART语言模型的结果，在问题回答、文本摘要和会话任务中提供更好的生成。

Sentence Permutation

语言模型的输入文本被分成随机重新排列的句子，模型需要找出原始的顺序。

5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现,pytorch,人工智能,python,深度学习,大语言模型

在Sentence Permutation中，考虑适合模型输入序列的句子数量是至关重要的(在小型模型中，输入序列在512到1024之间)。在确定符合序列的句子数量之后，需要将它们分离到一个列表或数组中，并随机选择，而不重复其中任何一个。

 # It selects the first "number_sentences" within a given set of "sentences" 
 # and returns those sentences in a random order.
 defsentence_permutation(sentences, number_sentences):
     new_sentences=sentences[:number_sentences]
     random.shuffle(new_sentences)
     new_sentences=sentence_joiner(new_sentences)
     returnnew_sentences
 
 defpermuted_data_generation(sentences: list, total_sentences: int):
     training_sentences= []
     training_labels= []
     sentences_copy=sentences.copy()
     # We can apply sentence_permutation a number of times equal to the 
     # size of the list - 1 to get an example with each new sentence in 
     # the text, removing the oldest one.
     for_inrange(len(sentences)-total_sentences+1):
         new_sentences=sentence_permutation(sentences_copy, total_sentences)
         joined_sentences=sentence_joiner(sentences_copy[:total_sentences])
         sentences_copy=sentences_copy[1:]
         training_sentences.append(new_sentences)
         training_labels.append(joined_sentences)
 
     returntraining_sentences, training_labels
 
 
 defpermutation_training(sentences: list, sentences_labels: list, 
                          tokenizer_class=BartTokenizer, 
                          collator_class=DataCollatorForLanguageModeling, 
                         mlm=True, mlm_probability=0.0):
     # We get input_ids and attention mask from the permuted sentences
     input, attention_mask, _=load_dataset_mlm(sentences, tokenizer_class, collator_class, mlm, mlm_probability)
     
     # Labels from the original sentences
     labels, _, _=load_dataset_mlm(sentences_labels, tokenizer_class, collator_class, mlm, mlm_probability)
 
     returninput.squeeze(0), attention_mask.squeeze(0), labels.squeeze(0)
 
 input_ids, attention_mask, labels=permutation_training(training_sentences, training_labels_sentences)
 
 """
 input_ids[0]:
 tensor([    0, 38831,  2577,  1054,    18,  2199,    16,    10, 14913, 28904,
          5777,  3693, 32226, 38868,  2199,   775,   528,     7,  2919,     9,
         48052,   636,   230,  3450, 35315,    11,     5,  8217, 24276, 10596,
             4,  2585, 33430,  8457,     9, 41419,  8217,  1054,    36,   119,
         49491,    43, 10596, 37118,    32,    45,   157,   684,     4,  4129,
         33839,  4405, 35019,     9,     5, 19850, 34939,  3724,   204,   717,
            12, 21792,   775,    11,     5, 39752,     9,     5, 19850,   797,
           981,     7, 15067,  8276, 37423,     8, 46282,  5043,     4,     2])
 
 attention_mask[0]:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1])
 
 labels[0]:
 tensor([    0, 38831,  2577,  1054,    18,  2199,    16,    10, 14913, 28904,
          5777,  3693, 32226, 38868,  2199,   775,   528,     7,  2919,     9,
         48052,   636,   230,  3450, 35315,    11,     5,  8217, 24276, 10596,
             4,  4129, 33839,  4405, 35019,     9,     5, 19850, 34939,  3724,
           204,   717,    12, 21792,   775,    11,     5, 39752,     9,     5,
         19850,   797,   981,     7, 15067,  8276, 37423,     8, 46282,  5043,
             4,  2585, 33430,  8457,     9, 41419,  8217,  1054,    36,   119,
         49491,    43, 10596, 37118,    32,    45,   157,   684,     4,     2])
 
 """

我们对于模型的每个数据输入，删除原始序列中出现的第一个句子，然后在执行基于要选择的固定句子数目的句子排列之前，将接下来的句子添加进去。这样虽然重新排列了输入序列中的句子，但保持了一个每个新例子中都会出现一个新的句子，并删除最旧的句子的上下文窗口。

Document Rotation

当旋转一个文档时，选择一个特定的词，并将其设定为起始词，而所有之前的词都被粘贴到文本的末尾。

5种常用于LLM的令牌遮蔽技术介绍以及Pytorch的实现,pytorch,人工智能,python,深度学习,大语言模型

如果要应用Document Rotation，必须考虑到每个批次使用的维度。在应用填充的情况下，这个填充不能与文档的其余部分一起旋转，而是必须保持其原始位置，同时整个文档旋转。

 defsentence_joiner(sentences: list):
   return' '.join(sentences)
 
 # With this function we gather as many sentences as we want to form the input data to the tokenizer.
 defrotated_data_generation(sentences: list, total_sentences: int):
   training_sentences= []
   sentences_copy=sentences.copy()
   for_inrange(len(sentences)-total_sentences+1):
     new_sentences=sentences_copy[:total_sentences]
     new_sentences=sentence_joiner(new_sentences)
     sentences_copy=sentences_copy[1:]
     training_sentences.append(new_sentences)
   returntraining_sentences
 
 # Apply this function over the rotated sentences from previous function
 defdocument_rotation_training(sentences, tokenizer_class=BartTokenizer):
   tokenizer=tokenizer_class.from_pretrained('facebook/bart-base')
   tokens=tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
   tokens['input_ids'] =tokens['input_ids'].squeeze(0)
   tokens['labels'] =tokens['input_ids'].clone()
 
   iterations=tokens['input_ids'].size(0)
   foriinrange(iterations):
     # Get the attention mask and convert to list
     attention_mask=tokens['attention_mask'][i].tolist()
     # Calculate the position where padding starts
     if0inattention_mask:
       padding_start_position=attention_mask.index(0)
     else:
       padding_start_position=False
     # We take into account the position of the padding so as not to rotate it along with the rest of the document.
     ifpadding_start_position:
       random_token=torch.randint(1, padding_start_position-1, (1,))
       tokens['input_ids'][i] =torch.cat((tokens['input_ids'][i][0].unsqueeze(0), #initial token
                                       tokens['input_ids'][i][random_token.item():padding_start_position-1], #from random to padding
                                       tokens['input_ids'][i][1:random_token.item()], #from 1 to random
                                       tokens['input_ids'][i][padding_start_position-1:-1],
                                       tokens['input_ids'][i][-1].unsqueeze(0)), 0)
                                         
     # If there is no padding, we rotate the document without taking the padding into account.
     else:
       random_token=torch.randint(1, tokens['input_ids'].size(0)-1, (1,))
       tokens['input_ids'][i] =torch.cat((tokens['input_ids'][i][0].unsqueeze(0), #initial token
                                       tokens['input_ids'][i][random_token.item():-1], #from random to end
                                       tokens['input_ids'][i][1:random_token.item()],
                                       tokens['input_ids'][i][-1].unsqueeze(0)), 0)
   returntokens['input_ids'], tokens['attention_mask'].squeeze(0), tokens['labels']
 
 data=rotated_data_generation(sentences, 3)
 input_ids, attention_mask, labels=document_rotation_training(data)
 
 """
 input_ids[2]:
 tensor([    0,  2433,    61,    32,   551,    88,  1316,    32,    12,  4138,
         15557, 47605,     6, 22835,  2591,   939,     4,   242, 10079, 38422,
          9235,     6, 10295, 22540, 14819,     8,  3039, 11543,     4,   347,
         37347,  8457,     9, 41419,  8217,  1054,    36,   119, 49491,    43,
         10596, 37118,    32,    45,   157,   684,     4, 41058,  4484,     9,
          1046,     9, 23808,    16,    41,   505,  3724,     9, 18073,    18,
          2199, 18246,  4194,     8, 13430,  3505,     4,    20,     2,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1])
 
 attention_mask[2]:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0])
 
 labels[2]:
 tensor([    0,   347, 37347,  8457,     9, 41419,  8217,  1054,    36,   119,
         49491,    43, 10596, 37118,    32,    45,   157,   684,     4, 41058,
          4484,     9,  1046,     9, 23808,    16,    41,   505,  3724,     9,
         18073,    18,  2199, 18246,  4194,     8, 13430,  3505,     4,    20,
          2433,    61,    32,   551,    88,  1316,    32,    12,  4138, 15557,
         47605,     6, 22835,  2591,   939,     4,   242, 10079, 38422,  9235,
             6, 10295, 22540, 14819,     8,  3039, 11543,     4,     2,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1])
 
 """

类似于序列排列，我们可以在每个数据输入中移除最旧的句子，并添加一个新句子，从而保持上下文窗口。