自然语言基础 IMDB下的 MLM (掩码模型) & Bert Fine-tuning (模型微调)

这篇具有很好参考价值的文章主要介绍了自然语言基础 IMDB下的 MLM (掩码模型) & Bert Fine-tuning (模型微调)。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

自然语言基础 IMDB下的 MLM (掩码模型) & Bert Fine-tuning (模型微调)

本文是Hugging Face 上 NLP的一篇代码教程，通过imdb数据集， Fine-tuning微调 Bert预训练模型。

涉及关键词包括: MLM, Bert, Fine-tuning, IMDB, Huggingface Repo

1.Fine-tuning

微调的方式是通过调整训练模型的学习率来重新训练模型，这个来自早期 ACL 2018的一篇paper：

《Universal Language Model Fine-tuning for Text Classification》

该文章大致提出以下 fine-tuning方法, 这里需要注意的是，文章的背景模型是在RNN下完成的，这类模型层数少，且数据是一个序列，即输入和输出存在一个时间步T：

discriminative fine-tuning

不同层设置不同的学习率，最后一层学习率最高，之后前一层的学习率经验性地比后一层降低 2.6倍

slanted traingular learning rates

学习率随着迭代次数变动，在200iterations从 0.002 升高到 0.01，之后逐步下降，其实逐步下降的过程类似当前的退火算法

gradual unfreezing

只训练一部分层，比如 CV 分类任务就是添加 2个blocks (FC+BN+Dropout) 并只训练这两个blocks 完成的

文章提出对pooling层的输出进行 concat, 把每一时间步的hiddens state 按 maxpool 和 meanpool 进行拼接, 在拼接当前时间步的hiddens state 作为最终的时间步

因为一次微调所有层会造成灾难性遗忘, 这里通过先训练最后一层（输出层），再逐步解冻前面的层来让训练稳定并逐步收敛

2. DistilBERT

DistilBERT 是通过知识蒸馏算法(Knowledge Distillation), 用大参数版本的Bert 作为 "教师“ 去训练一个小参数版本的 “学生” 模型。

学生模型的参数小很多但性能并没有明显的降低。

显示参数:

from transformers import AutoModelForMaskedLM,AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint, cache_dir='./cache/')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

# '>>> DistilBERT number of parameters: 67M'
# '>>> BERT number of parameters: 110M'

处理文本

text = "This is a great [MASK]."

inputs = tokenizer(text, return_tensors="pt") 
# {'input_ids': tensor([[ 101, 2023, 2003, 1037, 2307,  103, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

print(inputs.word_ids(0)) # [None, 0, 1, 2, 3, 4, 5, None], 一个sentensce中 word的绝对位置


token_logits = model(**inputs).logits
# torch.Size([1, 8, 30522])

# Find the location of [MASK] and extract its logits

mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1] # 返回的第一个值是0，第二个才是 mask_token_id
mask_token_logits = token_logits[0, mask_token_index, :] # [1, 30522]， mask位置上 30522个词的概率


# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist() # 选择mask位置上 30522个词中概率最高的前5个
#torch.return_types.topk(values=tensor([[7.0727, 6.6514, 6.6425, 6.2530, 5.8618]], grad_fn=<TopkBackward0>), indices=tensor([[3066, 3112, 6172, 2801, 8658]]))

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'") # 把这概率最高的 5个tokens 的 id （indices） 解码为单词
# # '>>> This is a great deal.'
# # '>>> This is a great success.'
# # '>>> This is a great adventure.'
# # '>>> This is a great idea.'
# # '>>> This is a great feat.'

3. Imdb 数据集

3.1 Overview

这个英文名叫 Large Movie Review Dataset, 即 IMDB, 包括 100,000 条影评，分别是训练集(train): 25,000条，测试集(test): 25,000条，无标签集(unsupervised): 50,000条。

训练集和测试集有标签，0代表负面评价，1代表正面评价，无标签集中的标签统一都是-1。

imdb_dataset


imdb_dataset = load_dataset("imdb", cache_dir = './datasets')
print(imdb_dataset)

# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     unsupervised: Dataset({
#         features: ['text', 'label'],
#         num_rows: 50000
#     })
# })

# print(len(imdb_dataset['train']['text'])) # 25,000个字符串，内容是电影的reviews
# print(imdb_dataset['train']['label'][:10]) # 25,000个标签的前10个，0 denotes a negative review, while a 1 corresponds to a positive one.

影评为字符串形式，三个集合的第一个元素如下：


>imdb_dataset['train']['text'][0]

"I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn't have much of a plot."

>imdb_dataset['test']['text'][0]

"I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say "Gene Roddenberry's Earth..." otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again."

>imdb_dataset['unsupervised']['text'][0]

"This is just a precious little diamond. The play, the script are excellent. I cant compare this movie with anything else, maybe except the movie "Leon" wonderfully played by Jean Reno and Natalie Portman. But... What can I say about this one? This is the best movie Anne Parillaud has ever played in (See please "Frankie Starlight", she's speaking English there) to see what I mean. The story of young punk girl Nikita, taken into the depraved world of the secret government forces has been exceptionally over used by Americans. Never mind the "Point of no return" and especially the "La femme Nikita" TV series. They cannot compare the original believe me! Trash these videos. Buy this one, do not rent it, BUY it. BTW beware of the subtitles of the LA company which "translate" the US release. What a disgrace! If you cant understand French, get a dubbed version. But you'll regret later :)"

3.2 分块

tokenized_datasets

现将样本tokenize, 并增加一个key用于保存句子中的各单词位置(“word_ids”), 首位标签的位置id是 none.


def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

tokenized_datasets = imdb_dataset.map(tokenize_function, batched=True, remove_columns=["text", "label"])

# DatasetDict({
#     train: Dataset({
#         features: ['attention_mask', 'input_ids', 'word_ids'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['attention_mask', 'input_ids', 'word_ids'],
#         num_rows: 25000
#     })
#     unsupervised: Dataset({
#         features: ['attention_mask', 'input_ids', 'word_ids'],
#         num_rows: 50000
#     })
# })

对于自回归 (auto-regressive) 和 MLM (masked language modeling)，通常共同的预处理步骤是将所有句子拼接 (concatenate) 起来。

之后再将整个语料库(corpus) 分割为大小相等 (equal size) 的块(chunk). 不直接编码能保证过长的样本不会因为被截断而丢失信息。

chunk

这个函数的参数是simples字典，先完成拼接，再根据 chunk_size 分为多个chunks，最后将一个新的key即 labels 用于拷贝 input_ids的值, 作为input_ids被masked后的标签。

def group_texts(examples): # chunk to 128 dim
    
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}

    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)

3.3 Mask

这里主要是将训练样本中，每个sequence的部分单词随机用 [mask] 替代一部分，一般为 15%-20%, 这里教程中有三个方法与之相关

insert_random_mask

列表通过 data_collator()转化为字典，字典的key加上加新字符"masked"

from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())] #这个 features 是一个list，里面有 batch_sizes个字典, keys: 'input_ids', 'attention_mask', 'labels'
    masked_inputs = data_collator(features) # 列表转化为 字典, 同时加random mask
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()} # 字典的keys加上 "masked" + 'input_ids', 'attention_mask', 'labels'

batch_size = 12
downsampled_dataset = lm_datasets["train"].train_test_split(train_size=1000, test_size=100, seed=42)
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])

train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

for batch in train_dataloader:
    y2=insert_random_mask(batch)
    print(y2)
    break

default_data_collator

去掉 key 为 “word_ids” 的键值，在key 的 labels中 [mask]部分的值和"inout_ids"相同， labels全部负值为 -100

import collections
from transformers import default_data_collator

wwm_probability = 0.2

def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

samples = [lm_datasets["train"][i] for i in range(2)] # 列表的形式
batch = whole_word_masking_data_collator(samples) # 去掉 key 为 "word_ids" 的键值，[mask]部分的值和"inout_ids"相同， labels全部负值为 -100

4. Fine-tuning DistilBERT

4.1 Setting


from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
# model_name = model_checkpoint.split("/")[-1]

output_dir='./output_dir'

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

4.2 Accelerator Training

这个 Accelerator 类主要是方便多个GPU 并行训练

from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import default_data_collator
from tqdm.auto import tqdm
import torch
import math

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size, collate_fn=default_data_collator)

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

# Upload
# from huggingface_hub import get_full_repo_name
# repo_name = get_full_repo_name(model_name)
# print(repo_name) # disanda/distilbert-base-uncased-finetuned-imdb-accelerate
# repo = Repository(model_name, clone_from=repo_name)
# if accelerator.is_main_process:
#     tokenizer.save_pretrained(output_dir)
#     repo.push_to_hub(commit_message=f"Training in progress epoch {epoch}", blocking=False)

4.3 Easy Training

这个是通过现成的Trainer类来完成训练，加载对应的参数即可

另外模型训练后可以通过参数上传保存到 huggingface，这里需要:

登陆 hugginface 账号, 同时安装 git-lfs

huggingface-cli login

其输入在个人账号中 Settings 下的 Access Tokens 获得,
注意要申请一个新的 write token，原token只能用于read.
上传时需要调用 push_to_hub

import math
from transformers import Trainer
from huggingface_hub import Repository

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

trainer.train()

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}") #通过loss比较微调前后的交叉相似度

trainer.push_to_hub() # 上传模型

5. Using Fine-tuned model

就是加载预训练模型并使用的过程文章来源地址https://www.toymoban.com/news/detail-552405.html

import torch
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM

#model_pipe = pipeline("fill-mask", model="./output_dir")

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained('./output_dir')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text = "This is a great [MASK]."
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
print(token_logits.shape)
print(inputs)
print(tokenizer.mask_token_id)
      
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1] # Find the location of [MASK] and extract its logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist() # Pick the [MASK] candidates with the highest logits

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

# '>>> This is a great film.'
# '>>> This is a great movie.'
# '>>> This is a great idea.'
# '>>> This is a great one.'
# '>>> This is a great comedy.'