Concepts
Dataset:
1. SetFit/emotion (Hugging Face)
Model:
1. microsoft/deberta-v3-large (Hugging Face)
(Note: I downloaded both the dataset and the model to local disk.)
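Since everything is on local disk, here is a minimal sketch of loading both artifacts from local paths (the directory layout and the JSON Lines file names are assumptions about my setup, so adjust them to wherever you placed the downloads):

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Point at the locally cloned model directory instead of the Hub id.
tokenizer = AutoTokenizer.from_pretrained("./deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained("./deberta-v3-large", num_labels=6)

# load_dataset can also read local files directly, e.g. the split files
# downloaded from the SetFit/emotion repo (assumed here to be JSON Lines).
emotions = load_dataset("json", data_files={
    "train": "emotion/train.jsonl",
    "validation": "emotion/validation.jsonl",
    "test": "emotion/test.jsonl",
})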
Fine-tuning
Code:
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the locally downloaded tokenizer and model; the emotion dataset has 6 labels.
tokenizer = AutoTokenizer.from_pretrained("deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained("deberta-v3-large", num_labels=6)

emotions = load_dataset("emotion")

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

# batch_size=None tokenizes each split as one batch, so padding is uniform per split.
tokenized_emotions = emotions.map(tokenize, batched=True, batch_size=None)
tokenized_emotions.set_format("torch", columns=["input_ids", "attention_mask", "label"])

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(output_dir="result")
trainer = Trainer(model=model,
                  args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=tokenized_emotions["train"],
                  eval_dataset=tokenized_emotions["validation"])
trainer.train()
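Only output_dir is set above, so every other hyperparameter falls back to the Trainer defaults. For reference, a hedged sketch of the options most commonly tuned for a run like this (the values are illustrative assumptions, not the settings that produced the results below):

training_args = TrainingArguments(
    output_dir="result",
    num_train_epochs=3,              # matches the default of 3 epochs
    per_device_train_batch_size=8,   # matches the default of 8
    learning_rate=5e-5,              # matches the default of 5e-5
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    save_strategy="epoch",           # checkpoint at the end of every epoch
    logging_steps=100,               # log training loss every 100 steps
)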
Result:
A checkpoint is generated under the result directory:
[ipa@comm-agi checkpoint-500]$ pwd
/data2/result/checkpoint-500
[ipa@comm-agi checkpoint-500]$ ll
total 5098944
-rw-rw-r-- 1 ipa ipa 1141 Sep 5 14:59 config.json
-rw-rw-r-- 1 ipa ipa 3480872507 Sep 5 14:59 optimizer.pt
-rw-rw-r-- 1 ipa ipa 1740408181 Sep 5 14:59 pytorch_model.bin
-rw-rw-r-- 1 ipa ipa 14575 Sep 5 14:59 rng_state.pth
-rw-rw-r-- 1 ipa ipa 627 Sep 5 14:59 scheduler.pt
-rw-rw-r-- 1 ipa ipa 533 Sep 5 14:59 trainer_state.json
-rw-rw-r-- 1 ipa ipa 4027 Sep 5 14:59 training_args.bin
Validation:
Code:
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deberta-v3-large")
# Load the fine-tuned weights from the checkpoint produced above.
model = AutoModelForSequenceClassification.from_pretrained("result/checkpoint-500", num_labels=6)

emotions = load_dataset("emotion")

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

tokenized_emotions = emotions.map(tokenize, batched=True, batch_size=None)
tokenized_emotions.set_format("torch", columns=["input_ids", "attention_mask", "label"])

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(output_dir="result")
trainer = Trainer(model=model,
                  args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=tokenized_emotions["train"],
                  eval_dataset=tokenized_emotions["validation"])

# evaluate() runs the model over eval_dataset and returns the metrics dict.
results = trainer.evaluate()
print(results)
Result:
{
'eval_loss': 0.24527376890182495,
'eval_accuracy': 0.923,
'eval_f1': 0.9251701895610343,
'eval_runtime': 26.781,
'eval_samples_per_second': 74.68,
'eval_steps_per_second': 1.195
}
- eval_loss: the loss computed on the validation set.
- eval_accuracy: the accuracy computed on the validation set.
- eval_f1: the F1 score computed on the validation set.
- eval_runtime: total wall-clock time of the evaluation run.
- eval_samples_per_second: number of validation samples processed per second.
- eval_steps_per_second: number of evaluation steps processed per second.
F1 score
The F1 score ranges from 0 to 1, and higher values indicate a better model. As a rough rule of thumb, the range can be divided into the following bands:
- F1 < 0.1: very poor performance; the predictions are essentially useless;
- 0.1 to 0.3: poor performance; the predictions have some accuracy, but there is a lot of room for improvement;
- 0.3 to 0.5: mediocre performance; the predictions are somewhat reliable, but there is still room to improve;
- 0.5 to 0.7: fairly good performance; the predictions are reasonably accurate, with room for further improvement;
- 0.7 to 0.9: good performance; the predictions are highly accurate, though further gains are still possible;
- F1 > 0.9: excellent performance; the predictions are close to perfect.
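For intuition about the weighted F1 used in compute_metrics, here is a tiny worked example (toy labels, not real model output):

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]

# Per-class F1 is 0.8 (class 0), 0.67 (class 1) and 1.0 (class 2); "weighted"
# averages them with weights proportional to each class's support.
print(f1_score(y_true, y_pred, average="weighted"))  # ~0.82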
Prediction:
Code:
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained("result/checkpoint-500", num_labels=6)

emotions = load_dataset("emotion")

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

tokenized_emotions = emotions.map(tokenize, batched=True, batch_size=None)
tokenized_emotions.set_format("torch", columns=["input_ids", "attention_mask", "label"])

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(output_dir="result")
trainer = Trainer(model=model,
                  args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=tokenized_emotions["train"],
                  eval_dataset=tokenized_emotions["validation"])

# Prediction currently runs on the held-out test split; predicting on ad-hoc
# console input is left as future work (see the summary below).
predictions = trainer.predict(tokenized_emotions["test"])
predicted_labels = predictions.predictions.argmax(-1).tolist()
print(predicted_labels)
Result:
[0, 0, 0, 1, 0, 4, 4, 2, 1, 3, 4, 0, 4, 1, 2, 0, 1, 0, 3, 1, 2, 1, 1, 4, 0, 4, 3, 0, 4, 3, 4, 3, 0, 3, 0, 1, 1, 0, 1, 1, 3, 0, 1, 0, 1, 3, 1, 1, 4, 4, 0, 4, 1, 0, 1, 0, 0, 1, 0, 3, 0, 0, 1, 1, 0, 5, 0, 4, 4, 5, 1, 2, 5, 2, 2, 3, 1, 0, 1, 2, 1, 3, 0, 1, 0, 0, 2, 1, 1, 0, 1, 4, 3, 4, 3, 3, 2, 0, 4, 0, 0, 0, 0, 4, 3, 3, 1, 1, 5, 0, 1, 2, 4, 1, 0, 1, 1, 4, 0, 1, 0, 2, 0, 3, 0, 2, 0, 4, 0, 0, 1, 2, 0, 3, 3, 1, 4, 4, 0, 1, 1, 0, 4, 1, 1, 0, 1, 4, 4, 2, 4, 2, 4, 0, 1, 0, 1, 1, 3, 0, 3, 3, 1, 4, 4, 1, 2, 2, 2, 0, 2, 3, 1, 1, 0, 3, 1, 1, 0, 0, 4, 1, 0, 2, 4, 0, 1, 1, 5, 3, 1, 0, 1, 3, 0, 0, 4, 4, 1, 1, 1, 2, 2, 1, 1, 1, 2, 4, 4, 1, 3, 0, 1, 4, 1, 0, 3, 0, 3, 3, 1, 4, 5, 1, 1, 1, 3, 1, 2, 4, 0, 0, 5, 1, 4, 1, 0, 1, 0, 1, 0, 2, 4, 1, 0, 0, 0, 3, 1, 5, 0, 4, 0, 3, 2, 0, 1, 1, 0, 3, 3, 3, 2, 3, 3, 2, 1, 1, 3, 0, 3, 3, 4, 0, 3, 0, 1, 4, 3, 0, 1, 0, 0, 1, 1, 1, 2, 0, 1, 1, 5, 4, 2, 0, 2, 5, 3, 0, 0, 3, 1, 3, 2, 2, 2, 4, 4, 2, 4, 4, 1, 0, 4, 1, 3, 3, 2, 5, 4, 4, 0, 0, 4, 1, 0, 0, 2, 3, 0, 0, 4, 0, 0, 2, 0, 2, 4, 2, 1, 0, 3, 2, 5, 1, 1, 1, 1, 4, 4, 1, 0, 0, 0, 2, 1, 2, 2, 0, 0, 1, 3, 0, 0, 3, 1, 1, 2, 0, 3, 2, 0, 1, 1, 3, 1, 0, 2, 0, 3, 0, 4, 3, 5, 4, 1, 3, 3, 4, 1, 0, 0, 2, 0, 0, 4, 0, 3, 1, 0, 4, 4, 1, 5, 0, 2, 1, 1, 0, 2, 0, 4, 3, 1, 1, 3, 5, 4, 0, 4, 4, 1, 5, 4, 1, 0, 1, 0, 1, 3, 4, 3, 0, 4, 4, 4, 0, 2, 3, 3, 0, 1, 5, 3, 0, 0, 1, 0, 0, 1, 3, 0, 3, 0, 0, 4, 0, 2, 5, 1, 3, 4, 1, 0, 0, 0, 1, 2, 0, 1, 5, 4, 0, 1, 1, 3, 1, 4, 3, 1, 4, 0, 0, 3, 1, 1, 0, 4, 3, 1, 2, 1, 0, 1, 4, 1, 1, 2, 0, 1, 4, 3, 4, 4, 1, 1, 1, 0, 3, 0, 0, 1, 1, 1, 2, 4, 1, 1, 1, 1, 2, 1, 3, 0, 3, 1, 0, 4, 0, 3, 2, 1, 2, 1, 1, 1, 2, 4, 4, 1, 0, 2, 1, 2, 5, 0, 3, 2, 1, 1, 0, 1, 0, 5, 0, 1, 1, 0, 3, 1, 1, 1, 3, 0, 1, 4, 0, 0, 0, 0, 1, 1, 0, 1, 0, 2, 2, 2, 3, 4, 1, 1, 1, 0, 1, 0, 4, 4, 0, 0, 1, 4, 0, 0, 2, 0, 2, 5, 1, 0, 4, 0, 3, 0, 3, 3, 0, 1, 1, 2, 1, 3, 0, 1, 5, 2, 2, 0, 1, 1, 1, 0, 2, 1, 4, 3, 4, 4, 2, 1, 2, 2, 1, 4, 1, 0, 1, 1, 2, 1, 2, 1, 0, 1, 2, 1, 1, 1, 1, 3, 0, 2, 1, 0, 1, 3, 1, 0, 3, 4, 1, 1, 0, 3, 0, 3, 3, 4, 0, 4, 0, 2, 5, 1, 1, 1, 1, 3, 0, 1, 0, 0, 5, 1, 0, 0, 5, 0, 4, 3, 0, 0, 0, 4, 0, 4, 1, 0, 4, 3, 1, 3, 1, 5, 0, 0, 4, 1, 1, 1, 0, 0, 3, 0, 2, 4, 2, 1, 4, 3, 1, 1, 4, 1, 1, 3, 0, 4, 0, 1, 2, 1, 0, 0, 1, 0, 2, 4, 4, 3, 3, 1, 1, 1, 0, 1, 1, 2, 1, 0, 0, 0, 1, 1, 0, 4, 0, 3, 3, 1, 2, 1, 1, 4, 1, 0, 2, 2, 2, 0, 3, 1, 1, 3, 1, 0, 1, 0, 4, 1, 1, 4, 1, 1, 3, 0, 1, 1, 1, 1, 2, 1, 0, 4, 1, 1, 5, 0, 1, 0, 1, 1, 1, 2, 0, 0, 1, 0, 0, 1, 0, 3, 0, 1, 5, 1, 2, 1, 0, 0, 1, 4, 2, 4, 3, 4, 1, 0, 0, 0, 0, 4, 1, 5, 2, 4, 0, 3, 0, 0, 0, 1, 0, 1, 1, 5, 2, 0, 0, 4, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 4, 0, 0, 1, 0, 0, 1, 0, 2, 1, 4, 0, 0, 1, 5, 0, 1, 1, 5, 4, 3, 0, 4, 0, 1, 2, 0, 2, 1, 2, 1, 3, 1, 2, 2, 1, 5, 0, 1, 3, 2, 5, 1, 0, 3, 0, 3, 0, 2, 0, 1, 0, 1, 1, 0, 4, 2, 0, 0, 1, 0, 4, 2, 3, 1, 0, 4, 1, 1, 3, 1, 0, 3, 1, 0, 2, 3, 1, 1, 1, 4, 4, 0, 0, 5, 0, 0, 1, 0, 0, 1, 1, 0, 4, 1, 1, 2, 1, 2, 2, 0, 0, 1, 1, 0, 1, 3, 3, 4, 3, 1, 1, 3, 0, 1, 3, 4, 2, 3, 1, 0, 2, 0, 3, 2, 1, 1, 1, 0, 0, 0, 2, 1, 2, 4, 5, 2, 1, 0, 2, 1, 3, 5, 3, 1, 4, 3, 0, 1, 4, 1, 1, 2, 0, 2, 0, 2, 4, 1, 0, 3, 0, 1, 4, 4, 4, 4, 1, 0, 1, 3, 0, 3, 0, 1, 1, 4, 4, 1, 4, 1, 1, 0, 1, 0, 3, 1, 5, 5, 1, 5, 1, 3, 5, 1, 0, 1, 0, 2, 0, 2, 5, 1, 2, 0, 1, 3, 1, 1, 5, 0, 2, 1, 3, 0, 0, 1, 3, 3, 0, 2, 3, 2, 0, 3, 0, 2, 0, 1, 4, 3, 2, 4, 0, 4, 0, 3, 0, 3, 5, 3, 0, 0, 4, 1, 0, 1, 4, 3, 0, 2, 4, 4, 1, 0, 1, 2, 0, 3, 0, 3, 1, 2, 0, 0, 1, 0, 1, 1, 3, 0, 1, 4, 0, 0, 1, 0, 3, 1, 1, 3, 1, 0, 3, 1, 1, 1, 1, 3, 3, 0, 1, 0, 2, 0, 1, 1, 3, 1, 1, 1, 4, 1, 3, 2, 0, 0, 4, 1, 1, 1, 4, 4, 1, 0, 1, 3, 2, 1, 4, 5, 1, 1, 
4, 4, 0, 5, 0, 0, 4, 0, 0, 0, 2, 0, 0, 0, 4, 1, 3, 0, 5, 0, 0, 5, 2, 2, 0, 1, 1, 0, 0, 0, 0, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 3, 1, 5, 3, 0, 0, 0, 1, 5, 0, 0, 0, 1, 2, 0, 1, 1, 0, 0, 1, 3, 4, 0, 1, 4, 3, 4, 0, 1, 4, 1, 0, 1, 1, 4, 1, 1, 1, 0, 1, 1, 5, 1, 5, 1, 1, 4, 4, 1, 3, 4, 1, 0, 0, 1, 4, 0, 1, 0, 1, 0, 0, 4, 0, 4, 0, 0, 3, 3, 2, 1, 1, 0, 4, 5, 1, 5, 1, 1, 1, 4, 1, 5, 1, 1, 1, 0, 3, 3, 4, 4, 1, 1, 4, 0, 3, 1, 0, 2, 1, 1, 2, 0, 0, 1, 0, 0, 0, 2, 0, 1, 1, 0, 1, 2, 2, 1, 4, 1, 1, 3, 1, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1, 4, 0, 0, 4, 0, 1, 2, 3, 0, 1, 0, 0, 3, 2, 0, 1, 1, 2, 4, 4, 4, 1, 2, 2, 1, 1, 1, 1, 1, 0, 5, 2, 0, 3, 4, 3, 1, 3, 2, 4, 0, 3, 1, 1, 1, 4, 1, 1, 4, 3, 0, 0, 1, 2, 1, 3, 4, 4, 0, 0, 3, 4, 3, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 2, 0, 0, 1, 4, 4, 0, 1, 5, 0, 2, 1, 1, 1, 5, 3, 0, 1, 0, 0, 1, 2, 1, 1, 2, 0, 2, 4, 5, 1, 4, 1, 4, 3, 0, 1, 1, 0, 0, 2, 4, 2, 1, 1, 1, 0, 0, 3, 5, 1, 1, 2, 0, 1, 3, 4, 1, 0, 1, 4, 2, 1, 3, 1, 1, 1, 0, 0, 0, 1, 3, 4, 4, 5, 0, 1, 1, 1, 1, 1, 4, 1, 0, 1, 0, 0, 3, 2, 1, 1, 0, 4, 1, 3, 4, 1, 0, 4, 3, 1, 1, 1, 1, 0, 2, 3, 3, 1, 2, 0, 0, 0, 1, 1, 0, 5, 1, 0, 0, 3, 0, 2, 1, 4, 4, 1, 4, 0, 3, 1, 0, 1, 0, 3, 2, 4, 1, 0, 4, 4, 4, 0, 1, 0, 0, 0, 2, 3, 2, 3, 0, 2, 2, 1, 3, 2, 4, 1, 0, 0, 1, 3, 5, 0, 1, 3, 0, 1, 0, 4, 2, 0, 4, 4, 1, 3, 3, 1, 0, 4, 1, 4, 1, 1, 0, 0, 1, 4, 3, 3, 1, 3, 3, 1, 0, 4, 4, 5, 1, 1, 5, 1, 1, 1, 3, 2, 4, 0, 0, 1, 1, 1, 3, 2, 0, 1, 4, 3, 4, 1, 4, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 1, 1, 1, 4, 1, 3, 0, 5, 3, 3, 2, 3, 0, 1, 2, 2, 1, 1, 1, 1, 0, 3, 1, 1, 1, 3, 0, 0, 0, 3, 1, 2, 0, 1, 0, 1, 0, 1, 2, 2, 3, 1, 3, 0, 2, 0, 4, 1, 4, 1, 0, 5, 1, 3, 1, 3, 1, 1, 0, 2, 3, 2, 2, 1, 5, 4, 4, 3, 3, 2, 4, 3, 0, 4, 4, 2, 1, 5, 4, 0, 1, 0, 0, 1, 1, 0, 2, 1, 1, 0, 4, 1, 4, 1, 2, 2, 1, 1, 0, 1, 1, 0, 2, 1, 5, 5, 1, 1, 0, 0, 4, 0, 1, 3, 4, 4, 1, 0, 4, 0, 0, 5, 0, 1, 3, 0, 3, 5, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 4, 3, 4, 4, 0, 0, 0, 1, 3, 1, 2, 3, 0, 0, 1, 3, 1, 1, 5, 0, 0, 4, 1, 1, 3, 2, 1, 3, 2, 4, 0, 1, 1, 0, 5, 1, 2, 1, 0, 1, 1, 2, 1, 2, 1, 4, 1, 0, 4, 1, 3, 1, 0, 2, 3, 1, 0, 0, 1, 0, 3, 1, 1, 4, 1, 1, 4, 1, 4, 4, 0, 0, 1, 0, 2, 1, 0, 1, 0, 2, 1, 2, 3, 1, 3, 0, 2, 0, 0, 0, 1, 2, 0, 0, 2, 0, 1, 0, 0, 3, 0, 3, 4, 5, 4, 3, 2, 1, 1, 3, 0, 0, 1, 1, 0, 0, 1, 5, 0, 0, 0, 3, 0, 1, 0, 1, 1, 4, 2, 0, 3, 1, 0, 1, 4, 3, 1, 2, 3, 0, 0, 1, 0, 1, 4, 1, 0, 0, 5, 5, 1, 2, 0, 2, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1, 1, 1, 3, 4, 1, 1, 3, 2, 0, 3, 3, 0, 3, 1, 4, 0, 0, 0, 2, 2, 4, 3, 0, 3, 3, 1, 1, 5]
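The integers are class ids. A small sketch for mapping them back to the emotion names, relying on the ClassLabel feature that the emotion dataset ships with (reusing the emotions and predicted_labels variables from the prediction code above):

# The label column is a ClassLabel, so the id -> name mapping can be read
# straight off the dataset; for emotion it is
# ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'].
label_names = emotions["train"].features["label"].names
print([label_names[i] for i in predicted_labels[:10]])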
Judging from the results, the accuracy is quite high.
Summary
1. The transformers Trainer makes it quick to fine-tune a large model.
2. datasets' load_dataset makes loading a dataset very convenient. One open point: the dataset here is still downloaded from the internet, even though load_dataset can also load from local files; a later post will cover load_dataset in detail.
3. After fine-tuning, the model is loaded directly from the checkpoint.
4. Prediction currently uses the test set; extending it to take input from the console needs additional code, which a later post will provide (a rough sketch follows this list).
5. There is a warning to resolve, left here as a known issue:
python3.9/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
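As promised in item 4, a rough sketch of predicting a label for ad-hoc text, assuming the same fine-tuned model and tokenizer already loaded above (a placeholder until the dedicated post, not production code):

import torch

def predict_text(text):
    # Tokenize a single string and run one forward pass; no Trainer needed.
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(-1).item()

# e.g. wire this up to console input later:
print(predict_text("This is a good movie."))        # prints a label id 0-5
print(predict_text("I do not like this product."))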
That wraps up this article on fine-tuning deberta-v3-large for text classification on the emotion dataset.