A Complete Guide to Fine-Tuning Code Llama

1. Introduction

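This article walks through how to fine-tune Code Llama into a capable tool for SQL development. On programming tasks, a suitably fine-tuned Code Llama usually performs much better than plain Llama, especially when it is optimized for one specific task. The approach here is: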

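  • Train on b-mc2/sql-create-context, a collection of natural-language questions paired with their SQL queries

  • Use LoRA: quantize the base model's weights to int8, freeze them, and train only the adapter

  • Follow the alpaca-lora project for most of the setup, with a few improvements and optimizations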

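Taken together, these steps specialize Code Llama for SQL development and should give better results on that task. If you follow the guide step by step, you should come away with a working grasp of fine-tuning.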
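The LoRA idea listed above can be sketched in plain Python. This is a minimal illustration, not the peft implementation: the frozen weight matrix W stays untouched and only two small matrices A and B are trained, so the layer computes W x + (alpha / r) * B (A x). The helper names and the tiny dimensions are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)                 # frozen pretrained weights
    update = matvec(B, matvec(A, x))    # trainable adapter path of rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy layer: 3 inputs, 2 outputs, rank r = 1 (all values are illustrative).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]       # shape (r, d_in)
B_init = [[0.0], [0.0]]     # shape (d_out, r); zero init makes the adapter a no-op at first
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_init, x, alpha=2, r=1)
y_trained = lora_forward(W, A, [[1.0], [0.0]], x, alpha=2, r=1)
```

With B initialized to zeros, the adapted layer starts out identical to the base layer, and training only moves the small A and B matrices.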

2. Fine-Tuning Code Llama

2.1 Install Dependencies

I ran the code in this article on an A100 GPU server with Python 3.10 and CUDA 11.8. The full run took about an hour. (To check portability, I also ran the code on Colab, where it worked well.)

!pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3  # we need latest transformers for this
!pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
!pip install datasets==2.10.1
import locale # colab workaround
locale.getpreferredencoding = lambda: "UTF-8" # colab workaround
!pip install wandb

2.2 Load Libraries

from datetime import datetime
import os
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq

(If you get an import error, try restarting the Jupyter kernel.)

2.3 Load the Dataset

This pulls the dataset from the Hugging Face Hub and holds out 10% of it as an evaluation set, so we can check how the model is doing during training:

from datasets import load_dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")
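# Split once with a fixed seed so the train and eval sets are disjoint and reproducible
# (the seed value is an arbitrary choice).
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]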
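One property worth checking in any train/evaluation split is that both sets come from a single shuffle, so no example lands in both. The sketch below illustrates this with plain Python lists; `split_once` is a hypothetical helper, not part of the datasets library.

```python
import random


def split_once(items, test_size, seed):
    """Shuffle once with a fixed seed, then cut into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train_rows, eval_rows = split_once(rows, test_size=0.1, seed=42)

# Every row ends up in exactly one of the two sets.
overlap = set(train_rows) & set(eval_rows)
print(len(train_rows), len(eval_rows), len(overlap))
```

Two independent random splits would not give this guarantee, because each call shuffles the data differently.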

To load your own dataset instead, do the following:

train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')
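For reference, a JSON Lines file holds one JSON object per line. The sketch below writes and reads a tiny file in that shape. It assumes your records use the same `question`, `context`, and `answer` fields as the prompt template later in this guide, and the sample row is made up for illustration.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts (handy for a quick sanity check)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample row; replace with your own data.
sample_records = [
    {
        "question": "How many users signed up in 2023?",
        "context": "CREATE TABLE users (id INTEGER, signup_year INTEGER)",
        "answer": "SELECT COUNT(*) FROM users WHERE signup_year = 2023",
    }
]
```

Calling `write_jsonl("train_set.jsonl", sample_records)` would produce a file that matches the file name used in the loading code above.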

To inspect any sample in the dataset, just run:

print(train_dataset[3])

2.4 Load the Model

I load Code Llama from Hugging Face in int8 (the standard setup for LoRA):

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

torch_dtype=torch.float16 means that computation is carried out in float16, even though the weights themselves are stored as 8-bit integers.
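To make the storage-versus-compute distinction concrete, here is a simplified absmax quantization sketch in plain Python. It is not what bitsandbytes does internally (its LLM.int8() scheme quantizes per vector and handles outlier values separately), and the function names are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats for use in higher-precision arithmetic."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
q_weights, scale = quantize_absmax(weights)
restored = dequantize(q_weights, scale)

# Stored values are small integers; the restored values are close to the originals.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The integers are what sits in memory; the dequantized floats are what the arithmetic sees, at the cost of a small rounding error.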

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 and that accelerate is >=0.20.3.
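A quick way to see which versions are actually installed is the standard library's importlib.metadata. The sketch below is a small convenience helper under that assumption; `installed_versions` is a hypothetical name, and the package list is simply the libraries named in this guide.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["transformers", "accelerate", "peft"]).items():
    print(f"{name}: {version or 'not installed'}")
```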

2.5 Check the Base Model

First check whether the model can already do what we want:

eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

Output:

SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'

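The question asks only for the class, so this is clearly wrong: the query selects every column and compares class, not frequency_mhz, against 91.5. On to fine-tuning!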
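The decoded text from generate typically contains the prompt followed by the continuation, so it can help to cut out only the part after the `### Response:` marker. A minimal sketch: `extract_response` is a hypothetical helper, and the sample string stands in for real model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text):
    """Return the text after the last response marker, or the whole text if absent."""
    _, marker, tail = decoded_text.rpartition(RESPONSE_MARKER)
    return tail.strip() if marker else decoded_text.strip()


# Stand-in for tokenizer.decode(...) output: prompt followed by the generated SQL.
decoded = (
    "### Input:\nWhich Class ...?\n\n"
    "### Response:\n"
    "SELECT * FROM table_name_12 WHERE class > 91.5"
)
sql = extract_response(decoded)
```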

2.6 Tokenization

Set a few tokenizer options, such as left padding, which makes training use less memory:

tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"

Define a tokenize function that makes labels identical to input_ids. This is essentially self-supervised fine-tuning:

def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
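Copying input_ids into labels works because causal language model heads in transformers shift the labels internally, so the output at position i is trained to predict token i + 1. The sketch below shows that pairing on made-up token ids; it illustrates the idea rather than the library code.

```python
def next_token_pairs(input_ids, labels):
    """Pair each position's input token with the label it is trained to predict."""
    # The model's output at position i is compared with labels[i + 1].
    return list(zip(input_ids[:-1], labels[1:]))


token_ids = [1, 887, 526, 263, 2]   # made-up ids; 1 and 2 stand in for BOS and EOS
pairs = next_token_pairs(token_ids, list(token_ids))
```

Each pair reads as "given this token (and everything before it), predict that one", which is why no separate target sequence is needed.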

Then define a function that turns each data_point into a prompt. I found this prompt format online, and it works well:

def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{data_point["question"]}

### Context:
{data_point["context"]}

### Response:
{data_point["answer"]}
"""
    return tokenize(full_prompt)
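Small whitespace differences between a hand-written training template and a hand-written inference template are easy to introduce. One way to avoid them is to build both from a single function. This is a sketch under that assumption; `build_prompt` is a hypothetical helper and not part of the original notebook.

```python
INSTRUCTION = (
    "You are a powerful text-to-SQL model. Your job is to answer questions "
    "about a database. You are given a question and context regarding one or "
    "more tables.\n\n"
    "You must output the SQL query that answers the question."
)


def build_prompt(question, context, answer=None):
    """Build a training prompt (with answer) or an inference prompt (without)."""
    prompt = (
        f"{INSTRUCTION}\n\n"
        f"### Input:\n{question}\n\n"
        f"### Context:\n{context}\n\n"
        f"### Response:\n"
    )
    if answer is not None:
        prompt += f"{answer}\n"
    return prompt


example_context = "CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)"
inference_prompt = build_prompt("Which Class ...?", example_context)
training_prompt = build_prompt("Which Class ...?", example_context, answer="SELECT class FROM table_name_12")
```

The training prompt is then exactly the inference prompt plus the answer, so the model sees the same context at generation time as it did during training.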

Apply the prompt formatting and tokenization to every sample to build the tokenized training and evaluation datasets:

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

2.7 Set Up LoRA

Set up a standard LoRA configuration and attach it to the base model:

model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
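A rough count shows how small the trainable part is. The sketch assumes the 7B model has 32 decoder layers with a hidden size of 4096 and square q/k/v/o projections; treat those figures as assumptions and check them against the model's config. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, d_in, d_out, modules_per_layer, num_layers):
    """Each adapted matrix gains A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


# Assumed architecture figures for a 7B Llama-style model (verify in config.json).
trainable = lora_param_count(r=16, d_in=4096, d_out=4096, modules_per_layer=4, num_layers=32)
scaling = 16 / 16  # lora_alpha / r, the factor applied to the adapter output

print(f"trainable LoRA parameters: {trainable:,}")
print(f"adapter scaling factor: {scaling}")
```

Under these assumptions the adapter has roughly 16.8 million parameters, a fraction of a percent of a 7-billion-parameter model.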

To resume from a checkpoint, set resume_from_checkpoint to the path of the adapter_model.bin file you want to resume from:

resume_from_checkpoint = "" # set this to the adapter_model.bin file you want to resume from

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")
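If you are unsure which checkpoint is the most recent, a helper like the one below can pick it. It assumes the Trainer saved its checkpoints as `checkpoint-<step>` folders inside the output directory; `latest_checkpoint_dir` is a hypothetical name, and which files each folder contains depends on your library versions.

```python
import os
import re


def latest_checkpoint_dir(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps numerically so checkpoint-100 beats checkpoint-20.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The function returns None when nothing has been saved yet, so check the result before joining it with a file name.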

Optionally, set up Weights & Biases to view training plots:

wandb_project = "sql-try2-coder"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project
if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True
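If you would rather make Weights & Biases logging switchable, one option is to derive the `report_to` value from the project name. This is a sketch of that idea, not the notebook's code; `configure_reporting` is a hypothetical helper.

```python
import os


def configure_reporting(wandb_project):
    """Set the W&B project if one is given and return the matching report_to value."""
    if wandb_project:
        os.environ["WANDB_PROJECT"] = wandb_project
        return "wandb"
    return "none"


report_to = configure_reporting("")  # empty project name: W&B logging is skipped
```

The returned string can be passed as `report_to` when building the training arguments.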

2.8 Train the Model

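If you run out of GPU memory, lower per_device_train_batch_size. gradient_accumulation_steps is derived from it, so the effective batch size stays the same and the training dynamics are not affected. All other values are standard settings that you can leave unchanged: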

batch_size = 128
per_device_train_batch_size = 32
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "sql-code-llama"

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        load_best_model_at_end=False,
        group_by_length=True, # group sequences of roughly the same length together to speed up training
        report_to="wandb", # if use_wandb else "none",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
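The collator's job can be pictured with a plain-Python sketch: pad every sequence in a batch to a common length rounded up to a multiple of 8, fill input_ids with the pad id, and fill labels with -100 so that padded positions are ignored by the loss. This is a simplified picture rather than the library implementation; left padding and pad id 0 follow the tokenizer settings from section 2.6, and `pad_batch` is a hypothetical name.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100, multiple_of=8):
    """Left-pad input_ids and labels to a shared length that is a multiple of `multiple_of`."""
    longest = max(len(item["input_ids"]) for item in batch)
    target = -(-longest // multiple_of) * multiple_of  # round up to the next multiple
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for item in batch:
        gap = target - len(item["input_ids"])
        padded["input_ids"].append([pad_token_id] * gap + item["input_ids"])
        padded["attention_mask"].append([0] * gap + [1] * len(item["input_ids"]))
        padded["labels"].append([label_pad_id] * gap + item["labels"])
    return padded


batch = [
    {"input_ids": [1, 5, 6, 2], "labels": [1, 5, 6, 2]},
    {"input_ids": [1, 7, 2], "labels": [1, 7, 2]},
]
padded = pad_batch(batch)
```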

Next come a few PyTorch-related optimizations, which only make training faster and do not affect accuracy:

model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if torch.__version__ >= "2" and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)
trainer.train()

The training run above takes about 1 hour on an A100.
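The numbers in the training arguments translate into a simple budget. The sketch below computes the effective batch size and the total number of samples processed, and approximates the learning-rate curve assuming the Trainer's default linear schedule (warmup followed by linear decay to zero). The helper names are hypothetical.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Samples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def linear_schedule_lr(step, peak_lr, warmup_steps, max_steps):
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0, max_steps - step) / (max_steps - warmup_steps)


samples_per_step = effective_batch_size(per_device_batch=32, accumulation_steps=128 // 32)
samples_seen = samples_per_step * 400

print(f"effective batch size: {samples_per_step}")
print(f"samples processed in 400 steps: {samples_seen}")
for step in (0, 50, 100, 250, 400):
    print(step, linear_schedule_lr(step, peak_lr=3e-4, warmup_steps=100, max_steps=400))
```

Halving per_device_batch and doubling accumulation_steps leaves both totals unchanged, which is the point of deriving one value from the other.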

2.9 Load the Final Checkpoint

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

To load the fine-tuned LoRA/QLoRA adapter, use PeftModel.from_pretrained. output_dir should be a directory that contains adapter_config.json and adapter_model.bin:

from peft import PeftModel
model = PeftModel.from_pretrained(model, output_dir)

Try the same prompt as before:

eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

Model output:

SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"

The output shows that fine-tuning worked! You can also convert this adapter into a Llama.cpp model to run it locally.
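A quick way to sanity-check generated queries is to run them against an empty table built from the prompt's CREATE TABLE statement. The sketch uses Python's built-in sqlite3 module; `query_runs` is a hypothetical helper, SQLite's dialect may differ from your target database, and the example writes the string literal with single quotes. It only tells you that a query executes, not that it answers the question.

```python
import sqlite3


def query_runs(create_statement, query):
    """Return True if the query executes against an empty copy of the schema."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(create_statement)
        connection.execute(query).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        connection.close()


schema = "CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)"
ok = query_runs(
    schema,
    "SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = 'hyannis, nebraska'",
)
bad = query_runs(schema, "SELECT missing_column FROM table_name_12")
```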

Full code as a Jupyter Notebook:

https://github.com/Crossme0809/frenzyTechAI/blob/main/fine-tune-code-llama/finetunecode_llama.ipynb

3. References

[1]. Alpaca-LoRA:

https://github.com/tloen/alpaca-lora

[2]. LoRA Paper:

https://arxiv.org/abs/2106.09685

[3]. Sql-Create-Context:

https://huggingface.co/datasets/b-mc2/sql-create-context
