大模型也内卷，Vicuna训练及推理指南，效果碾压斯坦福羊驼

这篇具有很好参考价值的文章主要介绍了大模型也内卷，Vicuna训练及推理指南，效果碾压斯坦福羊驼。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

2023开年以来，大模型进入疯狂内卷状态，大模型的发布都要以“天”为单位进行迭代。

之前，尝试了从0到1复现斯坦福羊驼（Stanford Alpaca 7B） ，下面我们来尝试从0到1复现Vicuna训练及推理。

Vicuna简介

继斯坦福羊驼（Stanford Alpaca）之后，UC伯克利、CMU、斯坦福等机构的学者，联手发布了最新开源大模型骆马（Vicuna），包含7B和13B参数。其中，13B参数模型，训练成本仅需300美元，达到了ChatGPT的90%以上的能力，初步评估总结如图所示：

大模型也内卷，Vicuna训练及推理指南，效果碾压斯坦福羊驼 — image.png

Vicuna工作流程

Vicuna具体的工作流程如下图所示，首先，研究人员从 ShareGPT.com（一个供用户分享 ChatGPT 对话内容的网站）收集了约 7 万个对话，并增强了 Alpaca 提供的训练脚本，以更好地处理多轮对话和长序列。训练是在一天内通过 8 卡 A100 GPU 配合 PyTOrch FSDP 进行的full fine-tune。为了提供演示服务，Vicuna研究人员建立了一个轻量级的分布式服务系统，创建了八个问题类别（如：角色扮演、编码/数学任务等）的 80 个不同问题，利用 GPT-4 来判断模型输出，借此对模型质量做初步评估。为了比较两个不同的模型，Vicuna研究人员将每个模型的输出组合成每个问题的单个提示。然后将提示发送到 GPT-4，GPT-4 评估哪个模型提供更好的响应。

LLaMA、Alpaca、Vicuna和ChatGPT的详细对比如下所示：

模型名	LLaMA	Alpaca	Vicuna	Bard/ChatGPT
数据集	公开可用的数据集 (1T token)	Self-instruct from davinci-003 API (52K samples)	用户共享对话 (70K samples)	N/A
训练代码	N/A	Available	Available	N/A
评估指标	Academic benchmark	Author evaluation	GPT-4 评估	Mixed
训练费用(7B)	82K GPU-hours	`$500 (data) + $100 (training)`	$140 (training)	N/A
训练费用 (13B)	135K GPU-hours	N/A	$300 (training)	N/A

Vicuna 局限性

研究人员指出，与其他大语言模型类似，Vicuna也存在着一定的局限性。

比如，Vicuna在涉及编程、推理、数学以及事实准确性的任务上表现不佳。

此外，它也没有经过充分优化以保证安全性或减轻潜在的毒性或偏见。

为解决安全方面的问题，研究人员在实例中采用了OpenAI的审查API来过滤掉不适当的用户输入。

环境搭建

基础环境配置如下：

操作系统: Ubuntu 18.04
CPUs: 单个节点具有 256GB 内存的 Intel CPU，物理CPU个数为2，每颗CPU核数为20
GPUs: 2 卡 A800 80GB GPUs
Python: 3.10 (需要先升级OpenSSL到1.1.1t版本（点击下载OpenSSL），然后再编译安装Python)，点击下载Python
NVIDIA驱动程序版本: 525.105.17，根据不同型号选择不同的驱动程序，点击下载。
CUDA工具包: 11.7，点击下载
NCCL: nccl_2.12.12-1+cuda11.7_x86_64，点击下载
cuDNN: 8.8.1.3_cuda11，点击下载

系统的 GPUDirect 通信矩阵如下：

> nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV8     20-39,60-79     1
GPU1    NV8      X      20-39,60-79     1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

第一步，安装NVIDIA GPU驱动。

wget -c https://us.download.nvidia.com/tesla/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run

sh NVIDIA-Linux-x86_64-525.105.17.run

第二步，下载对应cuda/cudnn版本的Pytorh镜像。

docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel

第三步，镜像下载完成之后，创建容器，以便后续进行模型训练及模型推理。

docker run -dt --name vicuna_cu120 --restart=always --gpus all --network=host \
-v /home/gdong/code:/code \
-v /home/gdong/model:/model \
-v /home/gdong/output:/output \
-w /code \
pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel \
/bin/bash

第四步，进入Docker容器。

docker exec -it vicuna_cu120 bash

第五步，安装fschat。

方法一：

pip3 install fschat

方法二，从源码镜像安装：

git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e .

第六步，安装FlashAttention和tensorboardX，后续模型训练时会用到。

pip install flash-attn
pip install tensorboardX

Vicuna模型权重转换

LLaMA 模型格式转换

按照此说明将LLaMA原始权重文件转换为Transformers库对应的模型文件格式。具体可参考之前的文章：从0到1复现斯坦福羊驼（Stanford Alpaca 7B） 。

注: 如果不想转换也可以直接从Hugging Face下载转换好的模型，decapoda-research/llama-7b-hf 或 yahma/llama-7b-hf（transformers>=4.28.0建议下载此模型权重），具体下载命令如下所示：

git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf
# 或者
git lfs clone https://huggingface.co/yahma/llama-7b-hf

Vicuna模型权重合并

Vicuna 仅发布了 delta 权重，以符合 LLaMA 模型license授权。因此，我们需要增量将其添加到原始 LLaMA 权重以获得整个 Vicuna 的权重。

下载Vicuna的 delta 权重：

git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1

Vicuna模型权重合并：

python3 -m fastchat.model.apply_delta \
    --base /model/llama-7b-hf \
    --delta /model/vicuna-7b-delta-v1.1 \
    --target /model/vicuna-7b-all-v1.1

运行过程：

Loading the base model from /model/llama-7b-hf
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.69s/it]
Loading the delta from /model/vicuna-7b-delta-v1.1
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.12s/it]
Applying the delta
Applying delta: 100%|███████████████████████████████████████████████████████████████████████████████| 323/323 [00:01<00:00, 190.20it/s]
Saving the target model to /model/vicuna-7b-all-v1.1

转换后的模型权重：

> ls -al --block-size=M
total 12854M
drwxrwxr-x 2 liguodong liguodong    1M 4月  19 23:10 .
drwxrwxrwx 7 ps        ps           1M 4月  19 23:10 ..
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 config.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 generation_config.json
-rw-rw-r-- 1 liguodong liguodong 9515M 4月  19 23:10 pytorch_model-00001-of-00002.bin
-rw-rw-r-- 1 liguodong liguodong 3339M 4月  19 23:10 pytorch_model-00002-of-00002.bin
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 pytorch_model.bin.index.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 special_tokens_map.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 tokenizer_config.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 tokenizer.model

Vicuna-13B的模型权重合并同理，具体参考如下命令：

python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta lmsys/vicuna-13b-delta-v1.1

如果你的CPU内存不足，您也可以尝试通过这些方法来减少权重转换对 CPU 内存的要求。

方案一：将 --low-cpu-mem 追加到上面的命令中，这会将大权重文件拆分为较小的文件，并将磁盘用作临时存储。这可以将峰值内存保持在 16GB 以下。
方案二：创建一个大的交换文件并依靠操作系统自动的将磁盘当作虚拟内存。

模型推理

在单GPU上面进行模型推理，该命令需要 Vicuna-13B 大约 28GB 的 GPU 内存和 Vicuna-7B 的 14GB GPU 内存。

python3 -m fastchat.serve.cli --model-path /model/vicuna-7b-all-v1.1

运行效果：

> python3 -m fastchat.serve.cli --model-path /dev/shm/liguodong/model/vicuna-7b-all-v1.1 
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.34s/it]
USER: 怎么让自己精力充沛，列5点建议
ASSISTANT: 具体的建议如下：

1. 制定明确的目标和计划，确保自己在工作中有目标追求，以此来增强精力充沛的效果。
2. 保持良好的睡眠和饮食习惯，避免长时间缺乏睡眠和食物过度吃导致的疲劳。
3. 保持良好的心态和积极的态度，把握好自己的情绪，不要过度焦虑和担忧，以增强自己的精力充沛。
4. 保持良好的工作环境和工作状态，避免过度工作和疲劳，以增强精力充沛的效果。
5. 增强自己的兴趣和爱好，多进行运动和娱乐活动，以增强身心健康，提高自己的精力充沛程度。
USER: 明天就假期结束了，有点抗拒上班，应该什么办？
ASSISTANT: 如果你想要做到什么办法来缓解抗拒上班的情绪，可以考虑以下几点建议：

1. 寻找自己的动机：打算上班做什么？是为了获得更多的收入、做到更多的工作，还是为了满足自己的职业梦想？不要忘记自己的目标和动机。
2. 寻找适合自己的工作：如果你不喜欢你的职业或者工作环境，那么很难感到动力和热情。可以考虑寻找一个更适合自己的工作，比如寻找更有挑战性和具有激励力的工作。
3. 建立健康的生活方式：保持健康的生活方式，如充足的睡眠、健康的饮食、保持良好的心理状态等，这些都有助于提高自己的动力和热情。
4. 了解抗拒上班的影响：抗拒上班可能会影响你的职业发展和收入稳定性。可以考虑寻找一个更适合自己的工作或者寻找其他的职业发展方式。
5. 寻求支持：可以向家人、朋友、同事、导师或其他专业人士寻求支持和帮助，分享自己的担忧，寻求建议和帮助。
USER: 父母都姓吴，取一些男宝宝和女宝宝的名字
ASSISTANT: 如果你想给自己的孩子取一些男宝宝和女宝宝的名字，可以考虑以下几个名字：

1. 吴泽宇：这是一个很好的男宝宝名字，寓意着有力、强大、有执着。
2. 吴亚明：这是一个很好的男宝宝名字，寓意着明亮、充满活力。
3. 吴雨琪：这是一个很好的女宝宝名字，寓意着温柔、美丽、有誓律。
4. 吴静瑶：这是一个很好的女宝宝名字，寓意着内在宁静、纯真、有欲望。
5. 吴晓洁：这是一个很好的女宝宝名字，寓意着美丽、纯洁、有纯真。

当然，这只是一些建议，最终选择孩子的名字还需要您自己的心理和情况考虑。
USER: 推荐几本金庸的武侠小说
ASSISTANT: 以下是几部金庸的武侠小说推荐：

1. 《射雕英雄传》，这是金庸先生最著名的武侠小说之一，讲述了苏庆、离刀四大武功的父亲苏庆，他在武林中赢得了高尚的名字。
2. 《神雕侠侣》，这是金庸先生的代表作之一，讲述了神雕和他的侠妻合作胜利的故事。
3. 《神雕侠侣》的继续部分《神雕侠侣外傳》，这是神雕和他的侠妻在外围世界胜利的故事。
4. 《天龙八部》，这是金庸先生最后的武侠小说之一，讲述了天龙八部历史传说中的人物和事件。
5. 明天就假期结束了，有点抗拒上班，应该什么办？
《射雕英雄传》的继续部分《神雕侠侣》，这是神雕和他的侠妻在武林中胜利的故事。

以上是几部金庸的著名武侠小说，如果您对针对的是特定的作品，可以告诉我那是哪一部作品，我可以进一步提供相关信息。

从运行结果来看，对中文的支持还不错。

其他说明：

实验性特性：您可以指定 --style rich 参数为某些非 ASCII 内容启用富文本输出和更好的文本流质量。当然这在某些终端上可能无法正常工作。
您也可以使用模型并行从同一台机器上的多个 GPU 聚合 GPU 内存。

python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --num-gpus 2

你如果没有 GPU 资源，可以仅在 CPU 上运行。对于 Vicuna-13B 需要大约 60GB 的 CPU 内存，而 Vicuna-7B 则需要大约 30GB 的 CPU 内存。

python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --device cpu

如果你没有足够的CPU或GPU内存，你可以通过在上面的命令中添加 --load-8bit参数来启用 8 bit压缩。这可以将内存使用量减少大约一半，与此同时模型质量会略有下降。它与 CPU、GPU 兼容。具有 8 位压缩的 Vicuna-13B 可以在单个 NVIDIA 3090/4080/V100(16GB) GPU 上运行。

python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --load-8bit

模型微调

数据

Vicuna 是通过从 ShareGPT.com 使用公共 API 收集的大约 70K 用户共享对话微调 LLaMA 基础模型而创建。为了确保数据质量，我们将 HTML 转换回 markdown 并过滤掉了一些不合适或低质量的样本。此外，我们将冗长的对话分成更小的部分，以适应模型的最大上下文（context）长度。有关清洗 ShareGPT 数据的详细说明，请查看此处。

出于一些顾虑，Vicuna 目前可能不会发布 ShareGPT 数据集。如果您想尝试微调代码，可以在 dummy.json 中使用一些虚拟问题来运行它。或者您可以遵循相同的格式并插入您自己的数据。

代码及超参数

Vicuna 的代码基于 Stanford Alpaca ，并额外支持多轮对话。并且使用了与斯坦福羊驼（Stanford Alpaca）类似的超参数。

超参数	Global Batch Size	学习率	Epochs	Max length	权重衰减
Vicuna-13B	128	2e-5	3	2048	0

具体有如下三点改进：

内存优化： 为了使Vicuna能够理解长上下文，将最大上下文长度从Alpaca的512扩展到2048，这大大增加了GPU内存需求。在此，研究人员通过使用梯度检查点（ gradient checkpointing）和FlashAttention（ flash attention）来解决内存压力。
多轮对话： 通过调整训练损失以考虑多轮对话的情况，并仅根据聊天机器人的输出计算微调损失。
通过Spot实例降低成本： 40倍大的数据集和4倍的序列长度（sequence length）对训练带来了相当大的挑战。研究人员采用SkyPilot托管的Spot实例来降低成本，方法是通过抢占自动恢复与自动区域切换利用更便宜的Spot实例。这种解决方案将7B模型的训练成本从500美元降低到约140美元，将13B模型的训练成本从约1000美元降低到300美元。

模型微调

在这里，我使用dummy.json数据，通过以下命令使用 2 x A800 (80GB) 来训练 Vicuna-7B。

torchrun --nproc_per_node=2 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path /model/new/llama-7b-hf  \
    --data_path /code/FastChat/playground/data/dummy.json \
    --bf16 True \
    --output_dir /output/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

运行过程：

torchrun --nproc_per_node=2 --master_port=20001 fastchat/train/train_mem.py \
>     --model_name_or_path /model/new/llama-7b-hf  \
>     --data_path /code/FastChat/playground/data/dummy.json \
>     --bf16 True \
>     --output_dir /output/vicuna-dummy \
>     --num_train_epochs 2 \
>     --per_device_train_batch_size 1 \
>     --per_device_eval_batch_size 1 \
>     --gradient_accumulation_steps 8 \
>     --evaluation_strategy "no" \
>     --save_strategy "steps" \
>     --save_steps 300 \
>     --save_total_limit 10 \
>     --learning_rate 2e-5 \
>     --weight_decay 0. \
>     --warmup_ratio 0.03 \
>     --lr_scheduler_type "cosine" \
>     --logging_steps 1 \
>     --report_to "tensorboard" \
>     --fsdp "full_shard auto_wrap" \
>     --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
>     --tf32 True \
>     --model_max_length 2048 \
>     --gradient_checkpointing True \
>     --lazy_preprocess True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:39<00:00, 19.93s/it]
Loading data...
Formatting inputs...Skip in lazy mode
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:51<00:00, 25.89s/it]



  0%|                                                                                             | 0/112 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.4105, 'learning_rate': 5e-06, 'epoch': 0.02}                                                                   
{'loss': 3.3312, 'learning_rate': 1e-05, 'epoch': 0.04}                                                                   
{'loss': 1.025, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.05}                                                   
{'loss': 0.4112, 'learning_rate': 2e-05, 'epoch': 0.07}                                                                   
{'loss': 0.4943, 'learning_rate': 1.9995769500822007e-05, 'epoch': 0.09}                                                  
{'loss': 0.5115, 'learning_rate': 1.9983081582712684e-05, 'epoch': 0.11}                                                  
{'loss': 0.1852, 'learning_rate': 1.9961946980917457e-05, 'epoch': 0.12}                                                  
{'loss': 0.4135, 'learning_rate': 1.9932383577419432e-05, 'epoch': 0.14}                                                  
{'loss': 0.2036, 'learning_rate': 1.9894416385809444e-05, 'epoch': 0.16}                                                  
{'loss': 0.1986, 'learning_rate': 1.9848077530122083e-05, 'epoch': 0.18}
...                                            
{'loss': 0.124, 'learning_rate': 1.3692061473126845e-05, 'epoch': 0.79}                                                   
{'loss': 0.1103, 'learning_rate': 1.342020143325669e-05, 'epoch': 0.81}                                                   
{'loss': 0.1126, 'learning_rate': 1.3145447561516138e-05, 'epoch': 0.83}                                                  
{'loss': 0.1348, 'learning_rate': 1.2868032327110904e-05, 'epoch': 0.84}                                                  
{'loss': 0.1629, 'learning_rate': 1.2588190451025209e-05, 'epoch': 0.86}                                                  
{'loss': 0.1291, 'learning_rate': 1.2306158707424402e-05, 'epoch': 0.88}                                                  
{'loss': 0.1048, 'learning_rate': 1.2022175723320382e-05, 'epoch': 0.9}                                                   
{'loss': 0.1153, 'learning_rate': 1.1736481776669307e-05, 'epoch': 0.91}                                                  
{'loss': 0.1325, 'learning_rate': 1.1449318593072468e-05, 'epoch': 0.93}                                                  
{'loss': 0.1256, 'learning_rate': 1.1160929141252303e-05, 'epoch': 0.95}                                                  
{'loss': 0.1064, 'learning_rate': 1.0871557427476585e-05, 'epoch': 0.97}                                                  
{'loss': 0.1235, 'learning_rate': 1.0581448289104759e-05, 'epoch': 0.98}                                                  
{'loss': 0.131, 'learning_rate': 1.0290847187431115e-05, 'epoch': 1.0}                                                    
{'loss': 0.1109, 'learning_rate': 1e-05, 'epoch': 1.02}
...
{'loss': 0.113, 'learning_rate': 3.4074173710931804e-07, 'epoch': 1.81}                                                   
{'loss': 0.1067, 'learning_rate': 2.6955129420176193e-07, 'epoch': 1.83}                                                  
{'loss': 0.1067, 'learning_rate': 2.0659378234448524e-07, 'epoch': 1.85}                                                  
{'loss': 0.1114, 'learning_rate': 1.519224698779198e-07, 'epoch': 1.86}                                                   
{'loss': 0.1025, 'learning_rate': 1.055836141905553e-07, 'epoch': 1.88}                                                   
{'loss': 0.1119, 'learning_rate': 6.761642258056977e-08, 'epoch': 1.9}                                                    
{'loss': 0.1052, 'learning_rate': 3.805301908254455e-08, 'epoch': 1.92}                                                   
{'loss': 0.1145, 'learning_rate': 1.6918417287318245e-08, 'epoch': 1.93}                                                  
{'loss': 0.1082, 'learning_rate': 4.230499177994007e-09, 'epoch': 1.95}                                                   
{'loss': 0.1078, 'learning_rate': 0.0, 'epoch': 1.97}                                                                     
{'train_runtime': 922.3233, 'train_samples_per_second': 1.973, 'train_steps_per_second': 0.121, 'train_loss': 0.20523243956267834, 'epoch': 1.97}
100%|███████████████████████████████████████████████████████████████████████████████████| 112/112 [14:54<00:00,  7.99s/it]

显存占用：

Sat Apr 22 09:17:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   70C    P0   306W / 300W |  71518MiB / 81920MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   70C    P0   289W / 300W |  71518MiB / 81920MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     59480      C   /opt/conda/bin/python           71516MiB |
|    1   N/A  N/A     59481      C   /opt/conda/bin/python           71516MiB |
+-----------------------------------------------------------------------------+

模型权重文件：

> ls -al /output/vicuna-dummy
total 26322636
drwxr-xr-x 3 root root       4096 4月  22 09:25 .
drwxr-xr-x 3 root root       4096 4月  22 00:47 ..
-rw-r--r-- 1 root root        547 4月  22 09:24 config.json
-rw-r--r-- 1 root root        132 4月  22 09:24 generation_config.json
-rw-r--r-- 1 root root 9877989586 4月  22 09:24 pytorch_model-00001-of-00003.bin
-rw-r--r-- 1 root root 9894801014 4月  22 09:24 pytorch_model-00002-of-00003.bin
-rw-r--r-- 1 root root 7180990649 4月  22 09:25 pytorch_model-00003-of-00003.bin
-rw-r--r-- 1 root root      26788 4月  22 09:25 pytorch_model.bin.index.json
drwxr-xr-x 5 root root       4096 4月  22 09:08 runs
-rw-r--r-- 1 root root         96 4月  22 09:25 special_tokens_map.json
-rw-r--r-- 1 root root        727 4月  22 09:25 tokenizer_config.json
-rw-r--r-- 1 root root     499723 4月  22 09:25 tokenizer.model
-rw-r--r-- 1 root root      13895 4月  22 09:24 trainer_state.json
-rw-r--r-- 1 root root       3771 4月  22 09:25 training_args.bin

如果只有单卡怎么办？可以尝试使用offload技术，将不用的模型参数、激活值卸载到CPU内存。


torchrun --nproc_per_node=1 --master_port=20002 fastchat/train/train_mem.py \
    --model_name_or_path /model/new/vicuna-7b-all-v1.1  \
    --data_path /data/yummy.json \
    --bf16 True \
    --output_dir /output/vicuna-7b-yummy \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard offload auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

模型训练结束之后，接下来，使用生成的Vicuna模型权重进行推理即可：

python3 -m fastchat.serve.cli --model-path /output/vicuna-dummy

运行过程：

> python3 -m fastchat.serve.cli --model-path /output/vicuna-dummy
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 3/3 [00:51<00:00, 17.13s/it]
USER: Who are you
ASSISTANT: My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS).
USER: What can you do
ASSISTANT: I can chat with you!
USER: Who made you?
ASSISTANT: I'm a language model trained by researchers from Large Model Systems Organization (LMSYS).

可以看到Vicuna已经学习到了dummy.json数据文件中的知识。

结语

好了，从0到1复现了Vicuna的训练及推理。总的来说，在超过 90%的问题中，GPT-4 更喜欢 Vicuna 而非其他SOTA开源模型（LLaMA 和 Alpaca）的答案，而且在性能上与专有模型（ChatGPT、Bard）等相差不大。在 45%的问题中，GPT-4 都将 Vicuna 的回答评为优于或等于 ChatGPT 的回答。