LLM实践-在Colab上使用免费T4 GPU进行Chinese-Llama-2-7b-4bit推理-Toy模板网

这篇具有很好参考价值的文章主要介绍了LLM实践-在Colab上使用免费T4 GPU进行Chinese-Llama-2-7b-4bit推理。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

一、配置环境

1、打开colab，创建一个空白notebook，在[修改运行时环境]中选择15GB显存的T4 GPU.

2、pip安装依赖python包

!pip install --upgrade accelerate
!pip install bitsandbytes transformers_stream_generator

!pip install transformers 
!pip install sentencepiece
!pip install torch
!pip install accelerate

注意此时，安装完accelerate后需要重启notebook，不然报如下错误：

ImportError: Using low_cpu_mem_usage=True or a device_map requires Accelerate: pip install accelerate

注：参考文章内容[1]不能直接运行

二、模型推理

运行加载模型代码

import accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

# 待加载的预模型
model_path = "LinkSoul/Chinese-Llama-2-7b-4bit"

# 分词器
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_4bit=True,
        torch_dtype=torch.float16,
        device_map='auto'
    )
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)        

instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

            If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

下载模型需要耗费一点时间

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Downloading (…)model.bin.index.json: 100%
26.8k/26.8k [00:00<00:00, 1.13MB/s]
Downloading shards: 0%
0/2 [00:00<?, ?it/s]
Downloading (…)l-00001-of-00002.bin: 100%
9.97G/9.98G [04:58<00:00, 38.5MB/s]
Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading (…)neration_config.json: 100%
132/132 [00:00<00:00, 4.37kB/s]

demo1

prompt = instruction.format("What is the meaning of life")
generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, streamer=streamer)

输出：

/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py:224: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.
  warnings.warn(f'Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.')
  
The meaning of life is a philosophical question that has been debated for centuries. There is no one definitive answer, as different people and cultures may have different beliefs and values. Some people believe that the meaning of life is to seek happiness, while others believe that it is to fulfill a higher purpose or to serve a greater good. Ultimately, the meaning of life is a personal and subjective question that each individual must answer for themselves.

demo2

prompt = instruction.format("如何做个不拖延的人")
generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, streamer=streamer)

输出：

答案：不拖延的人是一个很好的目标，但是要成为一个不拖延的人并不容易。以下是一些建议，可以帮助你成为一个不拖延的人：

1. 制定计划：制定一个详细的计划，包括每天要完成的任务和时间表。这样可以帮助你更好地组织时间，并避免拖延。
2. 设定目标：设定个明确的目标，并制定一个实现这个目标的计划。这样可以帮助你更好地了解自己的目标，并更有动力地去完成任务。
3. 克服拖延的心理延的心理是一个常见的问题，但是可以通过一些方法克服。例如，你可以尝试使用一些技巧来克服拖延，如分解任务、使用时间管理工具等。
4. 坚持自己的计划：坚持自己的计划是非常重要的。如果你经常拖延，那么你需要坚持自己的计划，并尽可能地按照计划去完成任务5. 寻求帮助

三、参考链接

[1] Llama-2-7b-4bit推理 https://www.bilibili.com/read/cv25258378/
[2] 原始Kaggle Notebook链接：https://www.kaggle.com/code/tiansztianszs/chinese-llama-2-7b-4bit/notebook文章来源地址https://www.toymoban.com/news/detail-765485.html

到了这里，关于LLM实践-在Colab上使用免费T4 GPU进行Chinese-Llama-2-7b-4bit推理的文章就介绍完了。如果您还想了解更多内容，请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章，希望大家以后多多支持TOY模板网！