LLM - Model Load_in_8bit For LLaMA

I. Introduction

LLM quantization is the process of compressing and optimizing a large language model to reduce its compute and storage requirements.
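To make the idea concrete, here is a minimal sketch of symmetric (absmax) int8 quantization on a handful of weights. It is a generic illustration, not the exact scheme bitsandbytes uses; the function names and sample values are made up for this example.

```python
from array import array

def quantize_int8(weights):
    """Map floats to int8 using one shared scale: the largest |weight| becomes 127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # fall back to 1.0 if all zeros
    return array("b", (round(w / scale) for w in weights)), scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.07, 0.33]   # hypothetical sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes   :", list(q))
print("bytes/value  :", q.itemsize)          # 1 byte instead of 2 (fp16) or 4 (fp32)
print("max abs error:", max_error)
```

Each value now takes one byte, and the rounding error per value is at most half of the scale. That is the trade-off quantization makes: less storage in exchange for a small loss of precision.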

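While working with LLaMA-33B, the author tried to load the model with quantization. Controlling quantization through the plain API parameter failed; passing the configuration through a different dependency (BitsAndBytesConfig) succeeded. The conclusions up front: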

◆ Load_in_8bit ✔️

◆ Load_in_4bit ❌

II. LLaMA Quantization Attempts

1. Load_in_8bit By API ❌

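    from transformers import LlamaForCausalLM

    # args, config, compute_type and config_kwargs come from the rest of the author's script
    model = LlamaForCausalLM.from_pretrained(
        args.base_model,
        config=config,
        torch_dtype=compute_type,
        low_cpu_mem_usage=True,
        load_in_8bit=True,      # request 8-bit loading directly through the API
        device_map='auto',
        **config_kwargs
    )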

Passing load_in_8bit=True directly raises an error:

[Figure: screenshot of the error message]

Installing Accelerate and trying again:

accelerate==0.21.0

[Figure: screenshot of the error message after installing Accelerate]

Following the hint in the message and adding load_in_8bit_fp32_cpu_offload=True still results in an error:

[Figure: screenshot of the error message]

2. Load_in_8Bit By BitsAndBytesConfig ✔️

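    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    # args, config, compute_type, config_kwargs and logger come from the rest of the author's script

    # Load with 8-bit quantization
    if args.quantization != "None":
        # Worked on the author's transformers 4.29.1; newer releases may reject
        # load_in_8bit when quantization_config is passed at the same time
        config_kwargs["load_in_8bit"] = True
        config_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_threshold=6.0
        )
        logger.info("Quantization model to {} bit.".format(args.quantization))

    # Load the base model (or a merged model)
    model = LlamaForCausalLM.from_pretrained(
        args.base_model,
        config=config,
        torch_dtype=compute_type,
        low_cpu_mem_usage=True,
        **config_kwargs
    )
    print('Model Config ', model.config)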

Passing the BitsAndBytesConfig settings into from_pretrained and loading again:

[Memory Usage before model load: 3]
Tokenizer Load Success!
Config Load Success!
08/25/2023 11:31:23 - INFO - __main__ - Quantization model to 8 bit.
Loading checkpoint shards: 100%|██████████| 7/7 [01:53<00:00, 16.28s/it]
Model Config  LlamaConfig {
  "_name_or_path": "/models/Llama-33B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 6656,
  "initializer_range": 0.02,
  "intermediate_size": 17920,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 52,
  "num_hidden_layers": 60,
  "pad_token_id": 0,
  "quantization_config": {
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_8bit": true
  },
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.29.1",
  "use_cache": true,
  "vocab_size": 49954
}

trainable params: 0 || all params: 32767947264 || trainable%: 0.0000
[Memory Usage after model load: 34003]

The original LLaMA-33B model files are about 62 GB; after loading with 8-bit quantization, GPU memory usage is about 33 GB.
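These figures can be sanity-checked with a little arithmetic on the parameter count from the log. The sketch below counts weight storage only and assumes the logged Memory Usage value is in MiB, which the log itself does not state.

```python
GIB = 1024 ** 3

def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB (no activations or buffers)."""
    return num_params * bits_per_param / 8 / GIB

num_params = 32_767_947_264       # "all params" from the log above
fp16_gib = weight_memory_gib(num_params, 16)
int8_gib = weight_memory_gib(num_params, 8)
observed_gib = 34003 / 1024       # assumption: the logged value is in MiB

print(f"fp16 weights: {fp16_gib:.1f} GiB")
print(f"int8 weights: {int8_gib:.1f} GiB")
print(f"observed    : {observed_gib:.1f} GiB")
```

The fp16 estimate comes out at roughly 61 GiB, in line with the size of the original files, and the int8 estimate at roughly 30.5 GiB. The observed usage sits a few GiB above the int8 estimate, most likely because some modules are kept in 16-bit and the runtime needs extra buffers.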

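Tips:

The quantization config contains a parameter named llm_int8_threshold, which is the outlier threshold used during inference. Quantization converts a model's floating-point weights and activations to integers or lower-bit-width floats to reduce memory use and compute cost. Concretely, hidden-state (activation) values whose absolute value is below llm_int8_threshold go through the 8-bit integer computation, while values whose absolute value exceeds llm_int8_threshold are treated as outliers and are computed in floating point (fp16) to preserve precision.

This setting can affect inference speed and accuracy. A lower threshold marks more values as outliers, so more of the computation stays in floating point: precision is better preserved, but inference may be slower. A higher threshold sends more values through the int8 path, which may be faster but can sacrifice some accuracy. The best value depends on the model architecture, the task, the accuracy requirements, and the hardware. The value 6.0 used above is the library default.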
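The effect of the threshold can be sketched on a single dot product. This is a simplified pure-Python illustration of the outlier split, not the bitsandbytes implementation (which works on whole feature dimensions of matrices); the function names and sample values are made up for this example.

```python
def absmax_quantize(values):
    """Symmetric int8 quantization: the largest |value| maps to 127."""
    scale = max((abs(v) for v in values), default=0.0) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dot_with_outlier_split(x, w, threshold=6.0):
    """Dot product of activations x and weights w with an int8 path and a float path."""
    is_outlier = [abs(v) > threshold for v in x]

    # Outlier positions stay in floating point
    fp_part = sum(xi * wi for xi, wi, o in zip(x, w, is_outlier) if o)

    # Everything else is quantized to int8 and accumulated as integers
    xq, sx = absmax_quantize([xi for xi, o in zip(x, is_outlier) if not o])
    wq, sw = absmax_quantize([wi for wi, o in zip(w, is_outlier) if not o])
    int_acc = sum(a * b for a, b in zip(xq, wq))

    # Rescale the integer result back to floating point and combine
    return fp_part + int_acc * sx * sw, sum(is_outlier)

x = [0.5, -1.2, 0.3, 40.0, 2.1, -0.7]   # 40.0 plays the role of an outlier activation
w = [0.8, -0.4, 1.5, 0.05, -0.9, 0.3]
exact = sum(a * b for a, b in zip(x, w))
approx, n_outliers = dot_with_outlier_split(x, w, threshold=6.0)

print("exact   :", exact)
print("approx  :", approx)
print("outliers:", n_outliers)
```

With threshold=6.0 only the value 40.0 takes the floating-point path, so it does not stretch the int8 scale for the remaining values. Lowering the threshold moves more values to the float path; raising it moves more values to the int8 path.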

3. Load_in_4Bit By BitsAndBytesConfig ⚠️

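Besides 8-bit quantization, BitsAndBytesConfig can also be used for 4-bit quantization. The author only got 8-bit loading to work for LLaMA in this environment, so only the Load_in_4bit configuration is given here for readers to try. Note that 4-bit support depends on the installed library versions rather than on the model family: the config printed above shows transformers 4.29.1, which is below the minimum version listed at the end of this section and is a likely reason 4-bit loading was unavailable.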

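import torch
from transformers import BitsAndBytesConfig

config_kwargs["load_in_4bit"] = True
config_kwargs["quantization_config"] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_quant_type="nf4"              # NormalFloat4 data type
)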

Usage is the same as for load_in_8bit: pass config_kwargs to the from_pretrained API as **config_kwargs. In addition, the following package versions are required:

from transformers.utils.versions import require_version

# The version specifiers in the hints are quoted so the shell does not treat ">" as a redirect
require_version("bitsandbytes>=0.39.0", "To fix: pip install 'bitsandbytes>=0.39.0'")
require_version("transformers>=4.30.1", "To fix: pip install 'transformers>=4.30.1'")
require_version("accelerate>=0.20.3", "To fix: pip install 'accelerate>=0.20.3'")
require_version("peft>=0.4.0.dev0", "To fix: pip install git+https://github.com/huggingface/peft.git")