0. Background
A typical personal PC simply cannot run the Mixtral 8x7B large language model in float16, so the model has to be started with its weights quantized to 4-bit or 8-bit.
In actual testing, inference became noticeably faster at 4-bit, while 8-bit inference was still very slow (consistent with the common observation that the bitsandbytes 8-bit LLM.int8() path is often slower at inference than the 4-bit path).
The inference framework used here is FastChat.
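A rough memory estimate shows why float16 is out of reach: Mixtral 8x7B has about 46.7B parameters in total, so the weights alone require roughly

    46.7B params x 2 bytes  (float16) ≈ 93 GB
    46.7B params x 1 byte   (8-bit)   ≈ 47 GB
    46.7B params x 0.5 byte (4-bit)   ≈ 24 GB

(approximate; activations, the KV cache, and quantization overhead come on top).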
1. Modify the code
vi fastchat/model/model_adapter.py
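If FastChat was installed with pip rather than cloned from source, the file lives inside the installed package; one quick way to locate it (plain Python, nothing FastChat-specific):

    python -c "import fastchat, os; print(os.path.dirname(fastchat.__file__))"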
Before:
class MistralAdapter(BaseModelAdapter):
    """The model adapter for Mistral AI models"""

    def match(self, model_path: str):
        return "mistral" in model_path.lower() or "mixtral" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id
        return model, tokenizer
After:
class MistralAdapter(BaseModelAdapter):
    """The model adapter for Mistral AI models"""

    def match(self, model_path: str):
        return "mistral" in model_path.lower() or "mixtral" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        # model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
        # AutoTokenizer / AutoModelForCausalLM come from transformers and are
        # already imported at the top of model_adapter.py.
        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        if "mixtral" in model_path.lower():
            # Load Mixtral with 4-bit quantization (requires bitsandbytes).
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                # attn_implementation="flash_attention_2",
                # load_in_8bit=True,
                load_in_4bit=True,
                **from_pretrained_kwargs,
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                **from_pretrained_kwargs,
            )
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id
        return model, tokenizer
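Note that passing load_in_4bit=True directly to from_pretrained works on the transformers version used here, but newer releases deprecate it in favor of an explicit BitsAndBytesConfig. A minimal sketch of the equivalent call, assuming a recent transformers with bitsandbytes installed (the model path is only an example):

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    import torch

    model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example checkpoint

    # NF4 4-bit quantization with float16 compute; equivalent in spirit to
    # load_in_4bit=True above.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        quantization_config=quant_config,
    )

With the adapter patched, the model starts through FastChat as usual; for example (again, substitute your local Mixtral checkpoint for the path):

    python3 -m fastchat.serve.cli --model-path mistralai/Mixtral-8x7B-Instruct-v0.1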
Done!