[Deep Learning] SDXL TensorRT Inference: Converting Stable Diffusion to ONNX and TensorRT


Converting SDXL to diffusers

juggernautXL_version6Rundiffusion.safetensors is a single-file PyTorch checkpoint, so it first has to be converted to the diffusers directory layout.

def convert_sdxl_to_diffusers(pretrained_ckpt_path, output_diffusers_path):
    import os
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # Use the HF mirror (for users in mainland China)
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # Select which GPU to use

    import torch
    from diffusers import StableDiffusionXLPipeline
    pipe = StableDiffusionXLPipeline.from_single_file(pretrained_ckpt_path, torch_dtype=torch.float16).to("cuda")
    pipe.save_pretrained(output_diffusers_path, variant="fp16")


if __name__ == '__main__':
    convert_sdxl_to_diffusers("/ssd/wangmiaojun/tensorRT_test/juggernautXL_version6Rundiffusion.safetensors",
                              "/ssd/wangmiaojun/tensorRT_test/mj_onnx")
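Before moving on, it can help to confirm that the output directory really has the diffusers layout. The following is a minimal sketch, assuming the usual SDXL diffusers layout (a `model_index.json` plus one sub-folder per component); the helper name `missing_diffusers_parts` is hypothetical and not part of diffusers.

```python
from pathlib import Path

# Sub-folders a diffusers-format SDXL checkpoint normally contains (assumed layout).
EXPECTED_SUBDIRS = (
    "scheduler", "text_encoder", "text_encoder_2",
    "tokenizer", "tokenizer_2", "unet", "vae",
)


def missing_diffusers_parts(model_dir):
    """Return the names of expected entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
    if not (root / "model_index.json").is_file():
        missing.append("model_index.json")
    return missing
```

An empty list from `missing_diffusers_parts("/ssd/wangmiaojun/tensorRT_test/mj_onnx")` would suggest the conversion wrote every expected component.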

FP16 weights are awkward to work with in the later steps, so it is better to save the model in FP32 first:

```python
def convert_sdxl_to_diffusers(pretrained_ckpt_path, output_diffusers_path):
    import os
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # Use the HF mirror (for users in mainland China)
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # Select which GPU to use

    import torch
    from diffusers import StableDiffusionXLPipeline
    # No torch_dtype is passed here, so the weights stay in FP32
    pipe = StableDiffusionXLPipeline.from_single_file(pretrained_ckpt_path).to("cuda")
    pipe.save_pretrained(output_diffusers_path)


if __name__ == '__main__':
    convert_sdxl_to_diffusers("/ssd/wangmiaojun/tensorRT_test/juggernautXL_version6Rundiffusion.safetensors",
                              "/ssd/wangmiaojun/tensorRT_test/mj_onnx32")
```

Converting to ONNX

Once the model is in the diffusers directory layout, it can be exported to ONNX.

Documentation: https://huggingface.co/docs/diffusers/optimization/onnx

pip install -q "optimum[onnxruntime]"

optimum-cli export onnx --model /data/xiedong/fooocus_tensorRT/juggernautXL_version6Rundiffusion_onnx/ --task stable-diffusion-xl juggernautXL_version6Rundiffusion_onnx_optinmum
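If the export is part of a larger script, the same command can be launched from Python so that a failed export stops the pipeline. This is only a sketch: the function names are hypothetical, and the arguments simply mirror the `optimum-cli` call above.

```python
import shlex
import subprocess


def build_export_command(model_dir, output_dir, task="stable-diffusion-xl"):
    """Return the optimum-cli argument list matching the command above."""
    return [
        "optimum-cli", "export", "onnx",
        "--model", str(model_dir),
        "--task", task,
        str(output_dir),
    ]


def run_export(model_dir, output_dir):
    """Run the export; raises CalledProcessError if optimum-cli exits non-zero."""
    cmd = build_export_command(model_dir, output_dir)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.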

Converting to TensorRT

stabilityai/stable-diffusion-xl-1.0-tensorrt

Project: https://huggingface.co/stabilityai/stable-diffusion-xl-1.0-tensorrt

TensorRT environment:

git clone https://github.com/rajeevsrao/TensorRT.git
cd TensorRT
git checkout release/9.2

Clone the stabilityai/stable-diffusion-xl-1.0-tensorrt project:

git lfs install 
git clone https://huggingface.co/stabilityai/stable-diffusion-xl-1.0-tensorrt
cd stable-diffusion-xl-1.0-tensorrt
git lfs pull
cd ..
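If `git lfs pull` did not complete, the `.onnx` files in the clone are small text stubs instead of real models. A minimal sketch for spotting that, assuming the standard Git LFS pointer format (a text file starting with a `version https://git-lfs.github.com/spec/v1` line); the helper names are hypothetical.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file is a Git LFS pointer stub rather than the real payload."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(repo_dir, pattern="*.onnx"):
    """List files under repo_dir matching pattern that are still pointer stubs."""
    return sorted(
        p for p in Path(repo_dir).rglob(pattern)
        if p.is_file() and is_lfs_pointer(p)
    )
```

An empty result from `find_lfs_pointers("stable-diffusion-xl-1.0-tensorrt")` would indicate that the large files were actually downloaded.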

Enter the container:

docker run --gpus all -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.11-py3 /bin/bash

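This is a standard docker run invocation with a few extra options:

--ipc=host: the container uses the host's inter-process communication (IPC) namespace. This avoids shared-memory problems by letting the container use the host's shared memory segments.
--ulimit memlock=-1: removes the limit on locked memory (-1 means unlimited). PyTorch may need this because it can lock memory to improve performance.
--ulimit stack=67108864: sets the stack size limit to 64 MiB (67108864 bytes), in case code in the container needs a larger stack.

Together, these options make sure the container has enough shared memory, locked memory, and stack space to run PyTorch.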
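To make the numbers concrete, here is a small sketch that rebuilds the same command from named pieces and shows where 67108864 comes from. The function name and the `/path/to/project` mount source are placeholders, not part of the original setup.

```python
import shlex

# 64 MiB expressed in bytes: the value passed to --ulimit stack.
STACK_LIMIT_BYTES = 64 * 1024 * 1024


def docker_run_command(image, workdir):
    """Return the docker run argument list with the options discussed above."""
    return [
        "docker", "run", "--gpus", "all", "-it",
        "--ipc=host",                              # share the host IPC namespace
        "--ulimit", "memlock=-1",                  # no limit on locked memory
        "--ulimit", f"stack={STACK_LIMIT_BYTES}",  # 64 MiB stack limit
        "-v", f"{workdir}:/workspace",             # mount the project directory
        image, "/bin/bash",
    ]


print(shlex.join(docker_run_command("nvcr.io/nvidia/pytorch:23.11-py3", "/path/to/project")))
```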

Install the environment:

cd demo/Diffusion
python3 -m pip install --upgrade pip
pip3 install -r requirements.txt
python3 -m pip install --pre --upgrade --extra-index-url https://pypi.nvidia.com tensorrt

Version before the upgrade:
tensorrt 8.6.1

# pip list
Package                   Version
------------------------- --------------------
absl-py                   2.0.0
accelerate                0.26.1
aiohttp                   3.8.6
aiosignal                 1.3.1
annotated-types           0.6.0
apex                      0.1
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
asttokens                 2.4.1
astunparse                1.6.3
async-timeout             4.0.3
attrs                     23.1.0
audioread                 3.0.1
beautifulsoup4            4.12.2
bleach                    6.1.0
blis                      0.7.11
cachetools                5.3.2
catalogue                 2.0.10
certifi                   2023.11.17
cffi                      1.16.0
charset-normalizer        3.3.1
click                     8.1.7
cloudpathlib              0.16.0
cloudpickle               3.0.0
cmake                     3.27.7
colored                   2.2.4
coloredlogs               15.0.1
comm                      0.2.0
confection                0.1.3
contourpy                 1.2.0
controlnet-aux            0.0.6
cubinlinker               0.3.0+2.g711d153
cuda-python               12.3.0rc4+8.ge6f99b5
cudf                      23.10.0
cugraph                   23.10.0
cugraph-dgl               23.10.0
cugraph-service-client    23.10.0
cugraph-service-server    23.10.0
cuml                      23.10.0
cupy-cuda12x              12.2.0
cycler                    0.12.1
cymem                     2.0.8
Cython                    3.0.5
dask                      2023.9.2
dask-cuda                 23.10.0
dask-cudf                 23.10.0
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
diffusers                 0.23.1
distributed               2023.9.2
dm-tree                   0.1.8
einops                    0.7.0
exceptiongroup            1.1.3
execnet                   2.0.2
executing                 2.0.1
expecttest                0.1.3
fastjsonschema            2.19.0
fastrlock                 0.8.2
filelock                  3.13.1
flash-attn                2.0.4
flatbuffers               23.5.26
fonttools                 4.45.0
frozenlist                1.4.0
fsspec                    2023.10.0
ftfy                      6.1.3
gast                      0.5.4
google-auth               2.23.4
google-auth-oauthlib      0.4.6
graphsurgeon              0.4.6
grpcio                    1.59.3
huggingface-hub           0.20.3
humanfriendly             10.0
hypothesis                5.35.1
idna                      3.4
imageio                   2.33.1
importlib-metadata        6.8.0
iniconfig                 2.0.0
intel-openmp              2021.4.0
ipykernel                 6.26.0
ipython                   8.17.2
ipython-genutils          0.2.0
jedi                      0.19.1
Jinja2                    3.1.2
joblib                    1.3.2
json5                     0.9.14
jsonschema                4.20.0
jsonschema-specifications 2023.11.1
jupyter_client            8.6.0
jupyter_core              5.5.0
jupyter-tensorboard       0.2.0
jupyterlab                2.3.2
jupyterlab-pygments       0.2.2
jupyterlab-server         1.2.0
jupytext                  1.15.2
kiwisolver                1.4.5
langcodes                 3.3.0
lazy_loader               0.3
librosa                   0.10.1
llvmlite                  0.40.1
locket                    1.0.0
Markdown                  3.5.1
markdown-it-py            3.0.0
MarkupSafe                2.1.3
matplotlib                3.8.2
matplotlib-inline         0.1.6
mdit-py-plugins           0.4.0
mdurl                     0.1.2
mistune                   3.0.2
mkl                       2021.1.1
mkl-devel                 2021.1.1
mkl-include               2021.1.1
mock                      5.1.0
mpmath                    1.3.0
msgpack                   1.0.7
multidict                 6.0.4
murmurhash                1.0.10
nbclient                  0.9.0
nbconvert                 7.11.0
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.2.1
ninja                     1.11.1.1
notebook                  6.4.10
numba                     0.57.1+1.gc2aae5dd0
numpy                     1.24.4
nvfuser                   0.0.21+gitunknown
nvidia-dali-cuda120       1.31.0
nvidia-pyindex            1.0.9
nvtx                      0.2.5
oauthlib                  3.2.2
onnx                      1.14.0
onnx-graphsurgeon         0.3.27
onnxruntime               1.15.1
opencv                    4.7.0
opencv-python             4.8.0.74
optree                    0.10.0
packaging                 23.2
pandas                    1.5.3
pandocfilters             1.5.0
parso                     0.8.3
partd                     1.4.1
pexpect                   4.8.0
Pillow                    9.2.0
pip                       23.3.2
platformdirs              4.0.0
pluggy                    1.3.0
ply                       3.11
polygraphy                0.49.1
pooch                     1.8.0
preshed                   3.0.9
prettytable               3.9.0
prometheus-client         0.18.0
prompt-toolkit            3.0.41
protobuf                  4.24.4
psutil                    5.9.4
ptxcompiler               0.8.1+2.g4c26c4c
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   12.0.1
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pybind11                  2.11.1
pybind11-global           2.11.1
pycocotools               2.0+nv0.8.0
pycparser                 2.21
pydantic                  2.5.1
pydantic_core             2.14.3
Pygments                  2.17.1
pylibcugraph              23.10.0
pylibcugraphops           23.10.0
pylibraft                 23.10.0
pynvml                    11.4.1
pyparsing                 3.1.1
pytest                    7.4.3
pytest-flakefinder        1.1.0
pytest-rerunfailures      12.0
pytest-shard              0.1.2
pytest-xdist              3.4.0
python-dateutil           2.8.2
python-hostlist           1.23.0
pytorch-quantization      2.1.2
pytz                      2023.3.post1
PyYAML                    6.0.1
pyzmq                     25.1.1
raft-dask                 23.10.0
referencing               0.31.0
regex                     2023.10.3
requests                  2.31.0
requests-oauthlib         1.3.1
rmm                       23.10.0
rpds-py                   0.13.1
rsa                       4.9
safetensors               0.4.2
scikit-image              0.22.0
scikit-learn              1.2.0
scipy                     1.11.3
Send2Trash                1.8.2
setuptools                68.2.2
six                       1.16.0
smart-open                6.4.0
sortedcontainers          2.4.0
soundfile                 0.12.1
soupsieve                 2.5
soxr                      0.3.7
spacy                     3.7.2
spacy-legacy              3.0.12
spacy-loggers             1.0.5
sphinx-glpi-theme         0.4.1
srsly                     2.4.8
stack-data                0.6.3
sympy                     1.12
tabulate                  0.9.0
tbb                       2021.11.0
tblib                     3.0.0
tensorboard               2.9.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorrt                  8.6.1
terminado                 0.18.0
thinc                     8.2.1
threadpoolctl             3.2.0
thriftpy2                 0.4.17
tifffile                  2023.12.9
timm                      0.9.12
tinycss2                  1.2.1
tokenizers                0.13.3
toml                      0.10.2
tomli                     2.0.1
toolz                     0.12.0
torch                     2.2.0a0+6a974be
torch-tensorrt            2.2.0a0
torchdata                 0.7.0a0
torchtext                 0.16.0a0
torchvision               0.17.0a0
tornado                   6.3.3
tqdm                      4.66.1
traitlets                 5.9.0
transformer-engine        1.0.0+66d91d5
transformers              4.31.0
treelite                  3.9.1
treelite-runtime          3.9.1
triton                    2.1.0+6e4932c
typer                     0.9.0
types-dataclasses         0.6.6
typing_extensions         4.8.0
ucx-py                    0.34.0
uff                       0.6.9
urllib3                   1.26.18
wasabi                    1.1.2
wcwidth                   0.2.13
weasel                    0.3.4
webencodings              0.5.1
Werkzeug                  3.0.1
wheel                     0.41.3
xdoctest                  1.0.2
xgboost                   1.7.6
yarl                      1.9.2
zict                      3.0.0
zipp                      3.17.0

https://pypi.org/project/tensorrt/#history
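Rather than scanning the whole `pip list` output, the installed TensorRT version can be queried directly. This is a sketch using only the standard library; the `(9, 0)` threshold is an assumption based on the `release/9.2` branch checked out earlier, not a documented requirement.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Turn '8.6.1' into (8, 6); assumes the first two components are numeric."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_upgrade(version, minimum=(9, 0)):
    """True if the package is missing or older than the assumed minimum."""
    return version is None or major_minor(version) < minimum


print("tensorrt:", installed_version("tensorrt"))
```

With the container's stock `tensorrt 8.6.1`, `needs_upgrade("8.6.1")` is true, which matches the need for the `pip install --pre --upgrade` step.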


Run SDXL inference:

python3 demo_txt2img_xl.py   "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"   --build-static-batch   --use-cuda-graph   --num-warmup-runs 1   --width 1024   --height 1024   --denoising-steps 30  --version=xl-1.0   --onnx-dir /workspace/stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-base   --onnx-refiner-dir /workspace/stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-refiner

python3 demo_txt2img_xl.py   "a girl, 8k"   --build-static-batch   --use-cuda-graph   --num-warmup-runs 1   --width 1024   --height 1024   --denoising-steps 30  --version=xl-1.0   --onnx-dir /workspace/juggernautXL_version6Rundiffusion_onnx_optinmum

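  -h, --help            Show this help message and exit
  --negative-prompt [NEGATIVE_PROMPT ...]
                        Negative prompt(s) used to guide image generation
  --batch-size {1,2,4}  Batch size (the prompt is repeated)
  --batch-count BATCH_COUNT
                        Number of images to generate in sequence, one at a time
  --denoising-steps DENOISING_STEPS
                        Number of denoising steps
  --scheduler {DDIM,DDPM,EulerA,Euler,LCM,LMSD,PNDM,UniPC}
                        Scheduler for the diffusion process
  --lora-scale LORA_SCALE [LORA_SCALE ...]
                        Scale of the LoRA weights, default 1 (must be between 0 and 1)
  --lora-path LORA_PATH [LORA_PATH ...]
                        Path to the LoRA adapter, e.g. 'latent-consistency/lcm-lora-sdv1-5'
  --onnx-opset {7,8,9,10,11,12,13,14,15,16,17,18}
                        ONNX opset version used to export the models
  --onnx-dir ONNX_DIR   Output directory for the ONNX export
  --framework-model-dir FRAMEWORK_MODEL_DIR
                        Directory for models saved by HF
  --engine-dir ENGINE_DIR
                        Output directory for the TensorRT engines
  --build-static-batch  Build TensorRT engines with a fixed batch size
  --build-dynamic-shape
                        Build TensorRT engines with dynamic image shapes
  --build-enable-refit  Enable the refit option for TensorRT engines during the build
  --build-all-tactics   Build TensorRT engines using all tactic sources
  --timing-cache TIMING_CACHE
                        Path to a precached timing measurements file, used to speed up the build
  --use-cuda-graph      Enable CUDA graphs
  --nvtx-profile        Enable NVTX markers for performance profiling
  --torch-inference TORCH_INFERENCE
                        Run inference with PyTorch (using the specified compilation mode) instead of TensorRT
  --seed SEED           Seed for the random generator, to get consistent results
  --output-dir OUTPUT_DIR
                        Output directory for logs and image artifacts
  --hf-token HF_TOKEN   HuggingFace API access token for downloading model checkpoints
  -v, --verbose         Show verbose output
  --version {xl-1.0,xl-turbo}
                        Version of Stable Diffusion XL
  --height HEIGHT       Height of the image to generate (must be a multiple of 8)
  --width WIDTH         Width of the image to generate (must be a multiple of 8)
  --num-warmup-runs NUM_WARMUP_RUNS
                        Number of warmup runs before benchmarking performance
  --guidance-scale GUIDANCE_SCALE
                        Value of the classifier-free guidance scale (must be greater than 1)
  --enable-refiner      Enable the SDXL-Refiner model
  --image-strength IMAGE_STRENGTH
                        Strength of the transformation applied to input_image (must be between 0 and 1)
  --onnx-refiner-dir ONNX_REFINER_DIR
                        Directory for the SDXL-Refiner ONNX models
  --engine-refiner-dir ENGINE_REFINER_DIR
                        Directory for the SDXL-Refiner TensorRT engines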
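Several of these options carry constraints that are cheap to check before starting a long engine build. The sketch below encodes only the limits stated in the help text (multiples of 8, batch size in {1, 2, 4}, LoRA scale between 0 and 1). The function is hypothetical and not part of the demo; it leaves out `--guidance-scale` because the LCM examples in this post pass 0.0.

```python
def check_demo_args(height, width, batch_size=1, lora_scales=()):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, value in (("height", height), ("width", width)):
        if value % 8 != 0:
            problems.append(f"{name}={value} is not a multiple of 8")
    if batch_size not in (1, 2, 4):
        problems.append(f"batch_size={batch_size} is not one of 1, 2, 4")
    for scale in lora_scales:
        if not 0 <= scale <= 1:
            problems.append(f"lora_scale={scale} is outside [0, 1]")
    return problems


print(check_demo_args(1024, 1024))
```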

The script's command-line argument parsing is sometimes unreliable. If that happens, edit the script and set the values directly in the code:

[Image: screenshot of the edited code]
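One way to set the values in code, assuming the script parses its options with `argparse` (the help output above has that format), is to pass an explicit argument list to `parse_args` instead of relying on the shell. The parser below is a stand-in with only a few of the options, not the demo's real parser.

```python
import argparse


def build_parser():
    # Stand-in parser with a handful of the options listed above.
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt", nargs="*")
    parser.add_argument("--width", type=int, default=1024)
    parser.add_argument("--height", type=int, default=1024)
    parser.add_argument("--denoising-steps", type=int, default=30)
    parser.add_argument("--onnx-dir", default="onnx")
    return parser


# Values fixed in code, so shell quoting can no longer interfere.
HARDCODED_ARGS = [
    "a girl, 8k",
    "--width", "1024",
    "--height", "1024",
    "--denoising-steps", "30",
    "--onnx-dir", "/workspace/juggernautXL_version6Rundiffusion_onnx_optinmum",
]

args = build_parser().parse_args(HARDCODED_ARGS)
print(args.prompt, args.width, args.height, args.denoising_steps)
```

Because the prompt is a single list element, the comma and space inside it are never re-split.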

Speed on an RTX 3090:
[Image: screenshot of the timing results]

SDXL-LCM

python3 demo_txt2img_xl.py \
  "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
  --version=xl-1.0 \
  --onnx-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm \
  --engine-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm/engine-sdxl-lcm-nocfg \
  --scheduler LCM \
  --denoising-steps 4 \
  --guidance-scale 0.0 \
  --seed 42

SDXL-LCMLORA

python3 demo_txt2img_xl.py \
  "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
  --version=xl-1.0 \
  --onnx-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcmlora \
  --engine-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm/engine-sdxl-lcmlora-nocfg \
  --scheduler LCM \
  --lora-path latent-consistency/lcm-lora-sdxl \
  --lora-scale 1.0 \
  --denoising-steps 4 \
  --guidance-scale 0.0 \
  --seed 42