大话AI绘画技术原理与算法优化

这篇具有很好参考价值的文章主要介绍了大话AI绘画技术原理与算法优化。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

引子

博主很长一段时间都没有发文，确实是在忙一些技术研究。

如标题所示，本篇博文主要把近段时间的研究工作做一个review。

看过各种相关技术的公关文章，林林总总，水分很多。

也确实没有多少人能把一些技术细节用一些比较通俗的语言阐述清楚。

故此，再一次冠以大话为题，对AI绘画主要是stable diffusion做一个技术梳理。

如何学习以及相关资源

相信很多朋友都想入门到这个技术领域捣腾捣腾，

而摆在眼前的确是一条臭水沟。

为什么这么说，让我们来看一些数据。

Hardware: 32 x 8 x A100 GPUs
Optimizer: AdamW
Gradient Accumulations: 2
Batch: 32 x 8 x 2 x 4 = 2048
Learning rate: warmup to 0.0001 for 10,000 steps and then kept constant

Hardware Type: A100 PCIe 40GB
Hours used: 150000
Cloud Provider: AWS
Compute Region: US-east
Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 11250 kg CO2 eq.

摘自：CompVis/stable-diffusion-v1-4 · Hugging Face

该模型是在亚马逊云计算服务上使用256个NVIDIA A100 GPU训练，共花费15万个GPU小时，成本为60万美元

摘自：Stable Diffusion - 维基百科，自由的百科全书 (wikipedia.org)

这个数据就是一个劝退警告，但是由于效果太过于“吓人”，所以飞蛾扑火，全世界都打起架来了。

当然，刚开始学习，就直接奔着最终章去，确实也不是很现实。

随着这个领域的爆火，各种资源爆炸式增长。

以下是博主给出的一部分参考资源，便于参阅。

相关整合资源:

heejkoo/Awesome-Diffusion-Models: A collection of resources and papers on Diffusion Models (github.com)

第三方:

Generative Deep Learning (keras.io)

huggingface/diffusers: 🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch (github.com)

AUTOMATIC1111/stable-diffusion-webui: Stable Diffusion web UI (github.com)

官方:

CompVis/stable-diffusion: A latent text-to-image diffusion model (github.com)

Stability-AI/stablediffusion: High-Resolution Image Synthesis with Latent Diffusion Models (github.com)

想要开箱式快速上手，建议把keras社区的这个生成花朵的玩具跑起来感受一下。

Denoising Diffusion Implicit Models (keras.io)

这个是一个非常简洁的实现，麻雀虽小五脏俱全，可以快速热身起来。

Stable Diffusion的基本原理

这里博主并不打算展开讲解过多，

只做一个大话概览阐述，便于快速入门了解。

若还有疑惑，大家可以参阅其他的资源。

了解原理，仅仅阅读代码肯定是远远不够的，

但是不阅读代码，你就不知道具体的实现细节。

官方实现，那个代码仓库真是一座屎山，乱七八糟的。

所以极佳的阅读版本是keras社区的实现。

keras-cv/keras_cv/models/stable_diffusion

stable_diffusion的主要组件如下:

1.文案编码器：

keras-cv/text_encoder.py

负责对输入的文字进行特征编码，用于引导 diffusion 模型进行内容生成。

2.潜在特征编码器：

keras-cv/image_encoder.py

将图片编码成潜在特征

3.潜在特征解码器：

keras-cv/decoder.py

将潜在特征解码成图片

4.diffusion 模型:

keras-cv/diffusion_model.py

由噪声和文字语义生成目标潜在特征。

用通俗的话来说编码器就是压缩，解码器就是解压，diffusion 模型就是编辑压缩的信息，而文案是引导编辑的方向。

5.辅助组件:

keras-cv/clip_tokenizer.py

针对文案输入的编码转换以及预处理。

简而言之就是将文字转换成数字。

keras-cv/noise_scheduler.py

噪声规划，主要用于训练和合成的加噪和去噪的比率计算。

整体的架构串起来就是stable_diffusion

keras-cv/stable_diffusion.py

整个使用的流程大概是这个样子的：

文字生成图片:

文字 -> clip_tokenizer -> text_encoder -> diffusion_model -> decoder

图片生成图片:

[[图片 -> image_encoder] + [文字 -> clip_tokenizer -> text_encoder]] -> diffusion_model -> decoder

这是主流的两种做法，还有通过mask进行内容修复，以及通过其他模块辅助生成的，例如生成小姐姐之类等等。

列完组件，

你就会发现涉及的技术并不简单。

说好的，要讲原理的。

一句话描述：这是一个信号压缩和解压的算法。

其中的信号是通过噪声扩散建的模，建模后符合正向和逆向的数学规律。

而在正向和逆向中都可以注入噪声或者改变噪声来达到信号引导重建。

临时开个小叉：

我们知道:

声音文件wav进行余弦变换就可以压缩成Mp3

位图文件bmp进行余弦变换就可以压缩成jpg

那么你说如果我们对余弦变换进行数学编辑，是不是可以直接编辑目标的mp3或者目标的jpg

你答对了，是可以的。

最经典的做法就是采用余弦变换对信号进行滤波，然后可以做到降噪，也就可以做到音频降噪或者图片降噪。

我们回到主题上来，基于噪声建模之后，有什么好处，这样做了之后故事变得更加有意思的。

只要编辑后的数据符合噪声分布，那理论上来说，我们可以采用正向或者逆向扩散的方式对任意时刻的信号进行编辑。

而Stable Diffusion 就是这样一个技术方案。

训练时候它将数据通过noise_scheduler进行diffusion_model正向扩散分治压缩，

生成使用的时候通过diffusion_model进行逆向扩散编辑解压，

最终达到可以生成任意信号组合的图片。

我知道你有疑问了，那是不是在生成的时候可以在任意时候插入信息，或者是正向扩散和逆向扩散混合着来。

没错，完全可以的。

图片生成图片，就是在逆向扩散过程中插入一个我们预设的图片信息节点与噪声进行混合，然后通过文字语义引导合成的过程。

知道了技术原理，我相信你们跟我一样会有一个非常大胆的假设。

假设把全世界的电脑连接起来，然后采用扩散分治，那完全可以让世界上任何一个地区任何一个人的一台机器，帮你把某个节点的扩散编辑给算了。

那这个事情就恐怖了，全世界都是p2p 算力共享了，人工智能的大时代就到来了。

就问一个问题，现在上船还来得及吗？

博主也不知道，只是这个事已经是铁板钉钉了，趋势不可逆。

关键算法以及相关优化

前面讲到技术原理，但是到底其中关键算法是什么？

答案是：跨模态注意力。

一切都由谷歌的一篇论文开始。

[1706.03762] Attention Is All You Need (arxiv.org)

在这篇论文出现后，百花齐放，各种attention, 各种transformer层出不穷。

transformer架构的提出，给出不同维度数据建模的可行性。

关于这方面的发展和资源，博主只给出如下资源。

一个值得关注的技术github:

lucidrains (Phil Wang) · GitHub

lucidrains是一个非常勤奋且高产的人，几乎所有第三方的transformer实现都有他的影子，当然也包括stable_diffusion。

在stable_diffusion中也是因为使用transformer所以带来了严重的计算资源的消耗。

针对stable_diffusion 的算法优化，各大厂也是勤奋的。

附相关信息:

英特尔

Accelerating Stable Diffusion Inference on Intel CPUs (huggingface.co)

高通

World’s first on-device demonstration of Stable Diffusion on an Android phone | Qualcomm

苹果

Stable Diffusion with Core ML on Apple Silicon - Apple Machine Learning Research

谷歌

[2304.11267] Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations (arxiv.org)

当然我们这里讲的优化指的是使用阶段的优化，并不是训练阶段的优化。

这是完全不同的两件事，而现在国内泛指的stable_diffusion训练也不是指的从零训练，

而是在stable_diffusion的基础上进行二次训练微调。

毕竟训练成本摆在那里，没有几家公司会这么豪气去复现，包括国内某些大厂。

训练优化展开一时半会也讲不完，所以博主着重讲一下，使用阶段的优化。

参考信息来自:

Optimizations · AUTOMATIC1111/stable-diffusion-webui Wiki (github.com)

commandline argument	explanation
`--opt-sdp-attention`	Faster speeds than using xformers, only available for user who manually install torch 2.0 to their venv. (non-deterministic)
`--opt-sdp-no-mem-attention`	Faster speeds than using xformers, only available for user who manually install torch 2.0 to their venv. (deterministic, slight slower than `--opt-sdp-attention`)
`--xformers`	Use xformers library. Great improvement to memory consumption and speed. Will only be enabled on small subset of configuration because that's what we have binaries for. Documentation
`--force-enable-xformers`	Enables xformers above regardless of whether the program thinks you can run it or not. Do not report bugs you get running this.
`--opt-split-attention`	Cross attention layer optimization significantly reducing memory use for almost no cost (some report improved performance with it). Black magic. On by default for `torch.cuda`, which includes both NVidia and AMD cards.
`--disable-opt-split-attention`	Disables the optimization above.
`--opt-sub-quad-attention`	Sub-quadratic attention, a memory efficient Cross Attention layer optimization that can significantly reduce required memory, sometimes at a slight performance cost. Recommended if getting poor performance or failed generations with a hardware/software configuration that xformers doesn't work for. On macOS, this will also allow for generation of larger images.
`--opt-split-attention-v1`	Uses an older version of the optimization above that is not as memory hungry (it will use less VRAM, but will be more limiting in the maximum size of pictures you can make).
`--medvram`	Makes the Stable Diffusion model consume less VRAM by splitting it into three parts - cond (for transforming text into numerical representation), first_stage (for converting a picture into latent space and back), and unet (for actual denoising of latent space) and making it so that only one is in VRAM at all times, sending others to CPU RAM. Lowers performance, but only by a bit - except if live previews are enabled.
`--lowvram`	An even more thorough optimization of the above, splitting unet into many modules, and only one module is kept in VRAM. Devastating for performance.
`*do-not-batch-cond-uncond`	Prevents batching of positive and negative prompts during sampling, which essentially lets you run at 0.5 batch size, saving a lot of memory. Decreases performance. Not a command line option, but an optimization implicitly enabled by using `--medvram` or `--lowvram`.
`--always-batch-cond-uncond`	Disables the optimization above. Only makes sense together with `--medvram` or `--lowvram`
`--opt-channelslast`	Changes torch memory type for stable diffusion to channels last. Effects not closely studied.
`--upcast-sampling`	For Nvidia and AMD cards normally forced to run with `--no-half`, should improve generation speed.