Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models

01 What are the shortcomings of existing work?

  • Since diffusion models (DMs) typically operate directly in pixel space, optimizing powerful DMs often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations.

02 What problem is addressed?

  • Training DMs in the latent space of pretrained autoencoders reaches a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.
  • LDMs achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

03 What are the keys to the solutions?

  • First, we train an autoencoder that provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space.
  • By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis becomes possible in a convolutional manner.
  • For conditioning, we design an architecture that connects transformers to the DM's UNet backbone [69] and enables arbitrary types of token-based conditioning mechanisms.

04 What are the main contributions?

  • In contrast to purely transformer-based approaches [23, 64], our method scales more gracefully to higher-dimensional data and can thus (a) work on a compression level that provides more faithful and detailed reconstructions than previous work and (b) be efficiently applied to high-resolution synthesis of megapixel images.
  • We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution) and datasets while significantly lowering computational costs.
  • Our approach does not require a delicate weighting of reconstruction and generative abilities. This ensures extremely faithful reconstructions and requires very little regularization of the latent space.
  • We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of $\sim 1024^2$ px.
  • We design a general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training. We use it to train class-conditional, text-to-image and layout-to-image models.
  • We release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latent-diffusion, which might be reusable for various tasks besides training of DMs.

05 Related work

  • Generative Models for Image Synthesis
  • Diffusion Probabilistic Models (DM)
  • Two-Stage Image Synthesis

06 Method descriptions

We utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity.

Such an approach offers several advantages:

  • By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space.
  • We exploit the inductive bias of DMs inherited from their UNet architecture [69], which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches [23, 64].
  • Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis [25].

Perceptual Image Compression

Given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$, where $z \in \mathbb{R}^{h \times w \times c}$.

Latent Diffusion Models

The corresponding objective in DM can be simplified to:

$$L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \right]$$

Compared to the high-dimensional pixel space, latent space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.

Since the forward process is fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
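
To make the training signal concrete, here is a minimal pure-Python sketch of the closed-form forward noising of a latent and the noise-prediction loss. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the function names, and the `predict_noise` callable (a stand-in for the UNet $\epsilon_\theta$) are all hypothetical, and the loss is averaged over elements rather than summed.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (an assumption)."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_latent(z0, eps, alpha_bar):
    """Closed-form forward process: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]

def denoising_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

def training_step(z0, predict_noise, alpha_bars, rng):
    """One Monte Carlo sample of the objective for a single flattened latent z0."""
    t = rng.randrange(len(alpha_bars))            # uniformly sampled timestep
    eps = [rng.gauss(0.0, 1.0) for _ in z0]       # eps ~ N(0, 1)
    z_t = noise_latent(z0, eps, alpha_bars[t])    # no iterative simulation needed
    return denoising_loss(eps, predict_noise(z_t, t))
```

In a real pipeline `z0` would be the encoder output $\mathcal{E}(x)$, computed once per image; only the noising and the noise prediction happen in latent space.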

Conditioning Mechanisms

We turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism [94], which is effective for learning attention-based models of various input modalities [34,35].

To pre-process $y$ from various modalities (such as language prompts) we introduce a domain-specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$, with:

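$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y)$$

Here $\varphi_i(z_t)$ denotes a (flattened) intermediate representation of the UNet implementing $\epsilon_\theta$, and $W_Q^{(i)}$, $W_K^{(i)}$, $W_V^{(i)}$ are learnable projection matrices.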

See Fig. 3 for a visual depiction.
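
The cross-attention layer can be sketched in a few lines of pure Python. This is a toy illustration, not the paper's code: it assumes queries are projected from the UNet's flattened feature tokens and keys and values from the conditioning tokens $\tau_\theta(y)$, uses a row-vector convention (tokens times weight matrix), and all names and the tiny hand-picked matrices are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(features, context, w_q, w_k, w_v):
    """features: N x d_feat (UNet tokens); context: M x d_ctx (conditioning tokens).
    Returns (output of shape N x d, attention weights of shape N x M)."""
    q = matmul(features, w_q)            # N x d, queries from the image side
    k = matmul(context, w_k)             # M x d, keys from the conditioning
    v = matmul(context, w_v)             # M x d, values from the conditioning
    d = len(q[0])
    scores = matmul(q, transpose(k))     # N x M similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy example: N = 3 feature tokens, M = 2 conditioning tokens, d = 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
out, weights = cross_attention(features, context, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the conditioning tokens, so every spatial position of the UNet decides for itself which parts of the prompt to read from. The number of conditioning tokens M can differ freely from the number of feature tokens N, which is what makes the mechanism modality-agnostic.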

Based on image-conditioning pairs, we then learn the conditional LDM via:

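$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \right]$$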

This conditioning mechanism is flexible as $\tau_\theta$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers [94] when $y$ are text prompts.

07 Results and Comparisons

Experimental findings:

  • LDMs trained in VQ-regularized latent spaces achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first-stage models slightly fall behind those of their continuous counterparts, cf. Tab. 8. Therefore, we evaluate VQ-regularized LDMs in the remainder of the paper, unless stated otherwise.

Image Generation with Latent Diffusion

  • On CelebA-HQ, we report a new state-of-the-art FID of 5.11, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM [93] where a latent diffusion model is trained jointly together with the first stage.
  • We outperform prior diffusion-based approaches on all but the LSUN-Bedrooms dataset, where our score is close to ADM [15], despite using half its parameters and requiring 4 times fewer training resources.
  • LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches.

In Fig. 4 we also show qualitative results on each dataset.

Conditional Latent Diffusion

Transformer Encoders for LDMs

For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO [51] validation set, where our model improves upon powerful AR [17, 66] and GAN-based [109] methods, cf. Tab. 2.

Convolutional Sampling Beyond $256^2$

Super-Resolution with Latent Diffusion

Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance: LDM-SR outperforms SR3 in FID, while SR3 has a better IS.

Further, we conduct a user study comparing the pixel-baseline with LDM-SR. The results in Tab. 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15] and we implement this image-based guider via a perceptual loss.

Inpainting with Latent Diffusion

In particular, we compare the inpainting efficiency of LDM-1 (i.e. a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQLDM-4 without any attention in the first stage (see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions.

Tab. 6 reports the training and sampling throughput at resolution $256^2$ and $512^2$, the total training time in hours per epoch and the FID score on the validation split after six epochs.

The comparison with other inpainting approaches in Tab. 7 shows that our model with attention improves the overall image quality, as measured by FID, over that of [88].

08 Ablation studies

On Perceptual Compression Tradeoffs

This section analyzes the behavior of our LDMs with different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$.
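
As a quick piece of arithmetic, the sketch below lists the latent size each downsampling factor would produce for a 256 x 256 input, taking $f = H/h = W/w$ as in the paper. The helper names and the choice of 4 latent channels are hypothetical and only for illustration; the actual channel counts of the first-stage models are those reported in Tab. 8.

```python
def latent_shape(height, width, f, channels):
    """Shape of z = E(x) when the encoder downsamples each side by a factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by f")
    return (height // f, width // f, channels)

def spatial_reduction(f):
    """Factor by which the number of spatial positions shrinks."""
    return f * f

for f in (1, 2, 4, 8, 16, 32):
    h, w, c = latent_shape(256, 256, f, channels=4)  # channels=4 is an assumed value
    print(f"LDM-{f}: latent {h}x{w}x{c}, {spatial_reduction(f)}x fewer positions")
```

The quadratic growth of the reduction is the point: LDM-1 is a pixel-space model with no savings, while LDM-32 leaves only an 8 x 8 grid, so very large factors leave the diffusion model little spatial detail to work with.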

Tab. 8 shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section.

Fig. 6 shows sample quality as a function of training progress for 2M steps of class-conditional models on the ImageNet [12] dataset.

In Fig. 7, we compare models trained on CelebA-HQ [39] and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler [84] and plot it against FID scores [29].
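
For readers unfamiliar with why the number of denoising steps is a free knob, the following is a rough pure-Python sketch of deterministic DDIM sampling (the $\eta = 0$ case) over a subsampled set of timesteps. It is not the authors' sampler: the evenly spaced timestep selection, the function names, and the `predict_noise` callable standing in for $\epsilon_\theta$ are assumptions made for illustration.

```python
import math

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flattened latent."""
    sqrt_a, sqrt_1ma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean latent implied by the current noise prediction.
    z0_pred = [(z - sqrt_1ma * e) / sqrt_a for z, e in zip(z_t, eps_pred)]
    sqrt_ap, sqrt_1map = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    # Re-noise that estimate to the (lower) noise level of the previous timestep.
    return [sqrt_ap * z0 + sqrt_1map * e for z0, e in zip(z0_pred, eps_pred)]

def subsample_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced timesteps, highest first (one common choice, assumed here)."""
    if not 0 < num_sample_steps <= num_train_steps:
        raise ValueError("need 0 < num_sample_steps <= num_train_steps")
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[::-1][:num_sample_steps]

def ddim_sample(z_T, predict_noise, alpha_bars, num_sample_steps):
    """Run the sampler; the network is evaluated once per selected timestep."""
    timesteps = subsample_timesteps(len(alpha_bars), num_sample_steps)
    z = list(z_T)
    for i, t in enumerate(timesteps):
        is_last = i + 1 == len(timesteps)
        alpha_bar_prev = 1.0 if is_last else alpha_bars[timesteps[i + 1]]
        z = ddim_step(z, predict_noise(z, t), alpha_bars[t], alpha_bar_prev)
    return z
```

Sampling cost is one network evaluation per selected timestep, so halving `num_sample_steps` roughly halves sampling time; the quality cost of doing so is what the FID axis in such a comparison measures.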

Complex datasets such as ImageNet require reduced compression rates to avoid reducing quality. In summary, LDM-4 and -8 offer the best conditions for achieving high-quality synthesis results.

09 How can this work be improved?

  • While LDMs significantly reduce computational requirements compared to pixel-based approaches, their sequential sampling process is still slower than that of GANs.
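  • The use of LDMs can be questionable when high precision is required: the reconstruction capability of the autoencoder can become a bottleneck for tasks that need fine-grained accuracy in pixel space.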

10 Conclusions

  • We have presented latent diffusion models, a simple and efficient way to significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality.
  • Based on this and our cross-attention conditioning mechanism, our experiments could demonstrate favorable results compared to state-of-the-art methods across a wide range of conditional image synthesis tasks without task-specific architectures.
