Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE

这篇具有很好参考价值的文章主要介绍了Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE。希望对大家有所帮助。如果存在错误或未考虑完全的地方,请大家不吝赐教,您也可以点击"举报违法"按钮提交疑问。

Diffusers

A library that offers an implementation of various diffusion models, including text-to-image models.

提供不同扩散模型的实现的库,代码上最简洁,国内的问题是 huggingface 需要翻墙。

Transformers

A Hugging Face library that provides pre-trained deep learning models for natural language processing tasks.

提供了预训练深度学习模型,

Accelerate

This library, also from Hugging Face, simplifies the execution of deep learning models on multiple devices, such as multiple CPUs, GPUs, or even TPUs.

加速库,可以针对不同硬件CPUs, GPUs,TPUs 加快执行模型速度

Invisible_watermark

A package that allows embedding invisible watermarks in images. It is not used directly in the code shown, but could be useful for marking generated images.

不可见水印,可以给生成的图片加水印

Mediapy

A library that allows you to display and manipulate images and videos in a Jupyter notebook.

Pipelines

Pipelines provide a simple way to run state-of-the-art diffusion models in inference. Most diffusion systems consist of multiple independently-trained models and highly adaptable scheduler components - all of which are needed to have a functioning end-to-end diffusion system.

列如, Stable Diffusion 由3个独立的预训练模型组成

  • Conditional Unet
  • CLIP text encoder
  • a scheduler component, scheduler,
  • a CLIPFeatureExtractor,
  • as well as a safety checker. All of these components are necessary to run stable diffusion in inference even though they were trained or created independently from each other.

Stable diffusion using Hugging Face

最简单的调用

from diffusers import StableDiffusionPipeline


pipe = StableDiffusionPipeline.from_pretrained('CompVis/stable-diffusion-v1-4').to('cuda')

# Initialize a prompt
prompt = "a dog wearing hat"

# Pass the prompt in the pipeline
pipe(prompt).images[0]

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

理解核心模块

上面的文生图流程就是使用的扩散模型(diffusion models), Stable diffusion 模型是潜扩散模型(Latent Diffusion Model, LDM)。具体概念参考:深入浅出讲解Stable Diffusion原理,新手也能看明白 - 知乎

在latent diffusion里有3个重要的部分

  1. A text encoder 文本编码器, in this case, a CLIP Text encoder
  2. An autoencoder, in this case, a 变分自编码器(Variational Auto Encoder)也叫 VAE
  3. A U-Net

CLIP Text Encoder

概念

CLIP(Contrastive Language–Image Pre-training) 基于对比学习的语言-图像预训练,它将文本作为输入,并将输出的结果向量存储在 embedding 属性中。CLIP 模型可以把图像和文本,嵌入到相同的潜在特征空间 (latent space)。

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

任何机器模型都无法识别自然语言,需要将自然语言转换成一堆它能理解的数字,也叫embeddings,这个转换的过程可以分为2步

1. Tokenizer - 将文字(字词)切割,并使用lookup表来转换成数字
2. Token_To_Embedding Encoder - Converting those numerical sub-words into a representation that contains the representation of that text

代码

import torch, logging
## disable warnings
logging.disable(logging.WARNING)  
## Import the CLIP artifacts 
from transformers import CLIPTextModel, CLIPTokenizer
## Initiating tokenizer and encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")

prompt = ["a dog wearing hat"]
tok =tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt") 
print(tok.input_ids.shape)
tok

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

tokenizer 返回字典里有2个对象
1. input_ids -

A tensor of size 1x77 as one prompt was passed and padded to 77 max length. 49406 表示起始 token, 320 是 “a”, 1929 是 dog, 3309 是 wearing, 3801 是 hat, 49407 is the end of text token repeated till the pad length of 77.
2. attention_mask - 1 representing an embedded value and 0 representing padding.

for token in list(tok.input_ids[0,:7]): 
    print(f"{token}:{tokenizer.convert_ids_to_tokens(int(token))}")

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

接着看Token_To_Embedding Encoder,它将 input_ids 转换成 embeddings

emb = text_encoder(tok.input_ids.to("cuda"))[0].half()
print(f"Shape of embedding : {emb.shape}")
emb

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

从中可以看出, 每个1x77 的 tokenized 输入被转换成 1x77x768 纬度的 embedding. 由此可见,每个输入的单词被转换成 768-dimensional 空间.

在Stable diffusion pipeline的表现

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

Stable diffusion 使用CLIP trained encoder 转换输入的文字,它成为U-net.的一个输入源。从另外一个方面来说,CLIP使用图片encoder和文字encoder,生成了在 latent space里相似的embeddings,这种相似更精确的定义是Contrastive objective。

VAE — Variational Auto Encoder变分自编码器

概念

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

autoencoder 包含2个部分
1. Encoder takes an image as input and converts it into a low dimensional latent representation
2. Decoder takes the latent representation and converts it back into an image

从图中可见,Encoder像粉碎机直接将图粉碎成几个碎片,decoder 又从碎片整合出原图

代码

## To import an image from a URL 
from fastdownload import FastDownload  
## Imaging  library 
from PIL import Image 
from torchvision import transforms as tfms  
## Basic libraries 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline  
## Loading a VAE model 
from diffusers import AutoencoderKL 
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", torch_dtype=torch.float16).to("cuda")
def load_image(p):
   '''     
   Function to load images from a defined path     
   '''    
    return Image.open(p).convert('RGB').resize((512,512))
def pil_to_latents(image):
    '''     
    Function to convert image to latents     
    '''     
    init_image = tfms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0   
    init_image = init_image.to(device="cuda", dtype=torch.float16)
    init_latent_dist = vae.encode(init_image).latent_dist.sample() * 0.18215     
    return init_latent_dist  
def latents_to_pil(latents):     
    '''     
    Function to convert latents to images     
    '''     
    latents = (1 / 0.18215) * latents     
    with torch.no_grad():         
        image = vae.decode(latents).sample     
    
    image = (image / 2 + 0.5).clamp(0, 1)     
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()      
    images = (image * 255).round().astype("uint8")     
    pil_images = [Image.fromarray(image) for image in images]        
    return pil_images
p = FastDownload().download('https://lafeber.com/pet-birds/wp-content/uploads/2018/06/Scarlet-Macaw-2.jpg')
img = load_image(p)
print(f"Dimension of this image: {np.array(img).shape}")
img

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

开始使用 VAE encoder 压缩图片

latent_img = pil_to_latents(img)
print(f"Dimension of this latent representation: {latent_img.shape}")

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

我们可以看到VAE 压缩一个 3 x 512 x 512 纬度的图片到 4 x 64 x 64 图片,压缩比例有48x,可以看看4通道的latent表现

fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for c in range(4):
    axs[c].imshow(latent_img[0][c].detach().cpu(), cmap='Greys')

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

理论上从这四张图中能得到原图的很多信息,接着我们用 decoder来往回解压缩。

decoded_img = latents_to_pil(latent_img)
decoded_img[0]

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

从中我们可以看出VAE decoder 可以从48x compressed latent representation 还原原图。

注意2张图里的眼镜,其实有细微差别,整个流程不是无损的

在Stable diffusion pipeline 里扮演的角色

没有 VAE 加入, Stable diffusion 也能完整使用,使用VAE能减少生成高清图的计算量。 The latent diffusion models can perform diffusion in this latent space produced by the VAE encoder and once we have our desired latent outputs produced by the diffusion process, we can convert them back to the high-resolution image by using the VAE decoder. To get a better intuitive understanding of Variation Autoencoders and how they are trained, read this blog by Irhum Shafkat.

U-Net model

概念

U-Net model 有2个输入
1. Noisy latent or Noise- Noisy latents are latents produced by a VAE encoder (in case an initial image is provided) with added noise or it can take pure noise input in case we want to create a random new image based solely on a textual description
2. Text embeddings - CLIP-based embedding generated by input textual prompts

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

U-Net model 的输出是可预测的 noise residual which the input noisy latent contains. In other words, it predicts the noise which is subtracted from the noisy latents to return the original de-noised latents.

代码

from diffusers import UNet2DConditionModel, LMSDiscreteScheduler
## Initializing a scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
## Setting number of sampling steps
scheduler.set_timesteps(51)
## Initializing the U-Net model
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")

代码里 imported unet 也加入了 scheduler 。scheduler是用来确认指定diffusion 处理过程中指定步骤加入多少 noise latent

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

从图中可以看出,diffusion 处理中,一开始noise比较高或许逐步降低

noise = torch.randn_like(latent_img) # Random noise
fig, axs = plt.subplots(2, 3, figsize=(16, 12))
for c, sampling_step in enumerate(range(0,51,10)):
    encoded_and_noised = scheduler.add_noise(latent_img, noise, timesteps=torch.tensor([scheduler.timesteps[sampling_step]]))
    axs[c//3][c%3].imshow(latents_to_pil(encoded_and_noised)[0])
    axs[c//3][c%3].set_title(f"Step - {sampling_step}")

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

让我们看看 U-Net 如何从图片中去除noise。先加入些noise

encoded_and_noised = scheduler.add_noise(latent_img, noise, timesteps=torch.tensor([scheduler.timesteps[40]])) latents_to_pil(encoded_and_noised)[0]

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

跑个 U-Net 并试着去噪

## Unconditional textual prompt
prompt = [""]
## Using clip model to get embeddings
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
with torch.no_grad(): 
    text_embeddings = text_encoder(
        text_input.input_ids.to("cuda")
    )[0]
    
## Using U-Net to predict noise    
latent_model_input = torch.cat([encoded_and_noised.to("cuda").float()]).half()
with torch.no_grad():
    noise_pred = unet(
        latent_model_input,40,encoder_hidden_states=text_embeddings
    )["sample"]
## Visualize after subtracting noise 
latents_to_pil(encoded_and_noised- noise_pred)[0]

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

如上图,噪声已经去掉了不少

扮演的角色

Latent diffusion 在 latent 空间里使用 U-Net 逐步的降噪以达到预期的效果。在每一步中,加入到latents 的 noise 数量将会达到最总的降噪输出。 U-Net 最早是由  this paper 提出的。U-Net 由encoder 和 decoder ,组合成 ResNet blocks。The stable diffusion U-Net also has cross-attention layers to provide them with the ability to condition(影响) the output based on the 输入的文字。 The Cross-attention layers are added to both the encoder and the decoder part of the U-Net usually between ResNet blocks. You can learn more about this U-Net architecture here.

组合

我们将试着将 CLIP text encoder, VAE, and U-Net 三者一起组合,看看如何走通文生图流程

回顾The Diffusion Process

stable diffusion mode 需要文字输入和seed。文字输入通过CLIP转换成 77*768 的数组,seed用来生成高斯噪音(4x64x64),它将会成为第一个latent image representation.

Note — You will notice that there is an additional dimension mentioned (1x) in the image like 1x77x768 for text embedding, that is because it represents the batch size of 1.

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

Next, the U-Net iteratively denoises(降噪) the random latent image representations while conditioning(训练) on the text embeddings. The output of the U-Net is predicted(预测) noise residual(剩余), which is then used to compute conditioned(影响) latents via a scheduler algorithm. This process of denoising and text conditioning is repeated N times (We will use 50) to retrieve a better latent image representation.

Once this process is complete, the latent image representation (4x64x64) is decoded by the VAE decoder to retrieve the final output image (3x512x512).

Note — This iterative denoising is an important step for getting a good output image. Typical steps are in the range of 30–80. However, there are recent papers that claim to reduce it to 4–5 steps by using distillation techniques.

代码

import torch, logging
## disable warnings
logging.disable(logging.WARNING)  
## Imaging  library
from PIL import Image
from torchvision import transforms as tfms
## Basic libraries
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
import shutil
import os
## For video display
from IPython.display import HTML
from base64 import b64encode

## Import the CLIP artifacts 
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler
## Initiating tokenizer and encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")
## Initiating the VAE
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", torch_dtype=torch.float16).to("cuda")
## Initializing a scheduler and Setting number of sampling steps
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(50)
## Initializing the U-Net model
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")
## Helper functions
def load_image(p):
    '''
    Function to load images from a defined path
    '''
    return Image.open(p).convert('RGB').resize((512,512))
def pil_to_latents(image):
    '''
    Function to convert image to latents
    '''
    init_image = tfms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0
    init_image = init_image.to(device="cuda", dtype=torch.float16) 
    init_latent_dist = vae.encode(init_image).latent_dist.sample() * 0.18215
    return init_latent_dist
def latents_to_pil(latents):
    '''
    Function to convert latents to images
    '''
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images
def text_enc(prompts, maxlen=None):
    '''
    A function to take a texual promt and convert it into embeddings
    '''
    if maxlen is None: maxlen = tokenizer.model_max_length
    inp = tokenizer(prompts, padding="max_length", max_length=maxlen, truncation=True, return_tensors="pt") 
    return text_encoder(inp.input_ids.to("cuda"))[0].half()

后续代码是StableDiffusionPipeline.from_pretrained  简化版本,主要展示过程

def prompt_2_img(prompts, g=7.5, seed=100, steps=70, dim=512, save_int=False):
    """
    Diffusion process to convert prompt to image
    """
    
    # Defining batch size
    bs = len(prompts) 
    
    # Converting textual prompts to embedding
    text = text_enc(prompts) 
    
    # Adding an unconditional prompt , helps in the generation process
    uncond =  text_enc([""] * bs, text.shape[1])
    emb = torch.cat([uncond, text])
    
    # Setting the seed
    if seed: torch.manual_seed(seed)
    
    # Initiating random noise
    latents = torch.randn((bs, unet.in_channels, dim//8, dim//8))
    
    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)
    
    # Adding noise to the latents 
    latents = latents.to("cuda").half() * scheduler.init_noise_sigma
    
    # Iterating through defined steps
    for i,ts in enumerate(tqdm(scheduler.timesteps)):
        # We need to scale the i/p latents to match the variance
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)
        
        # Predicting noise residual using U-Net
        with torch.no_grad(): u,t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)
            
        # Performing Guidance
        pred = u + g*(t-u)
        
        # Conditioning  the latents
        latents = scheduler.step(pred, ts, latents).prev_sample
        
        # Saving intermediate images
        if save_int: 
            if not os.path.exists(f'./steps'):
                os.mkdir(f'./steps')
            latents_to_pil(latents)[0].save(f'steps/{i:04}.jpeg')
            
    # Returning the latent representation to output an image of 3x512x512
    return latents_to_pil(latents)

最终使用

images = prompt_2_img(["A dog wearing a hat", "a photograph of an astronaut riding a horse"], save_int=False)
for img in images:display(img)

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

def prompt_2_img(prompts, g=7.5, seed=100, steps=70, dim=512, save_int=False):

参数解释
1. prompt - 文字,文生图
2. g or guidance scale - It’s a value that determines how close the image should be to the textual prompt. This is related to a technique called Classifier free guidance which improves the quality of the images generated. The higher the value of the guidance scale, more close it will be to the textual prompt
3. seed - This sets the seed from which the initial Gaussian noisy latents are generated
4. steps - Number of de-noising steps taken for generating the final latents.
5. dim - dimension of the image, for simplicity we are currently generating square images, so only one value is needed
6. save_int - This is optional, a boolean flag, if we want to save intermediate latent images, helps in visualization.

也可以参考的webui里的界面

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

可视化整个过程 

Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE,人工智能

参考

https://towardsdatascience.com/stable-diffusion-using-hugging-face-501d8dbdd8

https://huggingface.co/blog/stable_diffusion文章来源地址https://www.toymoban.com/news/detail-726592.html

到了这里,关于Hugging Face使用Stable diffusion Diffusers Transformers Accelerate Pipelines VAE的文章就介绍完了。如果您还想了解更多内容,请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章,希望大家以后多多支持TOY模板网!

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处: 如若内容造成侵权/违法违规/事实不符,请点击违法举报进行投诉反馈,一经查实,立即删除!

领支付宝红包 赞助服务器费用

相关文章

  • Hugging Face Transformers 萌新完全指南

    欢迎阅读《Hugging Face Transformers 萌新完全指南》,本指南面向那些意欲了解有关如何使用开源 ML 的基本知识的人群。我们的目标是揭开 Hugging Face Transformers 的神秘面纱及其工作原理,这么做不是为了把读者变成机器学习从业者,而是让为了让读者更好地理解 transformers 从而能

    2024年04月22日
    浏览(25)
  • [算法前沿]--028-基于Hugging Face -Transformers的预训练模型微调

    本章节将使用 Hugging Face 生态系统中的库——🤗 Transformers来进行自然语言处理工作(NLP)。 Transformers的历史 以下是 Transformer 模型(简短)历史中的一些参考点: Transformer 架构于 2017 年 6 月推出。原始研究的重点是翻译任务。随后推出了几个有影响力的模型,包括: 2018 年 6

    2024年02月11日
    浏览(36)
  • hugging face开源的transformers模型可快速搭建图片分类任务

    2017年,谷歌团队在论文「Attention Is All You Need」提出了创新模型,其应用于NLP领域架构Transformer模型。从模型发布至今,transformer模型风靡微软、谷歌、Meta等大型科技公司。且目前有模型大一统的趋势,现在transformer 模型不仅风靡整个NLP领域,且随着VIT SWIN等变体模型,成功把

    2024年02月06日
    浏览(33)
  • Hugging Face快速入门(重点讲解模型(Transformers)和数据集部分(Datasets))

    本文主要包括如下内容: Hugging Face是什么,提供了哪些内容 Hugging Face模型的使用(Transformer类库) Hugging Face数据集的使用(Datasets类库) Hugging Face Hub和 Github 类似,都是Hub(社区)。Hugging Face可以说的上是机器学习界的Github。Hugging Face为用户提供了以下主要功能: 模型仓库(

    2024年01月21日
    浏览(33)
  • Hugging Face 的 Transformers 库快速入门 (一)开箱即用的 pipelines

    注:本系列教程仅供学习使用, 由原作者授权, 均转载自小昇的博客 。 Transformers 是由 Hugging Face 开发的一个 NLP 包,支持加载目前绝大部分的预训练模型。随着 BERT、GPT 等大规模语言模型的兴起,越来越多的公司和研究者采用 Transformers 库来构建 NLP 应用,因此熟悉 Transformer

    2023年04月27日
    浏览(39)
  • 【扩散模型】11、Stable Diffusion | 使用 Diffusers 库来看看 Stable Diffusion 的结构

    参考:HuggingFace 参考:https://jalammar.github.io/illustrated-stable-diffusion/ Stable Diffusion 这个模型架构是由 Stability AI 公司推于2022年8月由 CompVis、Stability AI 和 LAION 的研究人员在 Latent Diffusion Model 的基础上创建并推出的。 其原型是(Latent Diffusion Model),一般的扩散模型都需要直接在像

    2024年01月16日
    浏览(35)
  • 【扩散模型】12、Stable Diffusion | 使用 Diffusers 库来看看 Stable Diffusion 的结构

    参考:HuggingFace 参考:https://jalammar.github.io/illustrated-stable-diffusion/ Stable Diffusion 这个模型架构是由 Stability AI 公司推于2022年8月由 CompVis、Stability AI 和 LAION 的研究人员在 Latent Diffusion Model 的基础上创建并推出的。 其原型是(Latent Diffusion Model),一般的扩散模型都需要直接在像

    2024年01月18日
    浏览(45)
  • 使用 Docker 和 Diffusers 快速上手 Stable Video Diffusion 图生视频大模型

    本篇文章聊聊,如何快速上手 Stable Video Diffusion (SVD) 图生视频大模型。 月底计划在机器之心的“AI技术论坛”做关于使用开源模型 “Stable Diffusion 模型” 做有趣视频的实战分享。 因为会议分享时间有限,和之前一样,比较简单的部分,就用博客文章的形式来做补充分享吧。

    2024年01月24日
    浏览(53)
  • diffusers库中stable Diffusion模块的解析

    diffusers中,stable Diffusion v1.5主要由以下几个部分组成 下面给出具体的结构说明。 “text_encoder block” “vae block” “unet block” “feature extractor block” “tokenizer block” “safety_checker block” “scheduler block”

    2024年02月03日
    浏览(28)
  • Stable Diffusion XL on diffusers

    翻译自:https://huggingface.co/docs/diffusers/using-diffusers/sdxl v0.24.0 非逐字翻译 Stable Diffusion XL (SDXL) 是一个强大的图像生成模型,其在上一代 Stable Diffusion 的基础上主要做了如下优化: 参数量增加:SDXL 中 Unet 的参数量比前一代大了 3 倍,并且 SDXL 还引入了第二个 text-encoder(OpenCL

    2024年03月14日
    浏览(29)

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

博客赞助

微信扫一扫打赏

请作者喝杯咖啡吧~博客赞助

支付宝扫一扫领取红包,优惠每天领

二维码1

领取红包

二维码2

领红包