1. What Diffusers Offers
1.1 Overview
Diffusers is a library of state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures.
The Diffusers library prioritizes usability over raw performance.
Diffusers provides three main capabilities:
- State-of-the-art diffusion pipelines for low-code inference.
- Interchangeable noise schedulers for trading off generation speed against quality.
- Pretrained models that serve as building blocks for your own diffusion models.
1.2 Supported Pipelines
Pipeline | Paper / Project | Task |
---|---|---|
alt_diffusion | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Image-to-Image Text-Guided Generation |
audio_diffusion | Audio Diffusion | Unconditional Audio Generation |
controlnet | Adding Conditional Control to Text-to-Image Diffusion Models | Image-to-Image Text-Guided Generation |
cycle_diffusion | Unifying Diffusion Models’ Latent Space, with Applications to CycleDiffusion and Guidance | Image-to-Image Text-Guided Generation |
dance_diffusion | Dance Diffusion | Unconditional Audio Generation |
ddpm | Denoising Diffusion Probabilistic Models | Unconditional Image Generation |
ddim | Denoising Diffusion Implicit Models | Unconditional Image Generation |
if | IF | Image Generation |
if_img2img | IF | Image-to-Image Generation |
if_inpainting | IF | Image-to-Image Generation |
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Text-to-Image Generation |
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Super Resolution Image-to-Image |
latent_diffusion_uncond | High-Resolution Image Synthesis with Latent Diffusion Models | Unconditional Image Generation |
paint_by_example | Paint by Example: Exemplar-based Image Editing with Diffusion Models | Image-Guided Image Inpainting |
pndm | Pseudo Numerical Methods for Diffusion Models on Manifolds | Unconditional Image Generation |
score_sde_ve | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation |
score_sde_vp | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation |
semantic_stable_diffusion | Semantic Guidance | Text-Guided Generation |
stable_diffusion_text2img | Stable Diffusion | Text-to-Image Generation |
stable_diffusion_img2img | Stable Diffusion | Image-to-Image Text-Guided Generation |
stable_diffusion_inpaint | Stable Diffusion | Text-Guided Image Inpainting |
stable_diffusion_panorama | MultiDiffusion | Text-to-Panorama Generation |
stable_diffusion_pix2pix | InstructPix2Pix: Learning to Follow Image Editing Instructions | Text-Guided Image Editing |
stable_diffusion_pix2pix_zero | Zero-shot Image-to-Image Translation | Text-Guided Image Editing |
stable_diffusion_attend_and_excite | Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models | Text-to-Image Generation |
stable_diffusion_self_attention_guidance | Improving Sample Quality of Diffusion Models Using Self-Attention Guidance | Text-to-Image Generation, Unconditional Image Generation |
stable_diffusion_image_variation | Stable Diffusion Image Variations | Image-to-Image Generation |
stable_diffusion_latent_upscale | Stable Diffusion Latent Upscaler | Text-Guided Super Resolution Image-to-Image |
stable_diffusion_model_editing | Editing Implicit Assumptions in Text-to-Image Diffusion Models | Text-to-Image Model Editing |
stable_diffusion_2 | Stable Diffusion 2 | Text-to-Image Generation |
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Image Inpainting |
stable_diffusion_2 | Depth-Conditional Stable Diffusion | Depth-to-Image Generation |
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Super Resolution Image-to-Image |
stable_diffusion_safe | Safe Stable Diffusion | Text-Guided Generation |
stable_unclip | Stable unCLIP | Text-to-Image Generation |
stable_unclip | Stable unCLIP | Image-to-Image Text-Guided Generation |
stochastic_karras_ve | Elucidating the Design Space of Diffusion-Based Generative Models | Unconditional Image Generation |
text_to_video_sd | Modelscope’s Text-to-video-synthesis Model in Open Domain | Text-to-Video Generation |
unclip | Hierarchical Text-Conditional Image Generation with CLIP Latents (implementation by kakaobrain) | Text-to-Image Generation |
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Text-to-Image Generation |
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Image Variations Generation |
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Dual Image and Text Guided Generation |
vq_diffusion | Vector Quantized Diffusion Model for Text-to-Image Synthesis | Text-to-Image Generation |
1.3 DiffusionPipeline
DiffusionPipeline is a highly abstracted end-to-end interface; all of the models and schedulers in huggingface-diffusers are bundled behind it, which makes it easy to spin up inference (a minimal usage sketch follows the table below).
Task | Description | Pipeline |
---|---|---|
Unconditional Image Generation | generate an image from Gaussian noise | unconditional_image_generation |
Text-Guided Image Generation | generate an image given a text prompt | conditional_image_generation |
Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | img2img |
Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | inpaint |
Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | depth2img |
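As a rough sketch of how this entry point is used (the model id and scheduler choice here are purely illustrative): DiffusionPipeline.from_pretrained resolves the concrete pipeline class from the checkpoint's config, and the bundled components are exposed as attributes, so for example the noise scheduler can be swapped out.

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# from_pretrained resolves the concrete pipeline class from the checkpoint config
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# the bundled components (unet, vae, text_encoder, scheduler, ...) are plain attributes,
# so e.g. the noise scheduler can be swapped for a faster one
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("an astronaut riding a horse on mars", num_inference_steps=25).images[0]
image.save("astronaut.png")
```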
2. Task Pipelines
A very important point to keep in mind when using diffusers is the distinction between inference pipelines and training pipelines.
2.1 Direct Inference Pipelines
2.1.1 Unconditional image generation
Unconditional image generation is relatively simple: the model in the pipeline generates images without any additional context (text, images, etc.).
The images produced by an unconditional image generation pipeline depend only on the training data.
Pipeline: DiffusionPipeline
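A minimal sketch of unconditional sampling; the DDPM checkpoint id below is just one example of an unconditional model:

```python
from diffusers import DiffusionPipeline

# google/ddpm-cat-256 is just one example of an unconditional checkpoint
pipe = DiffusionPipeline.from_pretrained("google/ddpm-cat-256").to("cuda")

# no prompt: the result depends only on the training data and the sampled noise
image = pipe().images[0]
image.save("ddpm_cat.png")
```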
2.1.2 Text-to-image generation
Text-to-image generation, also called conditional image generation, generates an image from a text prompt. The text is converted into embeddings that condition the model to generate an image from noise.
Pipeline: DiffusionPipeline
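A short text-to-image sketch showing the usual conditioning knobs (the model id and prompt are illustrative); guidance_scale controls how strongly the text embeddings steer the denoising:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(0)  # fixed seed for reproducibility
image = pipe(
    "a photograph of an astronaut riding a horse",
    negative_prompt="blurry, low quality",
    guidance_scale=7.5,        # how strongly the text embeddings steer denoising
    num_inference_steps=50,
    generator=generator,
).images[0]
image.save("astronaut_rides_horse.png")
```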
2.1.3 Text-guided image-to-image generation
Text-guided image-to-image generation produces a new image conditioned on both a text prompt and an initial image.
Pipeline: StableDiffusionImg2ImgPipeline
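A minimal sketch, assuming a local sketch.png as the initial image (both the file and the model id are placeholders); strength controls how much of the initial image is preserved:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB").resize((768, 512))

# strength controls how much the initial image is altered (0 = keep, 1 = ignore)
image = pipe(
    prompt="A fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
image.save("fantasy_landscape.png")
```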
2.1.4 Text-guided image-inpainting
Text-guided image inpainting edits a specific region of an image using a mask and a text prompt.
Pipeline: StableDiffusionInpaintPipeline
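A minimal sketch; the local image and mask files are placeholders, and the inpainting checkpoint is one commonly used example:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("dog_on_bench.png").convert("RGB")
mask_image = Image.open("bench_mask.png").convert("RGB")  # white = region to repaint

image = pipe(
    prompt="Face of a yellow cat, high resolution, sitting on a park bench",
    image=init_image,
    mask_image=mask_image,
).images[0]
image.save("inpainted.png")
```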
2.1.5 Text-guided depth-to-image generation
Text-guided depth-to-image generation produces a new image conditioned on a text prompt and an initial image. A depth_map argument can be passed to preserve the depth structure of the image; if no depth_map is provided, depth is estimated with a depth-estimation model.
Pipeline: StableDiffusionDepth2ImgPipeline
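A minimal sketch with a placeholder input image; since no depth_map is passed, depth is estimated internally:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("room.png").convert("RGB")

# no depth_map passed here, so depth is estimated by the pipeline
image = pipe(
    prompt="two tigers",
    image=init_image,
    negative_prompt="bad, deformed, ugly",
    strength=0.7,
).images[0]
image.save("tigers.png")
```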
2.2 Training Pipelines
2.2.1 Overview
Task | Accelerate supported | Datasets provided |
---|---|---|
Unconditional Image Generation | ✅ | ✅ |
Text-to-Image fine-tuning | ✅ | ✅ |
Textual Inversion | ✅ | - |
Dreambooth | ✅ | - |
Training with LoRA | ✅ | - |
ControlNet | ✅ | ✅ |
InstructPix2Pix | ✅ | ✅ |
Custom Diffusion | ✅ | ✅ |
2.2.2 Unconditional Image Generation
Unconditional image generation is not conditioned on any text or images. It simply generates images that resemble the distribution of its training data.
accelerate launch train_unconditional.py \
--dataset_name="huggan/flowers-102-categories" \
--resolution=64 \
--output_dir="ddpm-ema-flowers-64" \
--train_batch_size=16 \
--num_epochs=100 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-4 \
--lr_warmup_steps=500 \
--mixed_precision=no \
--push_to_hub
2.2.3 Text-to-Image fine-tuning
The training workflow for generating images from a text prompt, as in models such as Stable Diffusion.
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$dataset_name \
--use_ema \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-pokemon-model"
2.2.4 Textual Inversion
Textual Inversion captures novel concepts from a small number of example images. The learned concepts can then be used in prompts to personalize image generation and give finer control over the output.
accelerate launch textual_inversion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$DATA_DIR \
--learnable_property="object" \
--placeholder_token="<cat-toy>" --initializer_token="toy" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=3000 \
--learning_rate=5.0e-04 --scale_lr \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--output_dir="textual_inversion_cat"
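Once training finishes, the learned embedding in the output directory can be attached to a pipeline and triggered via the placeholder token. A sketch, assuming the pipeline is the same base model as $MODEL_NAME and a diffusers version that provides load_textual_inversion:

```python
import torch
from diffusers import StableDiffusionPipeline

# base model should match the $MODEL_NAME used during training (illustrative here)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# load the learned embedding saved by the training script above
pipe.load_textual_inversion("textual_inversion_cat")

# the placeholder token now refers to the learned concept
image = pipe("A <cat-toy> backpack", num_inference_steps=50).images[0]
image.save("cat_backpack.png")
```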
2.2.5 Dreambooth
DreamBooth is a method for personalizing a text-to-image model such as Stable Diffusion given only a few (3-5) images of a subject. It lets the model generate contextualized images of that subject in different scenes, poses, and views.
python train_dreambooth_flax.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of sks dog" \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--max_train_steps=400
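After training, the personalized model can be loaded like any other pipeline. Note that the command above uses the Flax script; the sketch below assumes the PyTorch variant (train_dreambooth.py) was used, so that $OUTPUT_DIR contains a regular PyTorch pipeline:

```python
import torch
from diffusers import StableDiffusionPipeline

# assumes the PyTorch script wrote a full pipeline to this directory
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/output_dir", torch_dtype=torch.float16
).to("cuda")

# "sks" is the rare identifier bound to the subject during training
image = pipe(
    "a photo of sks dog in a bucket",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("dog_in_bucket.png")
```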
2.2.6 LoRA
LoRA (Low-Rank Adaptation of Large Language Models) is a training method that speeds up the training of large models while using less memory. It adds pairs of rank-decomposition weight matrices (called update matrices) to the existing weights and trains only those newly added weights.
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--dataloader_num_workers=8 \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-04 \
--max_grad_norm=1 \
--lr_scheduler="cosine" --lr_warmup_steps=0 \
--output_dir=${OUTPUT_DIR} \
--push_to_hub \
--hub_model_id=${HUB_MODEL_ID} \
--report_to=wandb \
--checkpointing_steps=500 \
--validation_prompt="A pokemon with blue eyes." \
--seed=1337
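The training run only saves the small LoRA attention weights, so at inference time they are attached on top of the original base model. A sketch, assuming the base model matches $MODEL_NAME and the LoRA weights were written to the output directory above:

```python
import torch
from diffusers import StableDiffusionPipeline

# base model should match the $MODEL_NAME used for LoRA training (illustrative here)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# attach the LoRA attention weights produced by the training run above
pipe.unet.load_attn_procs("path/to/lora_output_dir")

# scale blends the LoRA weights with the base model (0 = base only, 1 = full LoRA)
image = pipe(
    "A pokemon with blue eyes.",
    num_inference_steps=25,
    cross_attention_kwargs={"scale": 0.8},
).images[0]
image.save("pokemon_lora.png")
```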
2.2.7 ControlNet
Compared with plain img2img, ControlNet is more precise and effective: it can directly extract the composition of a picture, the pose of a person, or the depth of a scene, and use them as conditions to constrain image generation.
accelerate launch train_controlnet.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--output_dir=$OUTPUT_DIR \
--dataset_name=fusing/fill50k \
--resolution=512 \
--learning_rate=1e-5 \
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
--train_batch_size=4
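At inference time the trained ControlNet is loaded separately and plugged into a Stable Diffusion pipeline together with a conditioning image. A sketch with placeholder paths (the base model id and file names are illustrative):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# load the trained ControlNet and attach it to a base Stable Diffusion model
controlnet = ControlNetModel.from_pretrained(
    "path/to/controlnet_output_dir", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# the conditioning image, e.g. one of the fill50k circle images used above
control_image = Image.open("conditioning_image_1.png")

image = pipe(
    "red circle with blue background",
    image=control_image,
    num_inference_steps=20,
).images[0]
image.save("controlnet_sample.png")
```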
2.2.8 InstructPix2Pix
How InstructPix2Pix is used: given an input image and an editing instruction that tells the model what to do, the model follows the instruction to edit the image.
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_ID \
--enable_xformers_memory_efficient_attention \
--resolution=256 --random_flip \
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=15000 \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42
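Inference uses the dedicated InstructPix2Pix pipeline, which takes both the instruction and the input image; image_guidance_scale trades off faithfulness to the input image against the strength of the edit. The sketch below loads the public timbrooks/instruct-pix2pix checkpoint, but a locally trained output directory could be passed instead; the input file is a placeholder:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("mountain.png").convert("RGB")

# image_guidance_scale controls how closely the edit sticks to the input image
edited = pipe(
    "make the mountains snowy",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited.save("snowy_mountain.png")
```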
2.2.9 Custom Diffusion
Custom Diffusion learns new concepts efficiently by optimizing only the parameters in the cross-attention layers of a text-to-image diffusion model. When combining multiple concepts, each concept can be trained separately and the fine-tuned models then merged into one via constrained optimization.
accelerate launch train_custom_diffusion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--class_data_dir=./real_reg/samples_cat/ \
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
--class_prompt="cat" --num_class_images=200 \
--instance_prompt="photo of a <new1> cat" \
--resolution=512 \
--train_batch_size=2 \
--learning_rate=1e-5 \
--lr_warmup_steps=0 \
--max_train_steps=250 \
--scale_lr --hflip \
--modifier_token "<new1>" \
--validation_prompt="<new1> cat sitting in a bucket" \
--report_to="wandb"
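Loading the result roughly mirrors LoRA plus Textual Inversion: the fine-tuned cross-attention weights and the new modifier-token embedding are attached to a base pipeline. A sketch with assumed file names (they should match whatever the training script actually writes to $OUTPUT_DIR):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# attach the fine-tuned cross-attention weights and the new modifier token
# (file names assumed; check what the training run saved to the output directory)
pipe.unet.load_attn_procs(
    "path/to/output_dir", weight_name="pytorch_custom_diffusion_weights.bin"
)
pipe.load_textual_inversion("path/to/output_dir", weight_name="<new1>.bin")

image = pipe(
    "<new1> cat sitting in a bucket",
    num_inference_steps=100,
    guidance_scale=6.0,
    eta=1.0,
).images[0]
image.save("custom_diffusion_cat.png")
```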
3. Prompt Engineering
Weighting prompts
Most of what diffusers provides is, at its core, text2img: generating an image from a given prompt. The prompt is supposed to cover all of the concepts the model should generate, but things rarely work out that way, so parts of the prompt usually need to be weighted up or down to emphasize or de-emphasize them.
Diffusion models work by conditioning the model's cross-attention layers on contextualized text embeddings. A simple way to emphasize (or de-emphasize) certain parts of the prompt is therefore to scale up or down the text embedding vectors that correspond to those parts.
Prompt weighting lets you go from
prompt = "a red cat playing with a ball"
to
prompt = "a red cat playing with a ball++"
to emphasize a particular part of the prompt.
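In diffusers this +/++ emphasis syntax is typically handled by the companion compel library, which turns a weighted prompt string into embeddings that are passed to the pipeline via prompt_embeds. A minimal sketch, assuming the compel package is installed and using an illustrative model id:

```python
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# compel parses the +/++ emphasis syntax and scales the matching text embeddings
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
prompt_embeds = compel_proc("a red cat playing with a ball++")

image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save("red_cat_ball.png")
```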