| Key | Value |
| --- | --- |
| theme | seriph |
| colorSchema | light |
| background | backgrounds/understand-sd.webp |
| title | Understand Stable Diffusion from Code |
| info | This slide explains image generation with Latent Diffusion Models through source code, using parediffusers, a library that simplifies diffusers. |
| author | masaishi |
| keywords | |
| favicon | images/icon_tea_light.webp |
| export | |
| lineNumbers | true |
| class | text-center |
| highlighter | shiki |
| drawings | |
| transition | slide-left |
| mdc | true |
| fonts | |
Prompt: Understand Stable Diffusion from code, cyberpunk theme, best quality, high resolution, concept art
I'm interested in AI/ML and GIS.
- Tea
- Tennis
Purpose
While writing this slide, I realized there were many things I did not fully understand, so some explanations may be wrong. If anything is unclear or mistaken, please let me know via the links below.
Issues: Please let me know if you find any mistakes.
Discussions: Please ask if you have any questions.
Pull Requests: Improvements are welcome.
The concept of this slide is to introduce the flow of image generation through code, so essentially all of the code shown here can actually be run.
understand-stable-diffusion-slidev: Repository of this slide.
understand-stable-diffusion-slidev-notebooks: Notebooks for generating sample images and gifs.
parediffusers: Simple library for generating images without using huggingface/diffusers.
Prompt: Stable Diffusion, watercolor painting, best quality, high resolution
- An image generation model based on the Latent Diffusion Model (LDM), developed by Stability AI.
- It can be used for Text-to-Image and Image-to-Image generation.
- It can easily be run using the Diffusers library.
- https://arxiv.org/abs/2112.10752
- A library for diffusion models developed by Hugging Face🤗.
- Makes it easy to run many image generation models.
- https://github.com/huggingface/diffusers
Install the Diffusers library:
```bash
!pip install transformers diffusers accelerate -U
```
Generate an image from text:
```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion 2 pipeline in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",
    torch_dtype=torch.float16,
).to(device=torch.device("cuda"))

prompt = "painting depicting the sea, sunrise, ship, artstation, 4k, concept art"
image = pipe(prompt, width=512, height=512).images[0]
display(image)
```
<iframe frameborder="0" scrolling="yes" class="overflow-scroll mt-10" style="width:100%; height:85%;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fdiffusers%2Fblob%2Fmain%2Fsrc%2Fdiffusers%2Fpipelines%2Fstable_diffusion%2Fpipeline_stable_diffusion.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
<iframe frameborder="0" scrolling="yes" class="overflow-scroll mt-10" style="width:100%; height:85%;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Fpipeline.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Install the PareDiffusers library:
```bash
!pip install parediffusers
```
Generate an image from text:
```python
import torch
from parediffusers import PareDiffusionPipeline

pipe = PareDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",
    device=torch.device("cuda"),
    dtype=torch.float16,
)

prompt = "painting depicting the sea, sunrise, ship, artstation, 4k, concept art"
image = pipe(prompt, width=512, height=512)
display(image)
```
```python {all}
image = pipe(prompt, width=512, height=512)
```
```python {all}
def __call__(self, prompt: str, height: int = 512, width: int = 512, ...):
prompt_embeds = self.encode_prompt(prompt)
latents = self.get_latent(width, height).unsqueeze(dim=0)
latents = self.denoise(latents, prompt_embeds, ...)
image = self.vae_decode(latents)
return image
```
```md {all}
1. `encode_prompt` : Convert the prompt to an embedding.
2. `get_latent` : Create a random latent.
3. `denoise` : Denoise using the Scheduler and UNet.
4. `vae_decode` : Decode to pixel space with the VAE.
```
1. `encode_prompt` : Convert the prompt to an embedding.
2. `get_latent` : Create a random latent.
3. `denoise` : Denoise using the Scheduler and UNet.
4. `vae_decode` : Decode to pixel space with the VAE.
What is a Latent Diffusion Model (LDM)?
What is a Denoising Diffusion Probabilistic Model (DDPM)?
DDPMs are also used for audio and other kinds of data, but this slide focuses on images.
- The diffusion (forward) process gradually adds noise to the training data; it is a stochastic process (Markov chain). A minimal sketch follows below.
- The reverse process learns to recover the original data from the noisy data.
Jonathan Ho, Ajay Jain, Pieter Abbeel: “Denoising Diffusion Probabilistic Models”, 2020; arXiv:2006.11239.
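As a minimal sketch (not code from parediffusers; the beta schedule values are only illustrative), the forward process can be sampled in closed form: the noisy sample at step t keeps a fraction of the original signal and mixes in Gaussian noise.

```python
import torch

# Illustrative linear beta schedule; the cumulative product of (1 - beta)
# tells us how much of the original signal survives up to step t.
num_train_timesteps = 1000
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(3, 64, 64)      # stand-in for a clean training sample
t = 500                          # an intermediate timestep
noise = torch.randn_like(x0)

# Closed form of the Markov chain: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
xt = alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * noise
```

The reverse process then trains a model to predict the noise (or a related quantity) given x_t and t.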
What is a Latent Diffusion Model (LDM)?
Difference of Loss Functions
$$ L_{DM} := \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(x_{t},t) \Vert_{2}^{2}\Big] \, . $$
$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t) \Vert_{2}^{2}\Big] \, . $$
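For context (standard DDPM/LDM background, not shown on the slide): both losses use the same forward-noising relation, and the only difference is whether it is applied to the image $x$ itself or to its latent $z = \mathcal{E}(x)$.

$$ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). $$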
Latent Diffusion Model (LDM)
$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t) \Vert_{2}^{2}\Big] \, . $$
Latent Diffusion Model (LDM) with Conditioning
$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0, 1), t }\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t, \tau_\theta(y)) \Vert_{2}^{2}\Big] \, , $$
$$ Q = W^{(i)}_Q \cdot \varphi_i(z_t), \; K = W^{(i)}_K \cdot \tau_\theta(y), \; V = W^{(i)}_V \cdot \tau_\theta(y) . $$
Jonathan Ho, Ajay Jain, Pieter Abbeel: “Denoising Diffusion Probabilistic Models”, 2020; arXiv:2006.11239.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer: “High-Resolution Image Synthesis with Latent Diffusion Models”, 2021; arXiv:2112.10752.
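As a minimal sketch of the cross-attention in the conditioning equation above (the dimensions below are illustrative, not the actual UNet sizes): the flattened UNet features $\varphi_i(z_t)$ provide the queries, and the text embedding $\tau_\theta(y)$ provides the keys and values.

```python
import torch
import torch.nn.functional as F

d_latent, d_text, d_attn = 320, 1024, 320
W_Q = torch.nn.Linear(d_latent, d_attn, bias=False)
W_K = torch.nn.Linear(d_text, d_attn, bias=False)
W_V = torch.nn.Linear(d_text, d_attn, bias=False)

phi_z = torch.randn(1, 64 * 64, d_latent)   # flattened UNet feature map phi_i(z_t)
tau_y = torch.randn(1, 77, d_text)          # text embedding tau_theta(y)

Q, K, V = W_Q(phi_z), W_K(tau_y), W_V(tau_y)
# Scaled dot-product attention over the 77 text tokens.
attn = F.softmax(Q @ K.transpose(1, 2) / d_attn**0.5, dim=-1)   # (1, 4096, 77)
out = attn @ V                                                  # (1, 4096, 320)
```

The softmax-weighted sum injects prompt information into every spatial location of the latent feature map.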
<iframe frameborder="0" scrolling="no" style="width:100%; height:163px;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2F035772c684ae8d16c7c908f185f6413b72658126%2Fsrc%2Fparediffusers%2Fpipeline.py%23L131-L134&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer: “High-Resolution Image Synthesis with Latent Diffusion Models”, 2021; arXiv:2112.10752.
What is Latent Space?
The flow of image generation in 4 steps
Step 1: Convert the prompt to an embedding.
Step 2: Create a random latent.
Step 3: Denoise using the Scheduler and UNet.
Step 4: Decode to pixel space with the VAE.
The flow of image generation in 4 steps
Prompt: Pipeline, cyberpunk theme, best quality, high resolution, concept art
Step 1: encode_prompt
Step 1: encode_prompt
Required components
encode_prompt calls another function, get_embes, twice
::right::
```python
def encode_prompt(self, prompt: str):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
    negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
    prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
    return prompt_embeds
```
Where are the required components used?
::right::
```python
def encode_prompt(self, prompt: str):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
    negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
    prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
    return prompt_embeds
```

```python
def get_embes(self, prompt, max_length):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    text_input_ids = text_inputs.input_ids.to(self.device)
    prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
    return prompt_embeds
```
Where are the required components used?
- L54: `CLIPTokenizer`: Tokenizes the text (prompt). Turning text into a vector of token IDs makes it easier for the model to handle.
- L56: `CLIPTextModel`: A multimodal model of language and images. In image generation, it extracts an embedding of the image described by the prompt. (A small sketch of what these produce follows the code below.)
::right::
```python
@classmethod
def from_pretrained(cls, model_name, device=torch.device("cuda"), dtype=torch.float16):
    # Comments omitted
    tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
    scheduler = PareDDIMScheduler.from_config(model_name, subfolder="scheduler")
    unet = PareUNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
    vae = PareAutoencoderKL.from_pretrained(model_name, subfolder="vae")
    return cls(tokenizer, text_encoder, scheduler, unet, vae, device, dtype)
```
```python
def get_embes(self, prompt, max_length):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    text_input_ids = text_inputs.input_ids.to(self.device)
    prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
    return prompt_embeds
```
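As mentioned above, here is a small sketch of what the two components produce for a short prompt. The 77-token length comes from the tokenizer's model_max_length; the 1024-dimensional embedding is what I expect for stabilityai/stable-diffusion-2, so treat the exact sizes as an assumption.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "stabilityai/stable-diffusion-2"
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")

# Tokenize: text -> fixed-length sequence of token IDs (padded/truncated to 77).
text_inputs = tokenizer(
    "painting depicting the sea, sunrise, ship",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
print(text_inputs.input_ids.shape)  # torch.Size([1, 77])

# Encode: token IDs -> one embedding vector per token.
with torch.no_grad():
    prompt_embeds = text_encoder(text_inputs.input_ids)[0]
print(prompt_embeds.shape)  # e.g. torch.Size([1, 77, 1024]) for this model
```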
Understand the whole flow
- L54: `CLIPTokenizer`: Tokenizes the text (prompt). Turning text into a vector of token IDs makes it easier for the model to handle.
- L56: `CLIPTextModel`: A multimodal model of language and images. In image generation, it extracts an embedding of the image described by the prompt.
- L46: The negative prompt is left as an empty string to keep things simple.
::right::
```python
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
```

```python
def encode_prompt(self, prompt: str):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
    negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
    prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
    return prompt_embeds
```

```python
def get_embes(self, prompt, max_length):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    text_input_ids = text_inputs.input_ids.to(self.device)
    prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
    return prompt_embeds
```
<iframe frameborder="0" scrolling="no" class="emg-iframe-text-inputs" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fch5-text_inputs.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-text-inputs { transform: scale(0.9) translate(-50%, -50%); /* Apply both transformations */ transform-origin: top left; position: absolute; top: 50%; left: 50%; width: 100%; height: 100%; } </style>
<iframe frameborder="0" scrolling="no" class="emg-iframe-prompt-embeds" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fch5-prompt_embeds.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-prompt-embeds { transform: scale(0.8) translate(-50%, -50%); transform-origin: top left; position: absolute; top: 57%; left: 50%; width: 100%; height: 130%; } </style>
<iframe frameborder="0" scrolling="yes" class="overflow-scroll emg-iframe-play-prompt-embeds" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fnotebooks%2Fch0.0.2_Play_prompt_embeds.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-play-prompt-embeds { transform: scale(0.5) translate(-50%, -50%); /* Apply both transformations */ transform-origin: top left; position: absolute; top: 50%; left: 50%; width: 100%; height: 160%; } </style>
Prompt: Scheduler, flat vector illustration, best quality, high resolution
Step 2: get_latent
Required components
Understand the whole flow
- L63: Generate a random latent tensor whose width and height are 1/8 of the output image (a quick shape check follows the code).
::right::
```python
def get_latent(self, width: int, height: int):
    """
    Generate a random initial latent tensor to start the diffusion process.
    """
    return torch.randn((4, width // 8, height // 8)).to(
        device=self.device, dtype=self.dtype
    )
```
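A quick shape check in plain PyTorch, just to illustrate the 1/8 scaling mentioned on the left:

```python
import torch

width, height = 512, 512
# 4 latent channels; the spatial size is 1/8 of the output image,
# because the VAE downsamples by a factor of 8.
latent = torch.randn((4, width // 8, height // 8))
print(latent.shape)  # torch.Size([4, 64, 64])

# The pipeline adds a batch dimension before denoising: (1, 4, 64, 64).
print(latent.unsqueeze(dim=0).shape)
```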
Prompt: UNet, watercolor painting, detailed, brush strokes, best quality, high resolution
Step 3: denoise
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
Required components
Step 3: denoise
Where are the required components used?
- L86: UNet
- L91: Scheduler
::right::
```python
@torch.no_grad()
def denoise(self, latents, prompt_embeds, num_inference_steps=50, guidance_scale=7.5):
    """
    Iteratively denoise the latent space using the diffusion model to produce an image.
    """
    timesteps, num_inference_steps = self.retrieve_timesteps(num_inference_steps)
    for t in timesteps:
        latent_model_input = torch.cat([latents] * 2)
        # Predict the noise residual for the current timestep
        noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
        uncond_residual, text_cond_residual = noise_residual.chunk(2)
        guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
        # Update latents by reversing the diffusion process for the current timestep
        latents = self.scheduler.step(guided_noise_residual, t, latents)[0]
    return latents
```
Where are the required components used?
- L86: UNet2DConditionModel
- L91: DDIMScheduler
::right::
```python
@classmethod
def from_pretrained(cls, model_name, device=torch.device("cuda"), dtype=torch.float16):
    # Comments omitted
    tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
    scheduler = PareDDIMScheduler.from_config(model_name, subfolder="scheduler")
    unet = PareUNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
    vae = PareAutoencoderKL.from_pretrained(model_name, subfolder="vae")
    return cls(tokenizer, text_encoder, scheduler, unet, vae, device, dtype)
```
```python
for t in timesteps:
    latent_model_input = torch.cat([latents] * 2)
    # Predict the noise residual for the current timestep
    noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
    uncond_residual, text_cond_residual = noise_residual.chunk(2)
    guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
    # Update latents by reversing the diffusion process for the current timestep
    latents = self.scheduler.step(guided_noise_residual, t, latents)[0]
return latents
```
Understand the whole flow
- L80: Obtain the timesteps from the Scheduler (the Scheduler is described later).
- L82: Loop over the timesteps (the number of timesteps equals num_inference_steps).
- L86: Denoise with the UNet (the UNet is described later).
- L88: Calculate how strongly the prompt is taken into account, i.e. classifier-free guidance (Reference: Jonathan Ho, Tim Salimans: “Classifier-Free Diffusion Guidance”, 2022; arXiv:2207.12598). A small numeric sketch follows the code.
- L91: The strength of each denoising step is determined by the Scheduler.
::right::
```python
@torch.no_grad()
def denoise(self, latents, prompt_embeds, num_inference_steps=50, guidance_scale=7.5):
    """
    Iteratively denoise the latent space using the diffusion model to produce an image.
    """
    timesteps, num_inference_steps = self.retrieve_timesteps(num_inference_steps)
    for t in timesteps:
        latent_model_input = torch.cat([latents] * 2)
        # Predict the noise residual for the current timestep
        noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
        uncond_residual, text_cond_residual = noise_residual.chunk(2)
        guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
        # Update latents by reversing the diffusion process for the current timestep
        latents = self.scheduler.step(guided_noise_residual, t, latents)[0]
    return latents
```
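As referenced in the list above, here is a small numeric sketch of the guidance line (L88), using dummy tensors in place of real UNet outputs: the guided prediction starts from the unconditional estimate and moves toward the text-conditioned one, scaled by guidance_scale.

```python
import torch

guidance_scale = 7.5
uncond_residual = torch.zeros(1, 4, 64, 64)    # pretend noise prediction for the empty prompt
text_cond_residual = torch.ones(1, 4, 64, 64)  # pretend noise prediction for the text prompt

guided = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
print(guided.mean())  # tensor(7.5000) -- the text direction, amplified 7.5x
```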
Determine the strength of denoising
- L49: Get alpha_prod_t (0–1.0), which indicates how much of the original signal is retained.
- L50: Get alpha_prod_t_prev (0–1.0).
- L52: alpha_prod_t + beta_prod_t = 1.
- L53: Estimate the original sample from the current sample and the model output.
- L54: Estimate the noise that was added.
- L56: Calculate the direction for moving back toward the original image.
- L57: Use these three values to compute the sample one denoising step further (the update is written out as an equation after the code).
::right::
```python
def step(
    self,
    model_output: torch.FloatTensor,
    timestep: int,
    sample: torch.FloatTensor,
) -> list:
    """Perform a single step of denoising in the diffusion process."""
    prev_timestep = timestep - self.config.num_train_timesteps // self.num_inference_steps
    alpha_prod_t = self.alphas_cumprod[timestep]
    alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod
    beta_prod_t = 1 - alpha_prod_t
    pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
    pred_epsilon = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample
    pred_sample_direction = (1 - alpha_prod_t_prev) ** (0.5) * pred_epsilon
    prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction
    return prev_sample, pred_original_sample
```
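In equation form, the step above first estimates the clean sample and the noise from the model output (written $v_\theta$ here, since the formulas in the code match the v-prediction parameterization), then takes one deterministic DDIM step:

$$ \hat{x}_0 = \sqrt{\bar\alpha_t}\,x_t - \sqrt{1-\bar\alpha_t}\,v_\theta, \qquad \hat\epsilon = \sqrt{\bar\alpha_t}\,v_\theta + \sqrt{1-\bar\alpha_t}\,x_t, \qquad x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\hat\epsilon . $$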
```python
def step(
    self,
    model_output: torch.FloatTensor,
    timestep: int,
    sample: torch.FloatTensor,
) -> list:
    """Perform a single step of denoising in the diffusion process."""
    prev_timestep = timestep - self.config.num_train_timesteps // self.num_inference_steps
    alpha_prod_t = self.alphas_cumprod[timestep]
    alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod
    beta_prod_t = 1 - alpha_prod_t
    pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
    pred_epsilon = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample
    pred_sample_direction = (1 - alpha_prod_t_prev) ** (0.5) * pred_epsilon
    prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction
    return prev_sample, pred_original_sample
```
understand-stable-diffusion-slidev-notebooks/scheduler.ipynb
understand-stable-diffusion-slidev-notebooks/scheduler.ipynb
understand-stable-diffusion-slidev-notebooks/scheduler.ipynb
<iframe frameborder="0" scrolling="no" class="scale-40 -translate-y-1/2 absolute top-50% right-25% w-full h-240%" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fwithout_scheduler.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <iframe frameborder="0" scrolling="no" class="scale-40 -translate-y-1/2 absolute top-54% left-25% w-full h-240%" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2F606a033780f0c9aa0681fd1468f91f3961a73a3f%2Fembed%2Fwith_scheduler.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
understand-stable-diffusion-slidev-notebooks/scheduler_necessity.ipynb
Used for Denoising
<iframe frameborder="0" scrolling="yes" class="overflow-scroll iframe-full-code" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Funet.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Olaf Ronneberger, Philipp Fischer, Thomas Brox: “U-Net: Convolutional Networks for Biomedical Image Segmentation”, 2015; arXiv:1505.04597.
<iframe frameborder="0" scrolling="yes" class="emg-res-transformer" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2F675b3fdaf4435e9982f92ff933f78db64f16a980%2Fsrc%2Fparediffusers%2Fmodels%2Funet_2d_blocks.py%23L114-L141&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-res-transformer { transform: scale(0.68) translate(-50%, -50%); transform-origin: top left; position: absolute; top: 63%; left: 50%; width: 100%; height: 130%; } </style>
Prompt: VAE, abstract style, highly detailed, colors and shapes
Step 4: vae_decode
Understand the whole flow
- L112: Decode the latents into image (pixel) space.
- L113: Normalization is applied during training, so denormalize the output.
- L114: Convert the tensor to a PIL.Image (a sketch of these helpers follows the code).
::right::
```python
@torch.no_grad()
def vae_decode(self, latents):
    """
    Decode the latent tensors using the VAE to produce an image.
    """
    image = self.vae.decode(latents / self.vae.config.scaling_factor)[0]
    image = self.denormalize(image)
    image = self.tensor_to_image(image)
    return image
```
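The denormalize and tensor_to_image helpers are not shown on this slide (the actual parediffusers vae.py is embedded next); as a hedged sketch, typical implementations look roughly like this:

```python
import torch
from PIL import Image

def denormalize(image: torch.Tensor) -> torch.Tensor:
    # Training maps images to [-1, 1]; map them back to [0, 1].
    return (image / 2 + 0.5).clamp(0, 1)

def tensor_to_image(image: torch.Tensor) -> Image.Image:
    # (batch, channels, height, width) float tensor in [0, 1] -> uint8 PIL image.
    array = (image[0].permute(1, 2, 0).float().cpu().numpy() * 255).round().astype("uint8")
    return Image.fromarray(array)
```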
<iframe frameborder="0" scrolling="yes" class="overflow-scroll iframe-full-code" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Fvae.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Prompt: Summary, long-exposure photography, masterpieces
Step 1: Convert the prompt to an embedding.
Step 2: Create a random latent.
Step 3: Denoise using the Scheduler and UNet.
Step 4: Decode to pixel space with the VAE.
```python
def __call__(self, prompt: str, height: int = 512, width: int = 512, ...):
    prompt_embeds = self.encode_prompt(prompt)
    latents = self.get_latent(width, height).unsqueeze(dim=0)
    latents = self.denoise(latents, prompt_embeds, ...)
    image = self.vae_decode(latents)
    return image
```
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
```text
init          torch.Size([2, 4, 64, 64])
conv_in       torch.Size([2, 320, 64, 64])
down_blocks_0 torch.Size([2, 320, 32, 32])
down_blocks_1 torch.Size([2, 640, 16, 16])
down_blocks_2 torch.Size([2, 1280, 8, 8])
down_blocks_3 torch.Size([2, 1280, 8, 8])
mid_block     torch.Size([2, 1280, 8, 8])
up_blocks_0   torch.Size([2, 1280, 16, 16])
up_blocks_1   torch.Size([2, 1280, 32, 32])
up_blocks_2   torch.Size([2, 640, 64, 64])
up_blocks_3   torch.Size([2, 320, 64, 64])
conv_out      torch.Size([2, 4, 64, 64])
```