theme: seriph
colorSchema: light
background: backgrounds/understand-sd.webp
title: Understand Stable Diffusion from Code
info: |
  This slide explains the mechanism of image generation with Latent Diffusion Models through source code, using parediffusers, a library that simplifies diffusers.
author: masaishi
keywords:
  - Stable Diffusion
  - Diffusers
  - parediffusers
  - AI
  - ML
  - Generative Models
favicon: images/icon_tea_light.webp
export:
  format: pdf
  timeout: 30000
  dark: false
  withClicks: false
lineNumbers: true
class: text-center
highlighter: shiki
drawings:
  persist: true
transition: slide-left
mdc: true
fonts:
  sans: Noto Serif JP, serif

Understand Stable Diffusion from Code

Prompt: Understand Stable Diffusion from code, cyberpunk theme, best quality, high resolution, concept art


title: Introduction

2. Masamune Ishihara

Computer Engineering Undergrad at University of California, Santa Cruz
I'm interested in AI/ML and GIS.

Likes:

  • Tea
  • Tennis


level: 2 layout: center

Purpose

Introduce image generation process with code


level: 2 layout: center transition: fade

About

In writing this slide, I realized there were many things I did not understand myself. I may have explained some things incorrectly, so I would appreciate it if you could let me know via the links below if anything is unclear or mistaken.



Issues: Please let me know if you find any mistakes.

Discussions: Please let me know if you have questions.

Pull Requests: Please let me know if you have any improvements.


level: 2 layout: center

About

The concept of this slide is to introduce the flow of image generation through code, so essentially all of the code in this slide can actually be run.


Repository list

understand-stable-diffusion-slidev: Repository of this slide.

understand-stable-diffusion-slidev-notebooks: Notebooks for generating sample images and gifs.

parediffusers: Simple library for generating images without using huggingface/diffusers.


layout: center title: Table of Contents

Table of Contents


layout: cover title: Flow of Image Generation background: /backgrounds/stable-diffusion.webp

4. Flow of Image Generation

Prompt: Stable Diffusion, watercolor painting, best quality, high resolution


level: 2 layout: center

What is Stable Diffusion?

  • An image generation model based on the Latent Diffusion Model (LDM), developed by Stability AI.
  • It can be used for Text-to-Image and Image-to-Image.
  • It can easily be run using Diffusers.
  • https://arxiv.org/abs/2112.10752

level: 2 layout: center

What is Diffusers?


level: 2 layout: image-right image: /exps/d-sd2-sample-42.webp

Open In Colab

Install the Diffusers library:

!pip install transformers diffusers accelerate -U

Generate an image from text:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
  "stabilityai/stable-diffusion-2",
  torch_dtype=torch.float16,
).to(device=torch.device("cuda"))
prompt = "painting depicting the sea, sunrise, ship, artstation, 4k, concept art"

image = pipe(prompt, width=512, height=512).images[0]
display(image)

level: 2 layout: center

Diffusers is highly flexible,

but understanding the code is difficult.


level: 2

<iframe frameborder="0" scrolling="yes" class="overflow-scroll mt-10" style="width:100%; height:85%;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fdiffusers%2Fblob%2Fmain%2Fsrc%2Fdiffusers%2Fpipelines%2Fstable_diffusion%2Fpipeline_stable_diffusion.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>

level: 2

<iframe frameborder="0" scrolling="yes" class="overflow-scroll mt-10" style="width:100%; height:85%;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Fpipeline.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>

level: 2 layout: image-right image: /exps/p-sd2-sample-43.webp

Open In Colab

Install the PareDiffusers library:

!pip install parediffusers

Generate an image from text:

import torch
from parediffusers import PareDiffusionPipeline

pipe = PareDiffusionPipeline.from_pretrained(
  "stabilityai/stable-diffusion-2",
  device=torch.device("cuda"),
  dtype=torch.float16,
)
prompt = "painting depicting the sea, sunrise, ship, artstation, 4k, concept art"

image = pipe(prompt, width=512, height=512)
display(image)

level: 2 layout: center

How is image generation performed?


level: 2 layout: center transition: fade

```python {all}
image = pipe(prompt, width=512, height=512)
```
```python {all}
def __call__(self, prompt: str, height: int = 512, width: int = 512, ...):
	prompt_embeds = self.encode_prompt(prompt)
	latents = self.get_latent(width, height).unsqueeze(dim=0)
	latents = self.denoise(latents, prompt_embeds, ...)
	image = self.vae_decode(latents)
	return image
```
```md {all}
1. `encode_prompt` : Convert the prompt to an embedding.
2. `get_latent` : Create a random latent.
3. `denoise` : Denoise using the Scheduler and UNet.
4. `vae_decode` : Decode to pixel space with the VAE.
```

level: 2 layout: center

1. `encode_prompt` : Convert the prompt to an embedding.
2. `get_latent` : Create a random latent.
3. `denoise` : Denoise using the Scheduler and UNet.
4. `vae_decode` : Decode to pixel space with the VAE.


level: 2 layout: center

A Brief Theory


level: 2 layout: center

What is Latent Diffusion Model (LDM)?

A model that runs DDPM in latent space


level: 2

What is the Denoising Diffusion Probabilistic Model (DDPM)?

A model trained by adding noise to an image and then restoring the original image from the noise.

DDPMs are also used for audio and other kinds of data, but this slide focuses on images.

  • The diffusion (forward) process adds noise to the training data step by step, a stochastic process (Markov chain); see the sketch below.
  • The reverse process recovers the original data from the noisy data.

Jonathan Ho, Ajay Jain, Pieter Abbeel: “Denoising Diffusion Probabilistic Models”, 2020; arXiv:2006.11239.
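
The forward process can be sampled in closed form. Below is a minimal sketch, assuming the linear beta schedule from the DDPM paper; the names are illustrative and are not the parediffusers API.

```python
import torch

# Minimal sketch of the DDPM forward (diffusion) process, assuming a linear
# beta schedule. Illustrative only, not the parediffusers implementation.
num_train_timesteps = 1000
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form: more noise as t grows."""
    eps = torch.randn_like(x0)
    return alphas_cumprod[t].sqrt() * x0 + (1.0 - alphas_cumprod[t]).sqrt() * eps

noisy = add_noise(torch.zeros(3, 64, 64), t=500)  # a heavily noised "image"
```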


level: 2 layout: center

It is interesting that it is called a Diffusion Model

even though the diffusion process itself is not a neural network, just fixed processing.


level: 2 layout: center

What is Latent Diffusion Model (LDM)?

A model that runs DDPM in latent space


level: 2 layout: center transition: fade

Difference between the Loss Functions

$$ L_{DM} := \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(x_{t},t) \Vert_{2}^{2}\Big] \, . $$

$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t) \Vert_{2}^{2}\Big] \, . $$


level: 2 layout: center transition: fade

Latent Diffusion Model (LDM)

$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t) \Vert_{2}^{2}\Big] \, . $$


level: 2 layout: center

Latent Diffusion Model (LDM) with Conditioning

$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0, 1), t }\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t, \tau_\theta(y)) \Vert_{2}^{2}\Big] \, , $$

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V $$

$$ Q = W^{(i)}_Q \cdot \varphi_i(z_t), \; K = W^{(i)}_K \cdot \tau_\theta(y), \; V = W^{(i)}_V \cdot \tau_\theta(y) . $$
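
A rough sketch of the cross-attention that injects the prompt into the UNet: the dimensions below (320 UNet channels, 1024-dim text embeddings, d = 64) are illustrative values, not taken from an actual model config.

```python
import torch
import torch.nn.functional as F

# Cross-attention conditioning in isolation. Dimensions are illustrative.
d = 64
W_Q = torch.nn.Linear(320, d, bias=False)    # projects UNet features  phi_i(z_t)
W_K = torch.nn.Linear(1024, d, bias=False)   # projects text embeddings tau_theta(y)
W_V = torch.nn.Linear(1024, d, bias=False)

def cross_attention(z_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    Q, K, V = W_Q(z_feats), W_K(text_embeds), W_V(text_embeds)
    attn = F.softmax(Q @ K.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ V

out = cross_attention(torch.randn(4096, 320), torch.randn(77, 1024))  # shape (4096, 64)
```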


level: 2 transition: fade

Jonathan Ho, Ajay Jain, Pieter Abbeel: “Denoising Diffusion Probabilistic Models”, 2020; arXiv:2006.11239.

Stable Diffusion Figure

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer: “High-Resolution Image Synthesis with Latent Diffusion Models”, 2021; arXiv:2112.10752.


level: 2 layout: center

<iframe frameborder="0" scrolling="no" style="width:100%; height:163px;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2F035772c684ae8d16c7c908f185f6413b72658126%2Fsrc%2Fparediffusers%2Fpipeline.py%23L131-L134&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Stable Diffusion Figure

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer: “High-Resolution Image Synthesis with Latent Diffusion Models”, 2021; arXiv:2112.10752.


level: 2 layout: center

What is Latent Space?

A space into which the features of the input image are extracted
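
For intuition: encoding a 512×512×3 image with the Stable Diffusion 2 VAE gives a 4×64×64 latent. A quick shape check, using huggingface/diffusers' AutoencoderKL purely for illustration:

```python
import torch
from diffusers import AutoencoderKL

# Illustration: encode an image into latent space with the SD2 VAE
# (using huggingface/diffusers here, just to show the shapes).
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2", subfolder="vae")
image = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 4, 64, 64]) -- 1/8 of the spatial size
```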


level: 2 layout: center transition: fade

The flow of image generation in 4 steps

Step 1: Convert the prompt to an embedding.
Step 2: Create a random latent.
Step 3: Denoise using the Scheduler and UNet.
Step 4: Decode to pixel space with the VAE.


level: 2 layout: center

The flow of image generation in 4 steps

Step 1: encode_prompt
Step 2: get_latent
Step 3: denoise
Step 4: vae_decode


layout: cover title: "Step 1: encode_prompt" background: /backgrounds/pipeline.webp

Step 1: encode_prompt

Prompt: Pipeline, cyberpunk theme, best quality, high resolution, concept art


level: 2 layout: center transition: fade

Step 1: encode_prompt

Convert prompt to embedding.


level: 2 layout: center

Step 1: encode_prompt

Convert prompts into a form that is easy for the model to handle.


level: 2 layout: center

Necessities

From huggingface/transformers


level: 2 layout: two-cols transition: fade

Step 1: encode_prompt

encode_prompt calls another function (get_embes) twice

::right::

pipeline.py#L41-L48

def encode_prompt(self, prompt: str):
	"""
	Encode the text prompt into embeddings using the text encoder.
	"""
	prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
	negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
	prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
	return prompt_embeds

level: 2 layout: two-cols transition: fade

Step 1: encode_prompt

Where are Necessities used?

::right::

pipeline.py#L41-L57

def encode_prompt(self, prompt: str):
	"""
	Encode the text prompt into embeddings using the text encoder.
	"""
	prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
	negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
	prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
	return prompt_embeds
 
def get_embes(self, prompt, max_length):
	"""
	Encode the text prompt into embeddings using the text encoder.
	"""
	text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
	text_input_ids = text_inputs.input_ids.to(self.device)
	prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
	return prompt_embeds

level: 2 layout: two-cols transition: fade

Step 1: encode_prompt

Where are Necessities used?

  • L54: CLIPTokenizer: Tokenizes the text (prompt). Turning the text into a vector of token IDs makes it easier for the model to handle.

  • L56: CLIPTextModel: A multi-modal model of language and images. For image generation, it extracts a representation (embedding) of the image the prompt describes.

::right::

pipeline.py#L21-L39

@classmethod
def from_pretrained(cls, model_name, device=torch.device("cuda"), dtype=torch.float16):
	# Comments omitted
	tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
	text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
	scheduler = PareDDIMScheduler.from_config(model_name, subfolder="scheduler")
	unet = PareUNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
	vae = PareAutoencoderKL.from_pretrained(model_name, subfolder="vae")
	return cls(tokenizer, text_encoder, scheduler, unet, vae, device, dtype)

pipeline.py#L50-L57

def get_embes(self, prompt, max_length):
	"""
	Encode the text prompt into embeddings using the text encoder.
	"""
	text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
	text_input_ids = text_inputs.input_ids.to(self.device)
	prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
	return prompt_embeds

level: 2 layout: two-cols

Step 1: encode_prompt

Understand the whole flow

  • L54: CLIPTokenizer: Tokenizes the text (prompt). Turning the text into a vector of token IDs makes it easier for the model to handle.

  • L56: CLIPTextModel: A multi-modal model of language and images. For image generation, it extracts a representation (embedding) of the image the prompt describes.

  • L46: The negative prompt is kept as an empty string to keep things simple. (A shape check follows the code below.)

::right::

pipeline.py#L34-L35

	tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
	text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")

pipeline.py#L41-L57

def encode_prompt(self, prompt: str):
	"""
	Encode the text prompt into embeddings using the text encoder.
	"""
	prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
	negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
	prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
	return prompt_embeds
 
def get_embes(self, prompt, max_length):
	"""
	Encode the text prompt into embeddings using the text encoder.
	"""
	text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
	text_input_ids = text_inputs.input_ids.to(self.device)
	prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
	return prompt_embeds
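
To make the shapes concrete, here is a small check of what the tokenizer and text encoder return for stable-diffusion-2 (77 is tokenizer.model_max_length; 1024 is the hidden size of its text encoder):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Shape check for Step 1, loading the same tokenizer/text encoder as from_pretrained above.
tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2", subfolder="text_encoder")

text_inputs = tokenizer("a ship at sunrise", padding="max_length",
                        max_length=tokenizer.model_max_length,
                        truncation=True, return_tensors="pt")
print(text_inputs.input_ids.shape)  # torch.Size([1, 77])

with torch.no_grad():
    prompt_embeds = text_encoder(text_inputs.input_ids)[0]
print(prompt_embeds.shape)  # torch.Size([1, 77, 1024]) for stable-diffusion-2
```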

level: 2

<iframe frameborder="0" scrolling="no" class="emg-iframe-text-inputs" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fch5-text_inputs.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-text-inputs { transform: scale(0.9) translate(-50%, -50%); /* Apply both transformations */ transform-origin: top left; position: absolute; top: 50%; left: 50%; width: 100%; height: 100%; } </style>

level: 2

<iframe frameborder="0" scrolling="no" class="emg-iframe-prompt-embeds" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fch5-prompt_embeds.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-prompt-embeds { transform: scale(0.8) translate(-50%, -50%); transform-origin: top left; position: absolute; top: 57%; left: 50%; width: 100%; height: 130%; } </style>

level: 2 layout: center

<iframe frameborder="0" scrolling="yes" class="overflow-scroll emg-iframe-play-prompt-embeds" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fnotebooks%2Fch0.0.2_Play_prompt_embeds.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-play-prompt-embeds { transform: scale(0.5) translate(-50%, -50%); /* Apply both transformations */ transform-origin: top left; position: absolute; top: 50%; left: 50%; width: 100%; height: 160%; } </style>

layout: cover title: "Step 2: get_latent" background: /backgrounds/scheduler.webp

Step 2: get_latent

Prompt: Scheduler, flat vector illustration, best quality, high resolution


level: 2 layout: center

Step 2: get_latent

Generate a random tensor at 1/8 of the image size


level: 2 layout: center

Necessities

torch.randn


level: 2 layout: custom-two-cols leftPercent: 0.4

Step 2: get_latent

Understand the whole flow

  • L63: Generate a random tensor at 1/8 of the image size (a shape check follows the code)

::right::

pipeline.py#L59-L65

def get_latent(self, width: int, height: int):
	"""
	Generate a random initial latent tensor to start the diffusion process.
	"""
	return torch.randn((4, width // 8, height // 8)).to(
		device=self.device, dtype=self.dtype
	)
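
A quick shape check of what __call__ passes to denoise (the unsqueeze in __call__ adds the batch dimension):

```python
import torch

# For a 512x512 image: a (4, 64, 64) latent, plus a batch dimension -> (1, 4, 64, 64).
latent = torch.randn((4, 512 // 8, 512 // 8)).unsqueeze(dim=0)
print(latent.shape)  # torch.Size([1, 4, 64, 64])
```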

layout: cover title: "Step 3: denoise" background: /backgrounds/unet.webp

Step 3: denoise

Prompt: UNet, watercolor painting, detailed, brush strokes, best quality, high resolution


level: 2 layout: center

Step 3: denoise

Denoising using the Scheduler and UNet.


level: 2 layout: center

understand-stable-diffusion-slidev-notebooks/denoise.ipynb


level: 2 layout: center

understand-stable-diffusion-slidev-notebooks/denoise.ipynb


level: 2 layout: center

Necessities


level: 2 layout: center

Step 3: denoise

Setting the Necessities aside, the whole flow


level: 2 layout: custom-two-cols leftPercent: 0.5 transition: fade

Step 3: denoise

Where are Necessities used?

  • L86: UNet

  • L91: Scheduler

::right::

pipeline.py#L75-L93

@torch.no_grad()
def denoise(self, latents, prompt_embeds, num_inference_steps=50, guidance_scale=7.5):
	"""
	Iteratively denoise the latent space using the diffusion model to produce an image.
	"""
	timesteps, num_inference_steps = self.retrieve_timesteps(num_inference_steps)

	for t in timesteps:
		latent_model_input = torch.cat([latents] * 2)
		
		# Predict the noise residual for the current timestep
		noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
		uncond_residual, text_cond_residual = noise_residual.chunk(2)
		guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)

		# Update latents by reversing the diffusion process for the current timestep
		latents = self.scheduler.step(guided_noise_residual, t, latents)[0]

	return latents

level: 2 layout: custom-two-cols leftPercent: 0.5 transition: fade

Step 3: denoise

Where are Necessities used?

  • L86: UNet2DConditionModel

  • L91: DDIMScheduler

::right::

pipeline.py#L21-L39

@classmethod
def from_pretrained(cls, model_name, device=torch.device("cuda"), dtype=torch.float16):
	# Comments omitted
	tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
	text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
	scheduler = PareDDIMScheduler.from_config(model_name, subfolder="scheduler")
	unet = PareUNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
	vae = PareAutoencoderKL.from_pretrained(model_name, subfolder="vae")
	return cls(tokenizer, text_encoder, scheduler, unet, vae, device, dtype)

pipeline.py#L82-L93

	for t in timesteps:
		latent_model_input = torch.cat([latents] * 2)
		
		# Predict the noise residual for the current timestep
		noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
		uncond_residual, text_cond_residual = noise_residual.chunk(2)
		guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)

		# Update latents by reversing the diffusion process for the current timestep
		latents = self.scheduler.step(guided_noise_residual, t, latents)[0]

	return latents

level: 2 layout: custom-two-cols leftPercent: 0.5

Step 3: denoise

Understand the whole flow

  • L80: Get the timesteps from the Scheduler
    (the Scheduler is described later)

  • L82: Loop over the timesteps
    (number of timesteps = num_inference_steps)

  • L86: Predict the noise with the UNet
    (the UNet is described later)

  • L88: Compute how strongly the prompt is taken into account; see the sketch after the code
    (Reference: Jonathan Ho, Tim Salimans: “Classifier-Free Diffusion Guidance”, 2022; arXiv:2207.12598.)

  • L91: The Scheduler determines the strength of the denoising.

::right::

pipeline.py#L82-L93

@torch.no_grad()
def denoise(self, latents, prompt_embeds, num_inference_steps=50, guidance_scale=7.5):
	"""
	Iteratively denoise the latent space using the diffusion model to produce an image.
	"""
	timesteps, num_inference_steps = self.retrieve_timesteps(num_inference_steps)

	for t in timesteps:
		latent_model_input = torch.cat([latents] * 2)
		
		# Predict the noise residual for the current timestep
		noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
		uncond_residual, text_cond_residual = noise_residual.chunk(2)
		guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)

		# Update latents by reversing the diffusion process for the current timestep
		latents = self.scheduler.step(guided_noise_residual, t, latents)[0]

	return latents
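
The guidance line (L88) in isolation, with random tensors standing in for the UNet output:

```python
import torch

# Classifier-free guidance on its own; random tensors stand in for the UNet outputs.
guidance_scale = 7.5
noise_residual = torch.randn(2, 4, 64, 64)  # [unconditional, text-conditioned]
uncond_residual, text_cond_residual = noise_residual.chunk(2)
guided = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
# guidance_scale = 1.0 reproduces the text-conditioned prediction;
# larger values push the result further away from the unconditional one.
```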

level: 2

Determine the strength
of denoising

<iframe frameborder="0" scrolling="yes" class="overflow-scroll iframe-full-code" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Fscheduler.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>

level: 2 layout: custom-two-cols leftPercent: 0.5

Scheduler

  • L49: Get alpha_prod_t (0 to 1.0)
    (indicates how much of the original data is retained)

  • L50: Get alpha_prod_t_prev (0 to 1.0)

  • L52: alpha_prod_t + beta_prod_t = 1

  • L53: Estimate the original sample from the current sample and the model output.

  • L54: Estimate the noise that was added.

  • L56: Compute the direction that moves the sample back toward the original image.

  • L57: Combine these three values to compute the sample one step further along the denoising process.

::right::

scheduler.py#L40-L59

def step(
	self,
	model_output: torch.FloatTensor,
	timestep: int,
	sample: torch.FloatTensor,
) -> list:
	"""Perform a single step of denoising in the diffusion process."""
	prev_timestep = timestep - self.config.num_train_timesteps // self.num_inference_steps

	alpha_prod_t = self.alphas_cumprod[timestep]
	alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod

	beta_prod_t = 1 - alpha_prod_t
	pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
	pred_epsilon = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample

	pred_sample_direction = (1 - alpha_prod_t_prev) ** (0.5) * pred_epsilon
	prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction

	return prev_sample, pred_original_sample

level: 2

scheduler.py#L40-L59

def step(
	self,
	model_output: torch.FloatTensor,
	timestep: int,
	sample: torch.FloatTensor,
) -> list:
	"""Perform a single step of denoising in the diffusion process."""
	prev_timestep = timestep - self.config.num_train_timesteps // self.num_inference_steps

	alpha_prod_t = self.alphas_cumprod[timestep]
	alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod

	beta_prod_t = 1 - alpha_prod_t
	pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
	pred_epsilon = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample

	pred_sample_direction = (1 - alpha_prod_t_prev) ** (0.5) * pred_epsilon
	prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction

	return prev_sample, pred_original_sample
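
Where alphas_cumprod comes from: Stable Diffusion uses a "scaled_linear" beta schedule. The sketch below uses the usual default values (beta_start = 0.00085, beta_end = 0.012) as an assumption, not values read from the parediffusers config.

```python
import torch

# Sketch of how alphas_cumprod is typically built for Stable Diffusion
# (scaled_linear schedule with the usual defaults; assumed, not read from config).
num_train_timesteps = 1000
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, num_train_timesteps) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
print(alphas_cumprod[0].item(), alphas_cumprod[-1].item())  # ~0.999 at t=0, close to 0 at t=T
```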

level: 2 layout: center transition: fade

understand-stable-diffusion-slidev-notebooks/scheduler.ipynb


level: 2 layout: center transition: fade

+

understand-stable-diffusion-slidev-notebooks/scheduler.ipynb


level: 2 layout: center

understand-stable-diffusion-slidev-notebooks/scheduler.ipynb


level: 2

<iframe frameborder="0" scrolling="no" class="scale-40 -translate-y-1/2 absolute top-50% right-25% w-full h-240%" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fwithout_scheduler.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <iframe frameborder="0" scrolling="no" class="scale-40 -translate-y-1/2 absolute top-54% left-25% w-full h-240%" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2F606a033780f0c9aa0681fd1468f91f3961a73a3f%2Fembed%2Fwith_scheduler.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>

level: 2 layout: center

I don't have any idea why ratio = 1.5 looks good.

understand-stable-diffusion-slidev-notebooks/scheduler_necessity.ipynb


level: 2

Used for Denoising

<iframe frameborder="0" scrolling="yes" class="overflow-scroll iframe-full-code" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Funet.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>

level: 2 layout: image image: /images/unet-figure.webp backgroundSize: 70% class: 'text-black'

Olaf Ronneberger, Philipp Fischer, Thomas Brox: “U-Net: Convolutional Networks for Biomedical Image Segmentation”, 2015; arXiv:1505.04597.


level: 2

Create UNet using Resnet and Transformer

<iframe frameborder="0" scrolling="yes" class="emg-res-transformer" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2F675b3fdaf4435e9982f92ff933f78db64f16a980%2Fsrc%2Fparediffusers%2Fmodels%2Funet_2d_blocks.py%23L114-L141&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-res-transformer { transform: scale(0.68) translate(-50%, -50%); transform-origin: top left; position: absolute; top: 63%; left: 50%; width: 100%; height: 130%; } </style>
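
As a schematic (not the parediffusers implementation), a down block roughly combines a ResNet-style convolution block, cross-attention on the prompt embeddings, and a downsampling convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conceptual sketch of a UNet down block: ResNet block + cross-attention + downsample.
# Channel counts and head counts are illustrative, not the real SD2 UNet config.
class TinyResnetBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.norm1, self.conv1 = nn.GroupNorm(32, channels), nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2, self.conv2 = nn.GroupNorm(32, channels), nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        h = self.conv1(F.silu(self.norm1(x)))
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h  # residual connection

class TinyDownBlock(nn.Module):
    def __init__(self, channels, text_dim=1024):
        super().__init__()
        self.resnet = TinyResnetBlock(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=8, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.downsample = nn.Conv2d(channels, channels, 3, stride=2, padding=1)

    def forward(self, x, text_embeds):
        x = self.resnet(x)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)                       # (B, H*W, C)
        attn_out, _ = self.attn(seq, text_embeds, text_embeds)   # cross-attention on the prompt
        x = x + attn_out.transpose(1, 2).reshape(b, c, h, w)
        return self.downsample(x)

out = TinyDownBlock(320)(torch.randn(1, 320, 32, 32), torch.randn(1, 77, 1024))
print(out.shape)  # torch.Size([1, 320, 16, 16])
```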

layout: cover title: "Step 4: vae_decode" background: /backgrounds/vae.webp

Step 4: vae_decode

Prompt: VAE, abstract style, highly detailed, colors and shapes


level: 2 layout: center

Step 4: vae_decode

Decode into the image with VAE


level: 2 layout: custom-two-cols leftPercent: 0.4

Step 4: vae_decode

Understand the whole flow

  • L112: Decode the latents into image (pixel) space.

  • L113: Images are normalized during training, so denormalize the output.

  • L114: Convert the tensor to a PIL.Image (a sketch of these helpers follows the code).

::right::

pipeline.py#L107-L105

@torch.no_grad()
def vae_decode(self, latents):
	"""
	Decode the latent tensors using the VAE to produce an image.
	"""
	image = self.vae.decode(latents / self.vae.config.scaling_factor)[0]
	image = self.denormalize(image)
	image = self.tensor_to_image(image)
	return image
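
The helpers denormalize and tensor_to_image are referenced on the left but not shown in the listing; here is a hedged sketch of what they typically do (the actual parediffusers implementation may differ):

```python
import numpy as np
import torch
from PIL import Image

# Hedged sketch of the helpers used above; not the exact parediffusers code.
def denormalize(image: torch.Tensor) -> torch.Tensor:
    # Training normalizes images to [-1, 1]; map back to [0, 1].
    return (image / 2 + 0.5).clamp(0, 1)

def tensor_to_image(image: torch.Tensor) -> Image.Image:
    # (B, C, H, W) float tensor in [0, 1] -> PIL image.
    array = image.cpu().permute(0, 2, 3, 1).float().numpy()[0]
    return Image.fromarray((array * 255).round().astype(np.uint8))
```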

level: 2 layout: center

Variational Autoencoder (VAE)


level: 2

<iframe frameborder="0" scrolling="yes" class="overflow-scroll iframe-full-code" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Fvae.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>

level: 2

[1, 512, 64, 64]

[1, 512, 64, 64]

[1, 512, 128, 128]

[1, 512, 256, 256]

[1, 256, 512, 512]

[1, 128, 512, 512]

understand-stable-diffusion-slidev-notebooks/vae.ipynb

layout: cover title: Conclusion background: /backgrounds/summary.webp

9. Conclusion

Prompt: Summary, long-exposure photography, masterpieces


level: 2 layout: center

It's fun to read the library code!


level: 2 layout: center

A paper that is cited throughout the library

diffusers/.../pipeline.py


level: 2 layout: center

Conclusion


Step 1: Convert the prompt to an embedding.
Step 2: Create a random latent.
Step 3: Denoise using the Scheduler and UNet.
Step 4: Decode to pixel space with the VAE.


level: 2 layout: center

Conclusion

pipeline.py#L117-L135

def __call__(self, prompt: str, height: int = 512, width: int = 512, ...):
	prompt_embeds = self.encode_prompt(prompt)
	latents = self.get_latent(width, height).unsqueeze(dim=0)
	latents = self.denoise(latents, prompt_embeds, ...)
	image = self.vae_decode(latents)
	return image


level: 1 layout: center

Appendix


level: 2 layout: center

Other denoising samples

understand-stable-diffusion-slidev-notebooks/denoise.ipynb


level: 2 layout: center

Other decoded denoising samples

understand-stable-diffusion-slidev-notebooks/denoise.ipynb


level: 2 layout: center

Is the UNet really U-shaped?

init             torch.Size([2, 4, 64, 64])
conv_in          torch.Size([2, 320, 64, 64])

down_blocks_0    torch.Size([2, 320, 32, 32])
down_blocks_1    torch.Size([2, 640, 16, 16])
down_blocks_2    torch.Size([2, 1280, 8, 8])
down_blocks_3    torch.Size([2, 1280, 8, 8])

mid_block        torch.Size([2, 1280, 8, 8])

up_blocks0       torch.Size([2, 1280, 16, 16])
up_blocks1       torch.Size([2, 1280, 32, 32])
up_blocks2       torch.Size([2, 640, 64, 64])
up_blocks3       torch.Size([2, 320, 64, 64])

conv_out         torch.Size([2, 4, 64, 64])

understand-stable-diffusion-slidev-notebooks/unet.ipynb