| Key | Value |
| --- | --- |
| theme | seriph |
| colorSchema | light |
| background | backgrounds/understand-sd.webp |
| title | Understand Stable Diffusion from Code |
| info | This slide explains image generation with Latent Diffusion Models through source code, using parediffusers, a library that simplifies diffusers. |
| author | masaishi |
| keywords | |
| favicon | images/icon_tea_light.webp |
| export | |
| lineNumbers | true |
| class | text-center |
| highlighter | shiki |
| drawings | |
| transition | slide-left |
| mdc | true |
| fonts | |
Prompt: Understand Stable Diffusion from code, cyberpunk theme, best quality, high resolution, concept art
I'm interested in AI/ML and GIS.
- Tea
- Tennis
Purpose
While writing this slide, I realized there were many things I did not fully understand, so some explanations may be wrong. If anything is unclear or mistaken, please let me know via the links below.
Issues: Please let me know if you find any mistakes.
Discussions: Please ask if you have any questions.
Pull Requests: Improvements are welcome.
The concept of this slide is to introduce the flow of image generation through code, so essentially all of the code shown here can actually be run.
understand-stable-diffusion-slidev: Repository of this slide.
understand-stable-diffusion-slidev-notebooks: Notebooks for generating sample images and gifs.
parediffusers: Simple library for generating images without using huggingface/diffusers.
Prompt: Stable Diffusion, watercolor painting, best quality, high resolution
- An image generation model based on the Latent Diffusion Model (LDM), developed by Stability AI.
- It can be used for Text-to-Image and Image-to-Image generation.
- It can easily be run using the Diffusers library.
- https://arxiv.org/abs/2112.10752
- A library for diffusion models developed by Hugging Face🤗.
- Makes it easy to run many image generation models.
- https://github.com/huggingface/diffusers
Install the Diffusers library:
```bash
!pip install transformers diffusers accelerate -U
```
Generate an image from text:
```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion 2 pipeline in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",
    torch_dtype=torch.float16,
).to(device=torch.device("cuda"))

prompt = "painting depicting the sea, sunrise, ship, artstation, 4k, concept art"
image = pipe(prompt, width=512, height=512).images[0]
display(image)
```
<iframe frameborder="0" scrolling="yes" class="overflow-scroll mt-10" style="width:100%; height:85%;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fdiffusers%2Fblob%2Fmain%2Fsrc%2Fdiffusers%2Fpipelines%2Fstable_diffusion%2Fpipeline_stable_diffusion.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
<iframe frameborder="0" scrolling="yes" class="overflow-scroll mt-10" style="width:100%; height:85%;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Fpipeline.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Install the PareDiffusers library:
```bash
!pip install parediffusers
```
Generate an image from text:
```python
import torch
from parediffusers import PareDiffusionPipeline

pipe = PareDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",
    device=torch.device("cuda"),
    dtype=torch.float16,
)

prompt = "painting depicting the sea, sunrise, ship, artstation, 4k, concept art"
image = pipe(prompt, width=512, height=512)
display(image)
```
```python {all}
image = pipe(prompt, width=512, height=512)
```
```python {all}
def __call__(self, prompt: str, height: int = 512, width: int = 512, ...):
prompt_embeds = self.encode_prompt(prompt)
latents = self.get_latent(width, height).unsqueeze(dim=0)
latents = self.denoise(latents, prompt_embeds, ...)
image = self.vae_decode(latents)
return image
```
```md {all}
1. `encode_prompt` : Convert the prompt to an embedding.
2. `get_latent` : Create a random latent.
3. `denoise` : Denoise using the Scheduler and UNet.
4. `vae_decode` : Decode to pixel space with the VAE.
```
1. `encode_prompt` : Convert the prompt to an embedding.
2. `get_latent` : Create a random latent.
3. `denoise` : Denoise using the Scheduler and UNet.
4. `vae_decode` : Decode to pixel space with the VAE.
What is a Latent Diffusion Model (LDM)?
What is a Denoising Diffusion Probabilistic Model (DDPM)?
DDPMs are also used for audio and other kinds of data, but this slide focuses on images.
- The diffusion (forward) process gradually adds noise to the training data; it is a stochastic process (Markov chain). A minimal sketch follows below.
- The reverse process learns to recover the original data from the noisy data.
Jonathan Ho, Ajay Jain, Pieter Abbeel: “Denoising Diffusion Probabilistic Models”, 2020; arXiv:2006.11239.
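As a minimal sketch (not code from parediffusers; the beta schedule values are only illustrative), the forward process can be sampled in closed form: the noisy sample at step t keeps a fraction of the original signal and mixes in Gaussian noise.

```python
import torch

# Illustrative linear beta schedule; the cumulative product of (1 - beta)
# tells us how much of the original signal survives up to step t.
num_train_timesteps = 1000
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(3, 64, 64)      # stand-in for a clean training sample
t = 500                          # an intermediate timestep
noise = torch.randn_like(x0)

# Closed form of the Markov chain: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
xt = alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * noise
```

The reverse process then trains a model to predict the noise (or a related quantity) given x_t and t.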
What is a Latent Diffusion Model (LDM)?
Difference of Loss Functions
$$ L_{DM} := \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(x_{t},t) \Vert_{2}^{2}\Big] \, . $$
$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t) \Vert_{2}^{2}\Big] \, . $$
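For context (standard DDPM/LDM background, not shown on the slide): both losses use the same forward-noising relation, and the only difference is whether it is applied to the image $x$ itself or to its latent $z = \mathcal{E}(x)$.

$$ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). $$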
Latent Diffusion Model (LDM)
$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0, 1), t}\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t) \Vert_{2}^{2}\Big] \, . $$
Latent Diffusion Model (LDM) with Conditioning
$$ L_{LDM} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0, 1), t }\Big[ \Vert \epsilon - \epsilon_\theta(z_{t},t, \tau_\theta(y)) \Vert_{2}^{2}\Big] \, , $$
$$ Q = W^{(i)}_Q \cdot \varphi_i(z_t), \; K = W^{(i)}_K \cdot \tau_\theta(y), \; V = W^{(i)}_V \cdot \tau_\theta(y) . $$
Jonathan Ho, Ajay Jain, Pieter Abbeel: “Denoising Diffusion Probabilistic Models”, 2020; arXiv:2006.11239.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer: “High-Resolution Image Synthesis with Latent Diffusion Models”, 2021; arXiv:2112.10752.
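As a minimal sketch of the cross-attention in the conditioning equation above (the dimensions below are illustrative, not the actual UNet sizes): the flattened UNet features $\varphi_i(z_t)$ provide the queries, and the text embedding $\tau_\theta(y)$ provides the keys and values.

```python
import torch
import torch.nn.functional as F

d_latent, d_text, d_attn = 320, 1024, 320
W_Q = torch.nn.Linear(d_latent, d_attn, bias=False)
W_K = torch.nn.Linear(d_text, d_attn, bias=False)
W_V = torch.nn.Linear(d_text, d_attn, bias=False)

phi_z = torch.randn(1, 64 * 64, d_latent)   # flattened UNet feature map phi_i(z_t)
tau_y = torch.randn(1, 77, d_text)          # text embedding tau_theta(y)

Q, K, V = W_Q(phi_z), W_K(tau_y), W_V(tau_y)
# Scaled dot-product attention over the 77 text tokens.
attn = F.softmax(Q @ K.transpose(1, 2) / d_attn**0.5, dim=-1)   # (1, 4096, 77)
out = attn @ V                                                  # (1, 4096, 320)
```

The softmax-weighted sum injects prompt information into every spatial location of the latent feature map.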
<iframe frameborder="0" scrolling="no" style="width:100%; height:163px;" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2F035772c684ae8d16c7c908f185f6413b72658126%2Fsrc%2Fparediffusers%2Fpipeline.py%23L131-L134&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer: “High-Resolution Image Synthesis with Latent Diffusion Models”, 2021; arXiv:2112.10752.
What is Latent Space?
The flow of image generation in 4 steps
Step 1: Convert the prompt to an embedding.
Step 2: Create a random latent.
Step 3: Denoise using the Scheduler and UNet.
Step 4: Decode to pixel space with the VAE.
The flow of image generation in 4 steps
Prompt: Pipeline, cyberpunk theme, best quality, high resolution, concept art
Step 1: encode_prompt
Step 1: encode_prompt
Required components
encode_prompt calls another function, get_embes, twice
::right::
```python
def encode_prompt(self, prompt: str):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
    negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
    prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
    return prompt_embeds
```
Where are the required components used?
::right::
```python
def encode_prompt(self, prompt: str):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
    negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
    prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
    return prompt_embeds
```

```python
def get_embes(self, prompt, max_length):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    text_input_ids = text_inputs.input_ids.to(self.device)
    prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
    return prompt_embeds
```
Where are the required components used?
- L54: `CLIPTokenizer`: Tokenizes the text (prompt). Turning text into a vector of token IDs makes it easier for the model to handle.
- L56: `CLIPTextModel`: A multimodal model of language and images. In image generation, it extracts an embedding of the image described by the prompt. (A small sketch of what these produce follows the code below.)
::right::
```python
@classmethod
def from_pretrained(cls, model_name, device=torch.device("cuda"), dtype=torch.float16):
    # Comments omitted
    tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
    scheduler = PareDDIMScheduler.from_config(model_name, subfolder="scheduler")
    unet = PareUNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
    vae = PareAutoencoderKL.from_pretrained(model_name, subfolder="vae")
    return cls(tokenizer, text_encoder, scheduler, unet, vae, device, dtype)
```
```python
def get_embes(self, prompt, max_length):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    text_input_ids = text_inputs.input_ids.to(self.device)
    prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
    return prompt_embeds
```
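As mentioned above, here is a small sketch of what the two components produce for a short prompt. The 77-token length comes from the tokenizer's model_max_length; the 1024-dimensional embedding is what I expect for stabilityai/stable-diffusion-2, so treat the exact sizes as an assumption.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "stabilityai/stable-diffusion-2"
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")

# Tokenize: text -> fixed-length sequence of token IDs (padded/truncated to 77).
text_inputs = tokenizer(
    "painting depicting the sea, sunrise, ship",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
print(text_inputs.input_ids.shape)  # torch.Size([1, 77])

# Encode: token IDs -> one embedding vector per token.
with torch.no_grad():
    prompt_embeds = text_encoder(text_inputs.input_ids)[0]
print(prompt_embeds.shape)  # e.g. torch.Size([1, 77, 1024]) for this model
```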
Understand the whole flow
- L54: `CLIPTokenizer`: Tokenizes the text (prompt). Turning text into a vector of token IDs makes it easier for the model to handle.
- L56: `CLIPTextModel`: A multimodal model of language and images. In image generation, it extracts an embedding of the image described by the prompt.
- L46: The negative prompt is left as an empty string to keep things simple.
::right::
```python
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
```

```python
def encode_prompt(self, prompt: str):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    prompt_embeds = self.get_embes(prompt, self.tokenizer.model_max_length)
    negative_prompt_embeds = self.get_embes([''], prompt_embeds.shape[1])
    prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
    return prompt_embeds
```

```python
def get_embes(self, prompt, max_length):
    """
    Encode the text prompt into embeddings using the text encoder.
    """
    text_inputs = self.tokenizer(prompt, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    text_input_ids = text_inputs.input_ids.to(self.device)
    prompt_embeds = self.text_encoder(text_input_ids)[0].to(dtype=self.dtype, device=self.device)
    return prompt_embeds
```
<iframe frameborder="0" scrolling="no" class="emg-iframe-text-inputs" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fch5-text_inputs.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-text-inputs { transform: scale(0.9) translate(-50%, -50%); /* Apply both transformations */ transform-origin: top left; position: absolute; top: 50%; left: 50%; width: 100%; height: 100%; } </style>
<iframe frameborder="0" scrolling="no" class="emg-iframe-prompt-embeds" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fch5-prompt_embeds.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-prompt-embeds { transform: scale(0.8) translate(-50%, -50%); transform-origin: top left; position: absolute; top: 57%; left: 50%; width: 100%; height: 130%; } </style>
<iframe frameborder="0" scrolling="yes" class="overflow-scroll emg-iframe-play-prompt-embeds" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fnotebooks%2Fch0.0.2_Play_prompt_embeds.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-iframe-play-prompt-embeds { transform: scale(0.5) translate(-50%, -50%); /* Apply both transformations */ transform-origin: top left; position: absolute; top: 50%; left: 50%; width: 100%; height: 160%; } </style>
Prompt: Scheduler, flat vector illustration, best quality, high resolution
Step 2: get_latent
Required components
Understand the whole flow
- L63: Generate a random latent tensor whose width and height are 1/8 of the output image (a quick shape check follows the code).
::right::
```python
def get_latent(self, width: int, height: int):
    """
    Generate a random initial latent tensor to start the diffusion process.
    """
    return torch.randn((4, width // 8, height // 8)).to(
        device=self.device, dtype=self.dtype
    )
```
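A quick shape check in plain PyTorch, just to illustrate the 1/8 scaling mentioned on the left:

```python
import torch

width, height = 512, 512
# 4 latent channels; the spatial size is 1/8 of the output image,
# because the VAE downsamples by a factor of 8.
latent = torch.randn((4, width // 8, height // 8))
print(latent.shape)  # torch.Size([4, 64, 64])

# The pipeline adds a batch dimension before denoising: (1, 4, 64, 64).
print(latent.unsqueeze(dim=0).shape)
```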
Prompt: UNet, watercolor painting, detailed, brush strokes, best quality, high resolution
Step 3: denoise
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
Required components
Step 3: denoise
Where are the required components used?
- L86: UNet
- L91: Scheduler
::right::
```python
@torch.no_grad()
def denoise(self, latents, prompt_embeds, num_inference_steps=50, guidance_scale=7.5):
    """
    Iteratively denoise the latent space using the diffusion model to produce an image.
    """
    timesteps, num_inference_steps = self.retrieve_timesteps(num_inference_steps)
    for t in timesteps:
        latent_model_input = torch.cat([latents] * 2)
        # Predict the noise residual for the current timestep
        noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
        uncond_residual, text_cond_residual = noise_residual.chunk(2)
        guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
        # Update latents by reversing the diffusion process for the current timestep
        latents = self.scheduler.step(guided_noise_residual, t, latents)[0]
    return latents
```
Where are the required components used?
- L86: UNet2DConditionModel
- L91: DDIMScheduler
::right::
```python
@classmethod
def from_pretrained(cls, model_name, device=torch.device("cuda"), dtype=torch.float16):
    # Comments omitted
    tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
    scheduler = PareDDIMScheduler.from_config(model_name, subfolder="scheduler")
    unet = PareUNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
    vae = PareAutoencoderKL.from_pretrained(model_name, subfolder="vae")
    return cls(tokenizer, text_encoder, scheduler, unet, vae, device, dtype)
```
```python
for t in timesteps:
    latent_model_input = torch.cat([latents] * 2)
    # Predict the noise residual for the current timestep
    noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
    uncond_residual, text_cond_residual = noise_residual.chunk(2)
    guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
    # Update latents by reversing the diffusion process for the current timestep
    latents = self.scheduler.step(guided_noise_residual, t, latents)[0]
return latents
```
Understand the whole flow
- L80: Obtain the timesteps from the Scheduler (the Scheduler is described later).
- L82: Loop over the timesteps (the number of timesteps equals num_inference_steps).
- L86: Denoise with the UNet (the UNet is described later).
- L88: Calculate how strongly the prompt is taken into account, i.e. classifier-free guidance (Reference: Jonathan Ho, Tim Salimans: “Classifier-Free Diffusion Guidance”, 2022; arXiv:2207.12598). A small numeric sketch follows the code.
- L91: The strength of each denoising step is determined by the Scheduler.
::right::
```python
@torch.no_grad()
def denoise(self, latents, prompt_embeds, num_inference_steps=50, guidance_scale=7.5):
    """
    Iteratively denoise the latent space using the diffusion model to produce an image.
    """
    timesteps, num_inference_steps = self.retrieve_timesteps(num_inference_steps)
    for t in timesteps:
        latent_model_input = torch.cat([latents] * 2)
        # Predict the noise residual for the current timestep
        noise_residual = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)
        uncond_residual, text_cond_residual = noise_residual.chunk(2)
        guided_noise_residual = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
        # Update latents by reversing the diffusion process for the current timestep
        latents = self.scheduler.step(guided_noise_residual, t, latents)[0]
    return latents
```
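As referenced in the list above, here is a small numeric sketch of the guidance line (L88), using dummy tensors in place of real UNet outputs: the guided prediction starts from the unconditional estimate and moves toward the text-conditioned one, scaled by guidance_scale.

```python
import torch

guidance_scale = 7.5
uncond_residual = torch.zeros(1, 4, 64, 64)    # pretend noise prediction for the empty prompt
text_cond_residual = torch.ones(1, 4, 64, 64)  # pretend noise prediction for the text prompt

guided = uncond_residual + guidance_scale * (text_cond_residual - uncond_residual)
print(guided.mean())  # tensor(7.5000) -- the text direction, amplified 7.5x
```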
Determine the strength of denoising
- L49: Get alpha_prod_t (0–1.0), which indicates how much of the original signal is retained.
- L50: Get alpha_prod_t_prev (0–1.0).
- L52: alpha_prod_t + beta_prod_t = 1.
- L53: Estimate the original sample from the current sample and the model output.
- L54: Estimate the noise that was added.
- L56: Calculate the direction for moving back toward the original image.
- L57: Use these three values to compute the sample one denoising step further (the update is written out as an equation after the code).
::right::
```python
def step(
    self,
    model_output: torch.FloatTensor,
    timestep: int,
    sample: torch.FloatTensor,
) -> list:
    """Perform a single step of denoising in the diffusion process."""
    prev_timestep = timestep - self.config.num_train_timesteps // self.num_inference_steps
    alpha_prod_t = self.alphas_cumprod[timestep]
    alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod
    beta_prod_t = 1 - alpha_prod_t
    pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
    pred_epsilon = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample
    pred_sample_direction = (1 - alpha_prod_t_prev) ** (0.5) * pred_epsilon
    prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction
    return prev_sample, pred_original_sample
```
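In equation form, the step above first estimates the clean sample and the noise from the model output (written $v_\theta$ here, since the formulas in the code match the v-prediction parameterization), then takes one deterministic DDIM step:

$$ \hat{x}_0 = \sqrt{\bar\alpha_t}\,x_t - \sqrt{1-\bar\alpha_t}\,v_\theta, \qquad \hat\epsilon = \sqrt{\bar\alpha_t}\,v_\theta + \sqrt{1-\bar\alpha_t}\,x_t, \qquad x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\hat\epsilon . $$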
```python
def step(
    self,
    model_output: torch.FloatTensor,
    timestep: int,
    sample: torch.FloatTensor,
) -> list:
    """Perform a single step of denoising in the diffusion process."""
    prev_timestep = timestep - self.config.num_train_timesteps // self.num_inference_steps
    alpha_prod_t = self.alphas_cumprod[timestep]
    alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod
    beta_prod_t = 1 - alpha_prod_t
    pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
    pred_epsilon = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample
    pred_sample_direction = (1 - alpha_prod_t_prev) ** (0.5) * pred_epsilon
    prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction
    return prev_sample, pred_original_sample
```
understand-stable-diffusion-slidev-notebooks/scheduler.ipynb
understand-stable-diffusion-slidev-notebooks/scheduler.ipynb
understand-stable-diffusion-slidev-notebooks/scheduler.ipynb
<iframe frameborder="0" scrolling="no" class="scale-40 -translate-y-1/2 absolute top-50% right-25% w-full h-240%" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2Fmain%2Fembed%2Fwithout_scheduler.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <iframe frameborder="0" scrolling="no" class="scale-40 -translate-y-1/2 absolute top-54% left-25% w-full h-240%" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Funderstand-stable-diffusion-slidev-notebooks%2Fblob%2F606a033780f0c9aa0681fd1468f91f3961a73a3f%2Fembed%2Fwith_scheduler.ipynb&style=github&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
understand-stable-diffusion-slidev-notebooks/scheduler_necessity.ipynb
Used for Denoising
<iframe frameborder="0" scrolling="yes" class="overflow-scroll iframe-full-code" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Funet.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Olaf Ronneberger, Philipp Fischer, Thomas Brox: “U-Net: Convolutional Networks for Biomedical Image Segmentation”, 2015; arXiv:1505.04597.
<iframe frameborder="0" scrolling="yes" class="emg-res-transformer" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2F675b3fdaf4435e9982f92ff933f78db64f16a980%2Fsrc%2Fparediffusers%2Fmodels%2Funet_2d_blocks.py%23L114-L141&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe> <style> .emg-res-transformer { transform: scale(0.68) translate(-50%, -50%); transform-origin: top left; position: absolute; top: 63%; left: 50%; width: 100%; height: 130%; } </style>
Prompt: VAE, abstract style, highly detailed, colors and shapes
Step 4: vae_decode
Understand the whole flow
- L112: Decode the latents into image (pixel) space.
- L113: Normalization is applied during training, so denormalize the output.
- L114: Convert the tensor to a PIL.Image (a sketch of these helpers follows the code).
::right::
```python
@torch.no_grad()
def vae_decode(self, latents):
    """
    Decode the latent tensors using the VAE to produce an image.
    """
    image = self.vae.decode(latents / self.vae.config.scaling_factor)[0]
    image = self.denormalize(image)
    image = self.tensor_to_image(image)
    return image
```
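The denormalize and tensor_to_image helpers are not shown on this slide (the actual parediffusers vae.py is embedded next); as a hedged sketch, typical implementations look roughly like this:

```python
import torch
from PIL import Image

def denormalize(image: torch.Tensor) -> torch.Tensor:
    # Training maps images to [-1, 1]; map them back to [0, 1].
    return (image / 2 + 0.5).clamp(0, 1)

def tensor_to_image(image: torch.Tensor) -> Image.Image:
    # (batch, channels, height, width) float tensor in [0, 1] -> uint8 PIL image.
    array = (image[0].permute(1, 2, 0).float().cpu().numpy() * 255).round().astype("uint8")
    return Image.fromarray(array)
```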
<iframe frameborder="0" scrolling="yes" class="overflow-scroll iframe-full-code" allow="clipboard-write" src="https://emgithub.com/iframe.html?target=https%3A%2F%2Fgithub.com%2Fmasaishi%2Fparediffusers%2Fblob%2Fmain%2Fsrc%2Fparediffusers%2Fvae.py&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></iframe>
Prompt: Summary, long-exposure photography, masterpieces
Step 1: Convert the prompt to an embedding.
Step 2: Create a random latent.
Step 3: Denoise using the Scheduler and UNet.
Step 4: Decode to pixel space with the VAE.
```python
def __call__(self, prompt: str, height: int = 512, width: int = 512, ...):
    prompt_embeds = self.encode_prompt(prompt)
    latents = self.get_latent(width, height).unsqueeze(dim=0)
    latents = self.denoise(latents, prompt_embeds, ...)
    image = self.vae_decode(latents)
    return image
```
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
understand-stable-diffusion-slidev-notebooks/denoise.ipynb
```text
init          torch.Size([2, 4, 64, 64])
conv_in       torch.Size([2, 320, 64, 64])
down_blocks_0 torch.Size([2, 320, 32, 32])
down_blocks_1 torch.Size([2, 640, 16, 16])
down_blocks_2 torch.Size([2, 1280, 8, 8])
down_blocks_3 torch.Size([2, 1280, 8, 8])
mid_block     torch.Size([2, 1280, 8, 8])
up_blocks_0   torch.Size([2, 1280, 16, 16])
up_blocks_1   torch.Size([2, 1280, 32, 32])
up_blocks_2   torch.Size([2, 640, 64, 64])
up_blocks_3   torch.Size([2, 320, 64, 64])
conv_out      torch.Size([2, 4, 64, 64])
```