Stable Diffusion v2 uses OpenCLIP for text embedding. Stable Diffusion v1 uses OpenAI's CLIP ViT-L/14 for text embedding. The reasons for this change are:
- A larger text encoder model improves image quality (the two encoders are compared below).
- OpenAI's CLIP models are open-source, but they were trained with proprietary data.
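To make the size difference concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the usual Hub repos for the two encoders (the repo names and subfolder layout are assumptions on my part, not something stated above), that loads both text encoders and compares their parameter counts:

```python
from transformers import CLIPTextModel

# SD v1.x text encoder: OpenAI CLIP ViT-L/14 (768-dim token embeddings)
v1_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# SD v2.x text encoder: OpenCLIP ViT-H/14 (1024-dim token embeddings),
# as packaged inside the SD 2 repo -- repo name/subfolder are assumptions.
v2_text = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="text_encoder"
)

def n_params(model):
    # Total trainable parameter count of the text encoder
    return sum(p.numel() for p in model.parameters())

print(f"v1 text encoder: {n_params(v1_text) / 1e6:.0f}M parameters")
print(f"v2 text encoder: {n_params(v2_text) / 1e6:.0f}M parameters")
```

The v1 encoder is a 12-layer, 768-wide transformer, while the OpenCLIP ViT-H/14 encoder in v2 is 24 layers and 1024 wide, roughly three times the parameters.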
Stable Diffusion v1.4 is trained with
- 237k steps at resolution 256×256 on the laion2B-en dataset.
- 194k steps at resolution 512×512 on laion-high-resolution.
- 225k steps at resolution 512×512 on “laion-aesthetics v2 5+”, with 10% dropping of the text conditioning (sketched below).
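Dropping the text conditioning 10% of the time means roughly one caption in ten is replaced with the empty prompt during training, so the model also learns an unconditional prediction; this is what makes classifier-free guidance possible at sampling time. A minimal sketch of the idea (the function is illustrative, not the actual training code):

```python
import random

def maybe_drop_text(prompt: str, drop_prob: float = 0.10) -> str:
    """With probability drop_prob, train on an empty caption instead of the
    real one, so the model also learns the unconditional distribution."""
    return "" if random.random() < drop_prob else prompt
```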
Stable Diffusion v2 is trained with
- 550k steps at resolution 256×256 on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with punsafe=0.1 and an aesthetic score >= 4.5 (the filter is sketched after this list).
- 850k steps at resolution 512×512 on the same dataset, using images with resolution >= 512×512.
- 150k steps using a v-objective on the same dataset.
- Resumed for another 140k steps on 768×768 images.
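The punsafe and aesthetic-score filtering amounts to a per-image threshold check on two precomputed scores. A rough sketch of the logic, with hypothetical names (the real LAION tooling and metadata format differ):

```python
def keep_image(punsafe: float, aesthetic: float,
               punsafe_max: float = 0.1, aesthetic_min: float = 4.5) -> bool:
    """Keep an image only if the LAION-NSFW classifier's 'unsafe' probability
    is at most punsafe_max and its aesthetic score is at least aesthetic_min."""
    return punsafe <= punsafe_max and aesthetic >= aesthetic_min

# A mildly flagged, high-aesthetic image passes under a loose threshold like
# punsafe_max=0.98 (used for the v2.1 fine-tune below) but not under 0.1.
print(keep_image(punsafe=0.5, aesthetic=6.0, punsafe_max=0.98))  # True
print(keep_image(punsafe=0.5, aesthetic=6.0, punsafe_max=0.1))   # False
```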
Stable Diffusion v2.1 is fine-tuned on v2.0 with
- an additional 55k steps on the same dataset (with punsafe=0.1), and
- another 155k extra steps with punsafe=0.98.
So basically, they turned off the NSFW filter in the last training steps.
Users generally find it harder to use Stable Diffusion v2 to control styles and generate celebrities. **Although Stability AI did not explicitly filter out artist and celebrity names, their effects are much weaker in v2.** This is likely due to the difference in training data. OpenAI's proprietary data may have more artwork and celebrity photos. Their data is probably highly filtered so that everything and everyone looks fine and pretty.