text-to-image diffusion models can generate inappropriate content such as biased or harmful images
how can we stop a diffusion model from generating inappropriate content?
previous work - responsible alignment of diffusion models
existing work on responsible alignment of diffusion models falls into four broad categories
- curate the training dataset before training the diffusion model
refine the training dataset to remove biased and inappropriate content
limitation : computationally intensive, may not fully eliminate harmful content, and can degrade the model's performance
- fine-tune the pre-trained diffusion model
fine-tune the parameters of pre-trained models, aiming to remove the model's representation capability of generating such inappropriate concepts
limitation : requires a potentially exhaustive list of words that introduce biases and harmful concepts; sensitive to the adaptation process, which may degrade the original model
- filter the input prompt
detect and filter out inappropriate words from the input prompts
limitation : fails to address non-explicit phrases that can still yield inappropriate outputs
- use classifier-free guidance
utilize classifier-free guidance to direct the generated images away from undesirable content during inference
previous work - interpreting diffusion model in h-space
Diffusion Models already have a Semantic Latent Space
the bottleneck layer of the U-Net can be viewed as a semantic representation space, called h-space
manipulating h-space enables image generation that reflects a specific semantic concept
how can we find the direction of a specific semantic concept?
- unsupervised approach
discovered vectors must be interpreted with a human in the loop
number of interpretable directions depends on the training data
not clear to which semantic concepts those identified vectors correspond
some target concepts may not be found in the discovered directions
- supervised approach
require training external attribute classifiers supervised by human annotations
quality of the identified vectors is sensitive to the classifier's performance
new concepts require the training of new classifiers
to summarize...
the unsupervised approach needs a human in the loop and may still miss the target concept direction we want
the supervised approach requires training a classifier, is sensitive to the classifier's performance, and needs a new classifier for every new concept
responsible alignment methods perform reasonably well but still generate inappropriate content
how can this be overcome?
→ what about direct manipulation in h-space?
with existing h-space methods, finding the direction of an inappropriate concept is not easy
(the unsupervised approach may fail to find the direction, and the supervised approach is cumbersome because it requires training a classifier)
the paper's three main contributions are:
- a self-discovery method that finds a desired concept's direction in h-space without any external model or labeled data
- a demonstration that the discovered concept vectors enable responsible generation
responsible generation : fair generation, safe generation, responsible text-enhancing generation
- strong empirical performance with this method
to summarize...
the self-discovery method finds the h-space direction of a desired concept,
and manipulating h-space with these vectors is shown to mitigate inappropriate generation
finding a desired concept's h-space direction via self-discovery is the core of the paper
below, only the parts I consider important are briefly summarized
1. Approach
1.1. Finding a Semantic Concept
how can we find an interpretable direction for a concept we want?
prior work relied on human-labeled data and classifiers to find such directions
those methods are not scalable
what if we build the data by generating images with the diffusion model itself?
→ generate images with a pre-trained model using prompts that do / do not contain the concept
Figure 1
how to find an interpretable direction for the female concept
- y+ prompt containing the concept
generate x+ images with a photo of a female face
- y- prompt not containing the concept
generate with a photo of a face, optimizing a concept vector so that the x+ images are reconstructed
the pre-trained model is frozen, and the vector is optimized to minimize the reconstruction error
→ since it is optimized to generate female images, the concept vector c learns the female concept
note that the concept vector is a single vector, shared across timesteps
(the same single vector is added at every timestep)
at inference, the concept vector is added to the original activations in h-space at each decoding step
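The self-discovery optimization can be sketched with a toy stand-in for the frozen network: a frozen linear "decoder" plays the role of the pre-trained U-Net, and a single concept vector added to the bottleneck activation is optimized by gradient descent to reconstruct the outputs produced from the concept prompt. All names, shapes, and learning-rate values here are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen linear "decoder" standing in for the frozen pre-trained U-Net (toy).
W = rng.normal(size=(64, 8))        # maps h-space (8-d) to "image" space (64-d)
decode = lambda h: W @ h

h_base = rng.normal(size=8)         # bottleneck activation for the neutral prompt y-
c_true = rng.normal(size=8)         # shift implied by the concept prompt y+
x_plus = decode(h_base + c_true)    # "images" generated from the concept prompt

# Self-discovery: keep the model frozen and optimize a single concept vector c
# (shared across all timesteps in the paper) so that decoding the neutral
# activation shifted by c reconstructs the concept images.
c = np.zeros(8)
lr = 5e-3
for _ in range(500):
    residual = decode(h_base + c) - x_plus
    c -= lr * (W.T @ residual)      # gradient of 0.5 * ||residual||^2 w.r.t. c

# After optimization, c approximates the underlying concept direction c_true.
```

In the real method the "reconstruction error" is the denoising objective of the diffusion model, but the structure is the same: only the added vector receives gradients.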
1.2. Responsible Generation with Self-discovered Interpretable Latent Direction
Fair Generation Method (Figure 2)
purpose : to prevent generation of biased societal groups
train : learn a semantic concept for each of the different societal groups
inference : a concept vector is sampled from the learned concepts in the societal group with equal probability
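At inference time, fair generation reduces to drawing one group's learned vector uniformly at random per image. A minimal sketch with made-up 2-d vectors:

```python
import random

random.seed(0)

# Hypothetical learned concept vectors for the gender attribute (toy values).
concept_vectors = {"male": [0.3, -1.2], "female": [-0.7, 0.9]}

def sample_fair_concept():
    """Pick one societal group's vector with equal probability per generation."""
    group = random.choice(sorted(concept_vectors))
    return group, concept_vectors[group]

counts = {"male": 0, "female": 0}
for _ in range(10_000):
    group, _vec = sample_fair_concept()
    counts[group] += 1
# Over many generations the groups are sampled roughly equally often.
```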
Safe Generation Method (Figure 3)
purpose : to prevent generation of inappropriate content
train : learn the opposite latent direction of an inappropriate concept
ex.
learn the concept of anti-sexual
y+ prompt : a gorgeous person (with negative prompt sexual)
y- prompt : a gorgeous person
Responsible Text-enhancing Generation Method (Figure 4)
purpose : to make generative models accurately incorporate all the concepts defined in the prompt
train : learn concepts such as gender, race, safety
inference : extract safety-related concepts from prompt and apply to original activations
in practice this is no different from fair and safe generation
the difference is that the fair/safe concept is stated explicitly in the text prompt
the idea is to add the corresponding concept vector so that the fair/safe concept in the prompt is actually reflected instead of being ignored
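Mechanically, text-enhancing generation only needs a lookup from responsible phrases detected in the prompt to the already-learned concepts; the phrase strings and mapping below are illustrative, not the paper's exact implementation.

```python
# Hypothetical mapping from responsible phrases to learned concepts.
phrase_to_concepts = {
    "fair-gender": ["male", "female"],          # one would be sampled uniformly
    "without sexual content": ["anti-sexual"],
    "without violent content": ["anti-violence"],
}

def concepts_for_prompt(prompt: str) -> list[str]:
    """Collect the concepts to inject for responsible phrases found in the prompt."""
    found = []
    for phrase, concepts in phrase_to_concepts.items():
        if phrase in prompt:
            found.extend(concepts)
    return found

concepts_for_prompt("a picture of a loved couple, without sexual content")
```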
2. Experiments
2.1. Fair Generation
Task
to increase the diversity of societal groups in the generated images, particularly in professions where existing models exhibit gender and racial bias
Dataset
Winobias benchmark with original templates, hard templates
(ex. original template : a portrait of a doctor, hard template : a portrait of a successful doctor)
Evaluation Metric
target : gender (male, female), race (black, white, asian)
use deviation ratio to quantify the imbalance of different attributes
use CLIP classifier to predict attributes
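One plausible formulation of the deviation ratio (the paper's exact normalization may differ): the largest deviation of any attribute's frequency from the uniform share, rescaled so that 0 means perfectly balanced and 1 means fully collapsed onto a single attribute.

```python
def deviation_ratio(counts):
    """Imbalance over predicted attributes: 0 = balanced, 1 = single attribute."""
    total = sum(counts.values())
    k = len(counts)
    ideal = 1 / k
    return max(abs(n / total - ideal) for n in counts.values()) / (1 - ideal)

# e.g. CLIP-predicted genders over 100 generated "doctor" images (made-up counts)
deviation_ratio({"male": 90, "female": 10})   # heavily biased
deviation_ratio({"male": 50, "female": 50})   # balanced
```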
Approach Setting
Stable Diffusion 1.4 with 7.5 guidance scale
find 5 concept vectors (male, female, black, white, asian) using a base prompt person
(ex. y+ : a photo of a woman, y- : a photo of a person → learn the concept female)
concept vectors are optimized for 10K steps on 1K synthesized images for each concept
directly employ the learned vector without any scaling
Table 1
our approach is significantly better than the original SD and outperforms the state-of-the-art debiasing approach UCE
despite the presence of bias in the text prompts, our approach consistently performs well as it directly operates on the latent visual space
→ generalization capability of our approach to different text prompts
Figure 5
quality of images generated by our approach remains consistent with the original SD
2.2. Safe Generation
Task
eliminate harmful content specified in inappropriate prompts
Dataset
I2P benchmark : 4703 inappropriate prompts from real-world user prompts
(ex. illegal activity, sexual, violence)
Evaluation Metric
accuracy
use Nudenet detector, Q16 classifier to detect nudity or violent content
an image is classified as inappropriate if any of the classifiers predicts positive
Approach Setting
learn the concept vector for each inappropriate concept defined in the I2P dataset
(ex. anti-sexual)
certain concepts are rather abstract and include diverse visual categories
adding these concepts improves safety yet at a higher cost of image quality degradation
(ex. hate)
→ use only anti-sexual, anti-violence
identified concept vectors are linearly combined as the final vector
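The final safety vector is just a linear combination of the per-concept vectors; the toy 4-d values and unit weights below are illustrative, not the paper's.

```python
import numpy as np

# Hypothetical learned safety directions in h-space (toy values).
anti_sexual   = np.array([ 0.5, -0.1, 0.0,  0.3])
anti_violence = np.array([-0.2,  0.4, 0.1, -0.1])

# Linear combination of the identified concept vectors forms the final
# safety vector added to the bottleneck activations at inference.
safety_vector = 1.0 * anti_sexual + 1.0 * anti_violence
```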
Table 2
our safety vector can suppress inappropriate concepts that existing approaches failed to eliminate
2.3. Enhancing Responsible Text Guidance
Task
accurately represent the responsible phrases of the prompt in the generated image, when the user prompt is classified as responsible text
Dataset
create a dataset of 200 prompts that explicitly include responsible concepts
gender and race fairness, removal of sexual and violent content
(ex. a fair-gender doctor is operating a surgery, a picture of a loved couple, without sexual content)
Table 3
our approach effectively enhances the text guidance for responsible instructions
2.4. Semantic Concepts
Figure 6 - interpolation
impact of manipulating image semantics by linearly controlling the strength of the concept vector
the image is gradually modified to the introduced concept by adjusting the added vector's strength
the smooth transition indicates that the discovered vector represents the target semantic concept while remaining approximately disentangled from other semantic factors
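The interpolation in Figure 6 amounts to scaling a single vector before adding it to the bottleneck activation; a sketch with toy numbers:

```python
import numpy as np

h = np.array([1.0, 0.0, -0.5])   # original h-space activation (toy)
c = np.array([0.0, 1.0,  0.5])   # learned concept vector (toy)

# s = 0 reproduces the original activation; increasing s moves the
# generation smoothly toward the concept.
edits = [h + s * c for s in np.linspace(0.0, 1.0, 5)]
```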
Figure 7 - composition
by linearly combining these concept vectors, we can control the corresponding attributes in the generated image
→ composability of learned concept vectors
Figure 8 - generalization
train the latent vector for the concept "running" on generated dog images and test its effect on other objects using prompts such as a photo of a cat
although the vector of running was learned from dogs, it successfully extends to different animals and even humans
→ generalization capability of our discovered concept vector to universal semantic concepts
Table 4 - impact on image quality
quality of generated images remains approximately the same level as the original SD
3. Appendix
3.1. Approach
Table 5
negative scaling : learn the concept directly and apply negative scaling
(ex. learn the sexual concept vector directly and obtain anti-sexual by applying a negative scaling)
negative prompt approach (+anti-sexual) outperforms the negative scaling approach (−sexual)
backpropagating on the anti-sexual vector directly aligns with the objective of minimizing harmful content
negative scaling of the concept vector is more challenging as it involves extrapolating the learned vector into untrained directions
nevertheless, both approaches yield significantly better results than the original SD
3.2. Experiment for Fair Generation
Table 6
CLIP score evaluation on generated images from Winobias prompts
generated image is compared with the text used to generate it
similarity between the text embedding and image embedding is computed
(higher scores indicate better performance)
this experiment only quantifies the semantic alignment between the image and the input text, without considering the gender or race of the generated image
3.3. Hyperparameters for Safety Experiments
Figure 10
as we combine more concept vectors, our approach effectively removes more harmful content
however, we observed a decrease in image quality
we find that when the concept vector has a large magnitude, it tends to shift the image generation away from the input text prompt
3.4. Responsible Text-enhancing Benchmark
use GPT-3.5 to generate text with specified responsible phrases across 4 categories
gender fairness, race fairness, nonsexual content, nonviolent content
3.5. Semantic Concepts Visualizations
Interpolation
the generation process of diffusion models involves multiple factors, such as sequential operations, so precisely manipulating a single attribute with a linear vector is challenging
to ensure that the generated image remains close to the original image, we apply a technique inspired by SDEdit
during generation, we use a simple average operation
$x_{t} = (x_{t}^{(y)} + x_{t}^{(c, y)}) / 2$
the average of the per-step output without the concept vector and the output with it
this approach helps preserve more semantic structures from the original image
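With the two denoising trajectories written as arrays, the averaging step is a single line (toy values):

```python
import numpy as np

x_y  = np.array([0.2, -0.4])   # x_t^(y): step output without the concept vector (toy)
x_cy = np.array([0.6,  0.0])   # x_t^(c,y): step output with the concept vector (toy)

# SDEdit-inspired average keeping the edited trajectory close to the original.
x_t = (x_y + x_cy) / 2
```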
Composition
Table 8
composing vectors performed similarly to applying a single vector
→ effectiveness of the linear composition of concepts in the semantic space
Generalization
Figure 14
concepts learned from particular images capture more general properties that can be generalized to different prompts with similar semantics
3.6. Ablation Study
Figure 11 (left) - number of training images
as long as the number of samples reached a reasonable level, the specific number of unique images had less impact on the performance
Figure 11 (right) - number of unique training prompts
number of unique prompts had less impact on the overall performance
learning with a particular profession is more challenging than learning with a generic prompt such as a person
adding various prompts leads to a slight improvement, but less significant than adding the number of training samples
Figure 15 - concept discovery with realistic dataset
using CelebA, our approach can find the semantic concepts for Stable Diffusion