text-to-image diffusion models can generate inappropriate content such as biased or harmful images
how can we stop a diffusion model from generating inappropriate content?
previous work - responsible alignment of diffusion models
existing work on responsible alignment of diffusion models falls into four broad categories
- curate the training dataset before training the diffusion model
refine the training dataset to remove biased and inappropriate content
limitation : computationally intensive, may not fully eliminate harmful content, and can degrade the model's performance
- fine-tune the pre-trained diffusion model
fine-tune the parameters of pre-trained models, aiming to remove the model's representation capability of generating such inappropriate concepts
limitation : requires a potentially exhaustive list of words that introduce biases and harmful concepts; sensitive to the adaptation process, which may degrade the original model
- filter the input prompt
detect and filter out inappropriate words from the input prompts
limitation : fails to address non-explicit phrases that can still yield inappropriate outputs
- use classifier-free guidance
utilize classifier-free guidance to direct the generated images away from undesirable content during inference
previous work - interpreting diffusion model in h-space
Diffusion Models already have a Semantic Latent Space
the bottleneck layer of the U-Net can be viewed as a semantic representation space, called h-space
manipulating h-space enables image generation that reflects a specific semantic concept
how can we find the direction of a specific semantic concept?
- unsupervised approach
discovered vectors must be interpreted with a human in the loop
number of interpretable directions depends on the training data
not clear to which semantic concepts those identified vectors correspond
some target concepts may not be found in the discovered directions
- supervised approach
require training external attribute classifiers supervised by human annotations
quality of the identified vectors is sensitive to the classifier's performance
new concepts require the training of new classifiers
to summarize...
the unsupervised approach needs a human in the loop and may still miss the target concept direction we want
the supervised approach requires training a classifier, is sensitive to the classifier's performance, and needs a new classifier for every new concept
responsible alignment methods perform reasonably well but still generate inappropriate content
how can this be overcome?
→ what about direct manipulation in h-space?
with existing h-space methods, finding the direction of an inappropriate concept is not easy
(the unsupervised approach may fail to find the direction, and the supervised approach is cumbersome because it requires training a classifier)
the paper's three main contributions are:
- a self-discovery method that finds a desired concept's direction in h-space without any external model or labeled data
- a demonstration that the discovered concept vectors enable responsible generation
responsible generation : fair generation, safe generation, responsible text-enhancing generation
- strong empirical performance with this method
to summarize...
the self-discovery method finds the h-space direction of a desired concept,
and manipulating h-space with these vectors is shown to mitigate inappropriate generation
finding a desired concept's h-space direction via self-discovery is the core of the paper
below, only the parts I consider important are briefly summarized
1. Approach
1.1. Finding a Semantic Concept
how can we find an interpretable direction for a concept we want?
prior work relied on human-labeled data and classifiers to find such directions
those methods are not scalable
what if we build the data by generating images with the diffusion model itself?
→ generate images with a pre-trained model using prompts that do / do not contain the concept
Figure 1
how to find an interpretable direction for the female concept
- y+ prompt containing the concept
generate x+ images with a photo of a female face
- y- prompt not containing the concept
generate with a photo of a face, optimizing a concept vector so that the x+ images are reconstructed
the pre-trained model is frozen, and the vector is optimized to minimize the reconstruction error
→ since it is optimized to generate female images, the concept vector c learns the female concept
note that the concept vector is a single vector, shared across timesteps
(the same single vector is added at every timestep)
at inference, the concept vector is added to the original activations in h-space at each decoding step
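The self-discovery optimization can be sketched with a toy stand-in for the frozen network: a frozen linear "decoder" plays the role of the pre-trained U-Net, and a single concept vector added to the bottleneck activation is optimized by gradient descent to reconstruct the outputs produced from the concept prompt. All names, shapes, and learning-rate values here are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen linear "decoder" standing in for the frozen pre-trained U-Net (toy).
W = rng.normal(size=(64, 8))        # maps h-space (8-d) to "image" space (64-d)
decode = lambda h: W @ h

h_base = rng.normal(size=8)         # bottleneck activation for the neutral prompt y-
c_true = rng.normal(size=8)         # shift implied by the concept prompt y+
x_plus = decode(h_base + c_true)    # "images" generated from the concept prompt

# Self-discovery: keep the model frozen and optimize a single concept vector c
# (shared across all timesteps in the paper) so that decoding the neutral
# activation shifted by c reconstructs the concept images.
c = np.zeros(8)
lr = 5e-3
for _ in range(500):
    residual = decode(h_base + c) - x_plus
    c -= lr * (W.T @ residual)      # gradient of 0.5 * ||residual||^2 w.r.t. c

# After optimization, c approximates the underlying concept direction c_true.
```

In the real method the "reconstruction error" is the denoising objective of the diffusion model, but the structure is the same: only the added vector receives gradients.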
1.2. Responsible Generation with Self-discovered Interpretable Latent Direction
Fair Generation Method (Figure 2)
purpose : to prevent generation of biased societal groups
train : learn a semantic concept for each of the different societal groups
inference : a concept vector is sampled from the learned concepts in the societal group with equal probability
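At inference time, fair generation reduces to drawing one group's learned vector uniformly at random per image. A minimal sketch with made-up 2-d vectors:

```python
import random

random.seed(0)

# Hypothetical learned concept vectors for the gender attribute (toy values).
concept_vectors = {"male": [0.3, -1.2], "female": [-0.7, 0.9]}

def sample_fair_concept():
    """Pick one societal group's vector with equal probability per generation."""
    group = random.choice(sorted(concept_vectors))
    return group, concept_vectors[group]

counts = {"male": 0, "female": 0}
for _ in range(10_000):
    group, _vec = sample_fair_concept()
    counts[group] += 1
# Over many generations the groups are sampled roughly equally often.
```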
Safe Generation Method (Figure 3)
purpose : to prevent generation of inappropriate content
train : learn the opposite latent direction of an inappropriate concept
ex.
learn the concept of anti-sexual
y+ prompt : a gorgeous person (with negative prompt sexual)
y- prompt : a gorgeous person
Responsible Text-enhancing Generation Method (Figure 4)
purpose : to make generative models accurately incorporate all the concepts defined in the prompt
train : learn concepts such as gender, race, safety
inference : extract safety-related concepts from prompt and apply to original activations
in practice this is no different from fair and safe generation
the difference is that the fair/safe concept is stated explicitly in the text prompt
the idea is to add the corresponding concept vector so that the fair/safe concept in the prompt is actually reflected instead of being ignored
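Mechanically, text-enhancing generation only needs a lookup from responsible phrases detected in the prompt to the already-learned concepts; the phrase strings and mapping below are illustrative, not the paper's exact implementation.

```python
# Hypothetical mapping from responsible phrases to learned concepts.
phrase_to_concepts = {
    "fair-gender": ["male", "female"],          # one would be sampled uniformly
    "without sexual content": ["anti-sexual"],
    "without violent content": ["anti-violence"],
}

def concepts_for_prompt(prompt: str) -> list[str]:
    """Collect the concepts to inject for responsible phrases found in the prompt."""
    found = []
    for phrase, concepts in phrase_to_concepts.items():
        if phrase in prompt:
            found.extend(concepts)
    return found

concepts_for_prompt("a picture of a loved couple, without sexual content")
```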
2. Experiments
2.1. Fair Generation
Task
to increase the diversity of societal groups in the generated images, particularly in professions where existing models exhibit gender and racial bias
Dataset
Winobias benchmark with original templates, hard templates
(ex. original template : a portrait of a doctor, hard template : a portrait of a successful doctor)
Evaluation Metric
target : gender (male, female), race (black, white, asian)
use deviation ratio to quantify the imbalance of different attributes
use CLIP classifier to predict attributes
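One plausible formulation of the deviation ratio (the paper's exact normalization may differ): the largest deviation of any attribute's frequency from the uniform share, rescaled so that 0 means perfectly balanced and 1 means fully collapsed onto a single attribute.

```python
def deviation_ratio(counts):
    """Imbalance over predicted attributes: 0 = balanced, 1 = single attribute."""
    total = sum(counts.values())
    k = len(counts)
    ideal = 1 / k
    return max(abs(n / total - ideal) for n in counts.values()) / (1 - ideal)

# e.g. CLIP-predicted genders over 100 generated "doctor" images (made-up counts)
deviation_ratio({"male": 90, "female": 10})   # heavily biased
deviation_ratio({"male": 50, "female": 50})   # balanced
```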
Approach Setting
Stable Diffusion 1.4 with 7.5 guidance scale
find 5 concept vectors (male, female, black, white, asian) using a base prompt person
(ex. y+ : a photo of a woman, y- : a photo of a person → learn the concept female)
concept vectors are optimized for 10K steps on 1K synthesized images for each concept
directly employ the learned vector without any scaling
Table 1
our approach is significantly better than the original SD and outperforms the state-of-the-art debiasing approach UCE
despite the presence of bias in the text prompts, our approach consistently performs well as it directly operates on the latent visual space
→ generalization capability of our approach to different text prompts
Figure 5
quality of images generated by our approach remains consistent with the original SD
2.2. Safe Generation
Task
eliminate harmful content specified in inappropriate prompts
Dataset
I2P benchmark : 4703 inappropriate prompts from real-world user prompts
(ex. illegal activity, sexual, violence)
Evaluation Metric
accuracy
use Nudenet detector, Q16 classifier to detect nudity or violent content
an image is classified as inappropriate if any of the classifiers predicts positive
Approach Setting
learn the concept vector for each inappropriate concept defined in the I2P dataset
(ex. anti-sexual)
certain concepts are rather abstract and include diverse visual categories
adding these concepts improves safety yet at a higher cost of image quality degradation
(ex. hate)
→ use only anti-sexual, anti-violence
identified concept vectors are linearly combined as the final vector
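The final safety vector is just a linear combination of the per-concept vectors; the toy 4-d values and unit weights below are illustrative, not the paper's.

```python
import numpy as np

# Hypothetical learned safety directions in h-space (toy values).
anti_sexual   = np.array([ 0.5, -0.1, 0.0,  0.3])
anti_violence = np.array([-0.2,  0.4, 0.1, -0.1])

# Linear combination of the identified concept vectors forms the final
# safety vector added to the bottleneck activations at inference.
safety_vector = 1.0 * anti_sexual + 1.0 * anti_violence
```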
Table 2
our safety vector can suppress inappropriate concepts that existing approaches failed to eliminate
2.3. Enhancing Responsible Text Guidance
Task
accurately represent the responsible phrases of the prompt in the generated image, when the user prompt is classified as responsible text
Dataset
create a dataset of 200 prompts that explicitly include responsible concepts
gender and race fairness, removal of sexual and violent content
(ex. a fair-gender doctor is operating a surgery, a picture of a loved couple, without sexual content)
Table 3
our approach effectively enhances the text guidance for responsible instructions
2.4. Semantic Concepts
Figure 6 - interpolation
impact of manipulating image semantics by linearly controlling the strength of the concept vector
the image is gradually modified to the introduced concept by adjusting the added vector's strength
the smooth transition indicates that the discovered vector represents the target semantic concept while remaining approximately disentangled from other semantic factors
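The interpolation in Figure 6 amounts to scaling a single vector before adding it to the bottleneck activation; a sketch with toy numbers:

```python
import numpy as np

h = np.array([1.0, 0.0, -0.5])   # original h-space activation (toy)
c = np.array([0.0, 1.0,  0.5])   # learned concept vector (toy)

# s = 0 reproduces the original activation; increasing s moves the
# generation smoothly toward the concept.
edits = [h + s * c for s in np.linspace(0.0, 1.0, 5)]
```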
Figure 7 - composition
by linearly combining these concept vectors, we can control the corresponding attributes in the generated image
→ composability of learned concept vectors
Figure 8 - generalization
train the latent vector for the concept "running" on generated dog images and test its effect on other objects using prompts such as a photo of a cat
although the vector of running was learned from dogs, it successfully extends to different animals and even humans
→ generalization capability of our discovered concept vector to universal semantic concepts
Table 4 - impact on image quality
quality of generated images remains approximately the same level as the original SD
3. Appendix
3.1. Approach
Table 5
negative scaling : learn the concept directly and apply negative scaling
(ex. learn the sexual concept vector directly and obtain anti-sexual by applying a negative scaling)
negative prompt approach (+anti-sexual) outperforms the negative scaling approach (−sexual)
backpropagating on the anti-sexual vector directly aligns with the objective of minimizing harmful content
negative scaling of the concept vector is more challenging as it involves extrapolating the learned vector into untrained directions
nevertheless, both approaches yield significantly better results than the original SD
3.2. Experiment for Fair Generation
Table 6
CLIP score evaluation on generated images from Winobias prompts
generated image is compared with the text used to generate it
similarity between the text embedding and image embedding is computed
(higher scores indicate better performance)
this experiment only quantifies the semantic alignment between the image and the input text, without considering the gender or race of the generated image
3.3. Hyperparameters for Safety Experiments
Figure 10
as we combine more concept vectors, our approach effectively removes more harmful content
however, we observed a decrease in image quality
we find that when the concept vector has a large magnitude, it tends to shift the image generation away from the input text prompt
3.4. Responsible Text-enhancing Benchmark
use GPT-3.5 to generate text with specified responsible phrases across 4 categories
gender fairness, race fairness, nonsexual content, nonviolent content
3.5. Semantic Concepts Visualizations
Interpolation
the generation process of diffusion models involves multiple factors, such as sequential operations, so precisely manipulating a single attribute with a linear vector is challenging
to ensure that the generated image remains close to the original image, we apply a technique inspired by SDEdit
during generation, we use a simple average operation
$x_{t} = (x_{t}^{(y)} + x_{t}^{(c, y)}) / 2$
the average of the per-step output without the concept vector and the output with it
this approach helps preserve more semantic structures from the original image
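With the two denoising trajectories written as arrays, the averaging step is a single line (toy values):

```python
import numpy as np

x_y  = np.array([0.2, -0.4])   # x_t^(y): step output without the concept vector (toy)
x_cy = np.array([0.6,  0.0])   # x_t^(c,y): step output with the concept vector (toy)

# SDEdit-inspired average keeping the edited trajectory close to the original.
x_t = (x_y + x_cy) / 2
```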
Composition
Table 8
composing vectors performed similarly to applying a single vector
→ effectiveness of the linear composition of concepts in the semantic space
Generalization
Figure 14
concepts learned from particular images capture more general properties that can be generalized to different prompts with similar semantics
3.6. Ablation Study
Figure 11 (left) - number of training images
as long as the number of samples reached a reasonable level, the specific number of unique images had less impact on the performance
Figure 11 (right) - number of unique training prompts
number of unique prompts had less impact on the overall performance
learning with a particular profession is more challenging than learning with a generic prompt such as a person
adding various prompts leads to a slight improvement, but less significant than adding the number of training samples
Figure 15 - concept discovery with realistic dataset
using CelebA, our approach can find the semantic concepts for Stable Diffusion