
How to train the first stage? #3

Open
Weifeng-Chen opened this issue Sep 19, 2022 · 15 comments

Comments

@Weifeng-Chen

Hi, do you train your text encoder in the CLIP way, without latent diffusion? Or do you train it together with the diffusion model? What is the loss function and what are the other details? Would you like to share more details about the training?

@ScottishFold007

Hi, do you train your text encoder in the CLIP way, without latent diffusion? Or do you train it together with the diffusion model? What is the loss function and what are the other details? Would you like to share more details about the training?

I did this by training the CLIP model directly, but with the weights of the ViT model frozen; that lets you increase the batch size. With a text length of 64 tokens, a batch size of 96 fits on a Colab P100. So easy!
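
For readers following along, here is a minimal sketch of the recipe described above: fine-tuning only CLIP's text tower while keeping the ViT frozen, with the standard symmetric contrastive loss. The checkpoint name, text length, and data handling are illustrative assumptions, not the exact setup used in this thread.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda"
name = "openai/clip-vit-large-patch14"  # example checkpoint only
model = CLIPModel.from_pretrained(name).to(device)
processor = CLIPProcessor.from_pretrained(name)

# Freeze the vision tower (ViT); only the text tower and projections get gradients
for p in model.vision_model.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

def training_step(images, texts):
    # images: list of PIL images; texts: matching captions (e.g. Chinese)
    batch = processor(text=texts, images=images, return_tensors="pt",
                      padding="max_length", max_length=64, truncation=True).to(device)
    out = model(**batch)
    # Symmetric InfoNCE: matched image/text pairs sit on the diagonal
    labels = torch.arange(out.logits_per_image.size(0), device=device)
    loss = (F.cross_entropy(out.logits_per_image, labels)
            + F.cross_entropy(out.logits_per_text, labels)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Note this sketch keeps the original CLIP tokenizer; the vocabulary question for Chinese text is discussed further down in the thread.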

@Weifeng-Chen
Author

Weifeng-Chen commented Sep 19, 2022 via email

@Weifeng-Chen
Author

Hi, do you train your text encoder in the CLIP way, without latent diffusion? Or do you train it together with the diffusion model? What is the loss function and what are the other details? Would you like to share more details about the training?

I did this by training the CLIP model directly, but with the weights of the ViT model frozen; that lets you increase the batch size. With a text length of 64 tokens, a batch size of 96 fits on a Colab P100. So easy!

Oh, I have trained my own Chinese CLIP as well. So what I need to do is jointly train it with the latent diffusion model?

@ScottishFold007

ScottishFold007 commented Sep 19, 2022 via email

@Weifeng-Chen
Author

(Let's just switch to Chinese.) After I finished training, Chinese prompts produce images of decent quality, just not very Chinese in style. After you finish the first step, can you get results? Or do Chinese inputs give poor results that don't capture the semantics?


I might have to retrain... With this cross-attention mechanism, my model's dimensions probably won't match. So you did no further fine-tuning and could use the trained CLIP directly for Chinese generation?

@ScottishFold007

ScottishFold007 commented Sep 19, 2022 via email

@ScottishFold007

ScottishFold007 commented Sep 19, 2022 via email

@Weifeng-Chen
Author

The model's hidden dimension needs to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly with ViT-L transfers very well.

I initially trained CLIP's text encoder, but its downstream zero-shot performance was mediocre (mainly a vocabulary problem). For generation, though, downstream quality seems hard to evaluate. I'll try aligning the dimensions later and see. Thanks a lot!

@ScottishFold007

The model's hidden dimension needs to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly with ViT-L transfers very well.

Yes, that's what I did this time. But purely in terms of adapting CLIP to Chinese, pairing a Chinese BERT with CLIP's ViT and then fine-tuning works better.

@ScottishFold007

The model's hidden dimension needs to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly with ViT-L transfers very well.

That said, if you want SD's "Sinicization" to work well, you still have to do the second step: jointly train the text encoder / UNet / VAE, with some of the components optionally frozen.
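
As a rough illustration of that second step (not the exact recipe from this thread), here is a minimal diffusers-style sketch that freezes the text encoder and VAE and fine-tunes only the UNet with the usual noise-prediction loss. The base checkpoint, batch format, and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"  # example base checkpoint
device = "cuda"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Freeze the VAE and text encoder; only the UNet is updated
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def training_step(pixel_values, captions):
    # Encode images to latents and captions to conditioning with the frozen modules
    with torch.no_grad():
        latents = vae.encode(pixel_values.to(device)).latent_dist.sample() * 0.18215
        ids = tokenizer(captions, padding="max_length", max_length=77,
                        truncation=True, return_tensors="pt").input_ids.to(device)
        text_emb = text_encoder(ids)[0]  # (B, 77, 768) for cross-attention

    # Standard denoising objective: predict the added noise
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device).long()
    noisy = noise_scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

To unfreeze the text encoder as well, simply skip its `requires_grad_(False)` call and add its parameters to the optimizer.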

@Weifeng-Chen
Author

The model's hidden dimension needs to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly with ViT-L transfers very well.

I initially trained CLIP's text encoder, but its downstream zero-shot performance was mediocre (mainly a vocabulary problem). For generation, though, downstream quality seems hard to evaluate. I'll try aligning the dimensions later and see. Thanks a lot!

I directly hacked the vocab and tokenizer, swapping the vocab for BERT's, and then used a few tricks (sketched below):

  1. Changed [CLS] XXXXXX [SEP] [PAD] [PAD] into [CLS] XXXXXX [PAD] [PAD] [PAD]
  2. Copied the original CLIP weight for [BOS] onto [CLS], and the original CLIP weight for [EOS] onto [PAD], so they align
  3. This way I can fine-tune directly from the original CLIP weights and migrate quickly to the Chinese BERT vocab
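
A minimal sketch of the weight-copy trick above, assuming a Hugging Face CLIP text model and a Chinese BERT tokenizer; the checkpoint names are only examples.

```python
import torch
from transformers import BertTokenizer, CLIPTextModel, CLIPTokenizer

# Example checkpoints; substitute your own
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
bert_tok = BertTokenizer.from_pretrained("bert-base-chinese")

old_weight = clip_text.get_input_embeddings().weight.data  # original CLIP token embeddings

# New embedding table sized to the BERT vocab, randomly initialized
new_emb = torch.nn.Embedding(bert_tok.vocab_size, old_weight.shape[1])

# Trick 2: copy CLIP's [BOS] row onto BERT's [CLS] and CLIP's [EOS] row onto BERT's [PAD],
# so "[CLS] xxx [PAD] [PAD] ..." lines up with CLIP's "<bos> xxx <eos>" convention (trick 1)
new_emb.weight.data[bert_tok.cls_token_id] = old_weight[clip_tok.bos_token_id]
new_emb.weight.data[bert_tok.pad_token_id] = old_weight[clip_tok.eos_token_id]

clip_text.set_input_embeddings(new_emb)
clip_text.config.vocab_size = bert_tok.vocab_size
# From here, fine-tune the text tower contrastively against the frozen ViT (trick 3)
```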

Hmm, how much data did you fine-tune on? I trained directly on over 100 million pairs, so loading a Chinese RoBERTa pretrained model worked much better. I recall that the original model without changing the vocab got around 0.2-something, while Chinese RoBERTa reached about 0.4-something (on a Chinese translation of ImageNet-1k). I also tried keeping CLIP's weights and only swapping the vocab, and that didn't work well either, so your tricks are probably quite important (but this experiment is rather expensive, haha; I may have to try it on a smaller dataset).
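
For context, zero-shot numbers like the 0.2 / 0.4 above are usually measured roughly as follows: encode a prompt for each (translated) class name, encode the image, and take the best-matching class. A minimal sketch with a placeholder checkpoint and prompt template:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda"
name = "openai/clip-vit-large-patch14"  # placeholder; swap in the Chinese CLIP
model = CLIPModel.from_pretrained(name).to(device).eval()
processor = CLIPProcessor.from_pretrained(name)

@torch.no_grad()
def zero_shot_predict(images, class_names):
    # class_names: e.g. ImageNet-1k labels translated into Chinese
    prompts = [f"一张{c}的照片" for c in class_names]  # "a photo of a {c}"
    batch = processor(text=prompts, images=images, return_tensors="pt",
                      padding=True, truncation=True).to(device)
    out = model(**batch)
    # logits_per_image: (num_images, num_classes); accuracy = fraction of correct argmax
    return out.logits_per_image.argmax(dim=-1)
```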

@ScottishFold007

Do you train both the ViT and text weights? I only unfreeze the text model's weights and keep the ViT frozen, otherwise 16 GB of VRAM can't handle it; the results are still decent, with training data on the order of tens of millions of pairs.

@Weifeng-Chen
Author

I used the Chinese subset of LAION-5B, about 100 million pairs. Is the score you mentioned the similarity-matching score? If so, CLIP's native matching scores are all quite low, but it still guides stable diffusion well.

I didn't realize the LAION Chinese subset was that big. I used Noah's Ark Lab's open-source Wukong and 360's open-source Zero datasets. I later used the model to guide Disco Diffusion, and it can generate plenty of images from the Chinese-internet domain. But for Stable Diffusion, since I converted the dimensions earlier, it can't be plugged in directly...

@JunnYu

JunnYu commented Sep 26, 2022

Can I use existing Chinese CLIP text encoder weights? For example: https://github.com/PaddlePaddle/ERNIE/tree/ernie-kit-open-v1.0/Research/ERNIE-ViL2 (this is a bidirectional language model, not unidirectional), and its hidden states are 768-dimensional.
If I freeze this ERNIE-ViL2 text encoder and the VAE, is it enough to fine-tune only the UNet?

@ScottishFold007

Can I use existing Chinese CLIP text encoder weights? For example: https://github.com/PaddlePaddle/ERNIE/tree/ernie-kit-open-v1.0/Research/ERNIE-ViL2 (this is a bidirectional language model, not unidirectional), and its hidden states are 768-dimensional. If I freeze this ERNIE-ViL2 text encoder and the VAE, is it enough to fine-tune only the UNet?

Probably not. The existing SD was trained on OpenAI's CLIP, so Baidu's model won't work directly; in the second stage, the UNet, VAE, and CLIP all have to be coupled together.
