All models are stored in EchoVideo/ckpts, and the file structure is as follows:
EchoVideo/ckpts/
├── clip_visual_encoder
├── configuration.json
├── face_encoder
├── model_index.json
├── README.md
├── scheduler
│   └── scheduler_config.json
├── text_encoder
├── tokenizer
├── transformer
│   ├── config.json
│   ├── diffusion_pytorch_model-00001-of-00002.safetensors
│   ├── diffusion_pytorch_model-00002-of-00002.safetensors
│   └── diffusion_pytorch_model.safetensors.index.json
└── vae
To download the EchoVideo model, first install the Hugging Face CLI:
python -m pip install "huggingface_hub[cli]"
Then download the model using the following commands:
# Switch to the directory named 'EchoVideo'
cd EchoVideo
huggingface-cli download bytedance-research/EchoVideo --local-dir ./ckpts
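As an optional sanity check (not part of the official instructions), list the checkpoint directory and confirm it matches the tree above:
# The transformer folder should contain the config and both safetensors shards
ls ckpts
ls ckpts/transformer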
- Download the face recognition model (antelopev2) and put it into the ckpts/face_encoder/ folder. The commands below assume antelopev2.zip has been obtained separately and placed in ckpts/face_encoder/:
mkdir -p ckpts/face_encoder
cd ckpts/face_encoder
unzip antelopev2.zip -d models
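As a quick check (assuming the standard antelopev2 package layout), the unzipped folder should now contain the ONNX detection and recognition models:
# List the unpacked ONNX models; adjust the path if the zip nests an extra folder
ls models/antelopev2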
- Face parsing (face_encoder folder): we use the BiSeNet model.
cd ckpts/face_encoder
wget https://github.com/xinntao/facexlib/releases/download/v0.1.0/detection_Resnet50_Final.pth
wget https://github.com/xinntao/facexlib/releases/download/v0.2.2/parsing_parsenet.pth
wget https://github.com/xinntao/facexlib/releases/download/v0.2.0/parsing_bisenet.pth
- Visual encoder (clip_visual_encoder folder): we use the SigLIP model.
cd ckpts
huggingface-cli download google/siglip-base-patch16-224 --local-dir ./clip_visual_encoder
- Text encoder (text_encoder folder), tokenizer (tokenizer folder), and VAE (vae folder): we follow CogVideoX. One possible download recipe is sketched after this list.
  - Download the text encoder and put it into the ckpts/text_encoder folder.
  - Download the tokenizer and put it into the ckpts/tokenizer folder.
  - Download the VAE and put it into the ckpts/vae folder.
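The commands below are a sketch of one possible way to fetch these three components, assuming they are taken from the THUDM/CogVideoX-5b repository on Hugging Face (the steps above do not name the exact source repo):
# Run from the EchoVideo root; pull only the three subfolders into ckpts/
huggingface-cli download THUDM/CogVideoX-5b --include "text_encoder/*" --local-dir ./ckpts
huggingface-cli download THUDM/CogVideoX-5b --include "tokenizer/*" --local-dir ./ckpts
huggingface-cli download THUDM/CogVideoX-5b --include "vae/*" --local-dir ./ckpts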