This repo contains PyTorch model definitions, pre-trained weights and inference code for our video generation model, EchoVideo.
EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
[2025.02.27] We release the inference code and model weights of EchoVideo. Download
EchoVideo generates a personalized video from a single photo and a text description. It excels at addressing the "semantic conflict" and "copy-paste" problems, and it demonstrates state-of-the-art performance.
We strongly recommend visiting this link for more results.
| Face-ID Preserving | Full-Body Preserving |
| --- | --- |
| echoVideo_face.mp4 | echoVideo_body.mp4 |
| EchoVideo | ConsisID | IDAnimator |
| --- | --- | --- |
| echoVideo_1.mp4 | consisID_1.mp4 | idanimator_1.mp4 |
| echoVideo_2.mp4 | consisID_2.mp4 | idanimator_2.mp4 |
Requires Python 3.10 to 3.12, inclusive. Both GPU and NPU are supported.
git clone https://github.com/bytedance/EchoVideo
cd EchoVideo
pip install -r requirements.txt
Details on downloading the pretrained models are shown here.
# multi-resolution video generation [(480, 640), (480, 848), (480, 480), (848, 480), (640, 480)]
python infer.py
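For orientation, here is a minimal sketch of what a single-image, text-conditioned generation call could look like. The `EchoVideoPipeline` class, its arguments, and the checkpoint path are hypothetical stand-ins for illustration, not the actual interface of `infer.py`:

```python
# Hypothetical sketch only: EchoVideoPipeline, its arguments, and the
# checkpoint path are illustrative stand-ins, not the real infer.py API.
import torch
from PIL import Image

from echovideo import EchoVideoPipeline  # hypothetical import

pipe = EchoVideoPipeline.from_pretrained(
    "./ckpts/EchoVideo", torch_dtype=torch.bfloat16
)
pipe.to("cuda")  # an NPU device string would work here as well

# One reference photo plus a text prompt drives the generation.
video_frames = pipe(
    image=Image.open("face.jpg"),
    prompt="A woman in a red coat walks through a snowy park.",
    height=480,
    width=848,  # any of the supported resolutions listed above
)

# Saving is equally pipeline-specific; shown here as a placeholder call.
video_frames.save("output.mp4")
```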
Overall architecture of EchoVideo. By employing a meticulously designed IITF module and mitigating the over-reliance on input images, our model effectively unifies the semantic information between the input facial image and the textual prompt. This integration enables the generation of consistent characters with multi-view facial coherence, ensuring that the synthesized outputs maintain both visual and semantic fidelity across diverse perspectives.
Illustration of facial information injection methods. (a) IITF. Facial and textual information are fused to ensure consistent guidance throughout the generation process. We propose IITF to establish a semantic bridge between facial and textual information and to coordinate the influence of each on character features, thereby ensuring the consistency of the generated characters. IITF consists of two core components: facial feature alignment and conditional feature alignment. (b) Dual branch. Facial and textual information are independently injected through cross-attention mechanisms, providing separate guidance for the generation process.
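To make the contrast concrete, here is a minimal PyTorch sketch of the two injection strategies. The module names, dimensions, projection, and fusion details are illustrative assumptions for exposition, not the paper's exact implementation:

```python
import torch
import torch.nn as nn


class IITFStyleFusion(nn.Module):
    """Sketch of (a): align facial features into the text space and fuse
    both into a single conditioning sequence before generation."""

    def __init__(self, face_dim: int = 512, cond_dim: int = 4096, nhead: int = 8):
        super().__init__()
        # Facial feature alignment: project face embeddings into the
        # conditioning space shared with text tokens (assumed linear map).
        self.face_proj = nn.Linear(face_dim, cond_dim)
        # Conditional feature alignment: mix the fused token sequence so
        # facial and textual semantics can reconcile before injection.
        self.mix = nn.TransformerEncoderLayer(
            d_model=cond_dim, nhead=nhead, batch_first=True
        )

    def forward(self, face_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # face_tokens: (B, Nf, face_dim); text_tokens: (B, Nt, cond_dim)
        fused = torch.cat([self.face_proj(face_tokens), text_tokens], dim=1)
        return self.mix(fused)  # one unified guidance sequence


class DualBranchInjection(nn.Module):
    """Sketch of (b): facial and textual conditions guide the latents
    through two independent cross-attention branches."""

    def __init__(self, hidden_dim: int = 4096, face_dim: int = 512, nhead: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(hidden_dim, nhead, batch_first=True)
        self.face_attn = nn.MultiheadAttention(
            hidden_dim, nhead, kdim=face_dim, vdim=face_dim, batch_first=True
        )

    def forward(self, latents, face_tokens, text_tokens):
        # Each condition is injected separately, so nothing reconciles
        # conflicting facial vs. textual semantics beforehand.
        text_out, _ = self.text_attn(latents, text_tokens, text_tokens)
        face_out, _ = self.face_attn(latents, face_tokens, face_tokens)
        return latents + text_out + face_out
```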
| Model | Identity Average ↑ | Identity Variation ↓ | Inception Distance ↓ | Dynamic Degree ↑ |
| --- | --- | --- | --- | --- |
| IDAnimator | 0.349 | 0.032 | 159.11 | 0.280 |
| ConsisID | 0.414 | 0.094 | 200.40 | 0.871 |
| pika | 0.329 | 0.091 | 268.35 | 0.954 |
| Ours | 0.516 | 0.075 | 176.53 | 0.955 |
- CogVideo: the DiT module we adapted from, and the VAE module we used.
- SigLip: the vision encoder we used.
If you find our work useful in your research, please consider citing our paper:
@article{wei2025echovideo,
  title={EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion},
  author={Wei, Jiangchuan and Yan, Shiyue and Lin, Wenfeng and Liu, Boyuan and Chen, Renjie and Guo, Mingyu},
  journal={arXiv preprint arXiv:2501.13452},
  year={2025}
}