Hello,
I have a question. I was looking at the CogVideoX pipelines:
https://huggingface.co/docs/diffusers/en/api/pipelines/cogvideox#diffusers.CogVideoXVideoToVideoPipeline
in diffusers, and I noticed that the video-to-video pipeline (class `diffusers.CogVideoXVideoToVideoPipeline`) uses the CogVideoX checkpoint that is text-to-video instead of the image-to-video one, which is counterintuitive. Example:
```python
import torch

from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# The video-to-video pipeline is loaded from the text-to-video checkpoint.
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)

# Input video to transform, guided by the text prompt.
input_video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4")
prompt = ("An astronaut stands triumphantly at the peak of ....")

video = pipe(video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
The logical thing, to me, would be to pair the video-to-video pipeline with the image-to-video checkpoint (THUDM/CogVideoX-5b-I2V), but instead the example uses the text-to-video model ("THUDM/CogVideoX-5b"). Why is this?
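For contrast, this is roughly the pairing I had in mind, based on the documented image-to-video usage of `CogVideoXImageToVideoPipeline` with the THUDM/CogVideoX-5b-I2V checkpoint (the input image path is just a placeholder I made up, not from the docs):

```python
import torch

from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Image-to-video pipeline, loaded from the image-to-video checkpoint.
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Condition on a single image plus a text prompt.
image = load_image("input_frame.png")  # placeholder path for illustration
prompt = ("An astronaut stands triumphantly at the peak of ....")

video = pipe(image=image, prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output_i2v.mp4", fps=8)
```

Given that this I2V checkpoint already exists, I expected the video-to-video pipeline to build on it rather than on the text-to-video one.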