Hello,
I have a question. I was looking at the CogVideoX pipelines:
https://huggingface.co/docs/diffusers/en/api/pipelines/cogvideox#diffusers.CogVideoXVideoToVideoPipeline
in diffusers, and I noticed that the video-to-video pipeline (class `diffusers.CogVideoXVideoToVideoPipeline`) uses the CogVideoX checkpoint that is text-to-video instead of the image-to-video one, which is counterintuitive. Example:
```python
import torch

from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# The video-to-video pipeline is loaded from the text-to-video checkpoint.
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)

# Input video to transform, guided by the text prompt.
input_video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4")
prompt = ("An astronaut stands triumphantly at the peak of ....")

video = pipe(video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
The logical thing, to me, would be to pair the video-to-video pipeline with the image-to-video checkpoint (THUDM/CogVideoX-5b-I2V), but instead the example uses the text-to-video model ("THUDM/CogVideoX-5b"). Why is this?
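For contrast, this is roughly the pairing I had in mind, based on the documented image-to-video usage of `CogVideoXImageToVideoPipeline` with the THUDM/CogVideoX-5b-I2V checkpoint (the input image path is just a placeholder I made up, not from the docs):

```python
import torch

from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Image-to-video pipeline, loaded from the image-to-video checkpoint.
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Condition on a single image plus a text prompt.
image = load_image("input_frame.png")  # placeholder path for illustration
prompt = ("An astronaut stands triumphantly at the peak of ....")

video = pipe(image=image, prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output_i2v.mp4", fps=8)
```

Given that this I2V checkpoint already exists, I expected the video-to-video pipeline to build on it rather than on the text-to-video one.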