LTX Video 0.9.7 #11516
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hello Aryan, please also consider 0.9.6; there are 2 models in it.
Oh, I was under the impression that it was already supported by someone else's PR :/ I'll try to take a look after this PR is complete.
Unfortunately, the output was not up to the mark. I guess there must be a few changes in the code w.r.t. 0.9.6.
Is this correct code for T2V?

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_video
pipe = LTXConditionPipeline.from_pretrained(
"a-r-r-o-w/LTX-Video-0.9.7-diffusers",
torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
prompt = "A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then back up as she speaks; she has brown hair styled in an updo, light brown eyebrows, and is wearing a white collared shirt under her jacket; the camera remains stationary on her face as she speaks; the background is out of focus, but shows trees and people in period clothing; the scene is captured in real-life footage."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
generator = torch.Generator(device="cuda").manual_seed(42)
video = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=1024, # 1216
height=576, # 704
num_frames=121, # 257
num_inference_steps=30,
guidance_scale=1,
generator=generator
).frames[0]
export_to_video(video, "LTX097_1.mp4", fps=24) LTX097_1.mp4 |
Looks correct to me except for the guidance scale. I don't think 0.9.7 is guidance-distilled, so it probably needs a higher guidance scale, like 5.0, to produce good results.
Additionally, the recommended values for decode_timestep and image_cond_noise_scale are used in the snippet below:

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_video
pipe = LTXConditionPipeline.from_pretrained("/raid/aryan/diffusers-ltx/ltx_pipeline", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.vae.enable_tiling()
prompt = "A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then back up as she speaks; she has brown hair styled in an updo, light brown eyebrows, and is wearing a white collared shirt under her jacket; the camera remains stationary on her face as she speaks; the background is out of focus, but shows trees and people in period clothing; the scene is captured in real-life footage."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
generator = torch.Generator().manual_seed(42)
video = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=1024, # 1216
height=576, # 704
num_frames=121, # 257
num_inference_steps=30,
guidance_scale=5.0,
decode_timestep=0.05,
image_cond_noise_scale=0.025,
generator=generator
).frames[0]
export_to_video(video, "output3.mp4", fps=24) output3.mp4 |
Thank you, will try now.
Here's an img2vid version that works, but the results could be better; not sure why.

LTX097_img2vid.5.mp4
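For reference, an image-conditioned call would look roughly like the sketch below; the LTXVideoCondition(image=..., frame_index=0) usage, the input image path, and the placeholder prompt are assumptions based on this PR rather than its verbatim example.

import torch
from diffusers import LTXConditionPipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_image

pipe = LTXConditionPipeline.from_pretrained(
    "a-r-r-o-w/LTX-Video-0.9.7-diffusers",  # unofficial checkpoint, as in the T2V snippet above
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
pipe.vae.enable_tiling()

# Condition the generation on a single image placed at the first frame.
image = load_image("input.png")  # placeholder path
condition = LTXVideoCondition(image=image, frame_index=0)

prompt = "A woman in a blue jacket speaks to the camera; real-life footage."  # placeholder prompt
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
generator = torch.Generator().manual_seed(42)

video = pipe(
    conditions=[condition],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=576,
    num_frames=121,
    num_inference_steps=30,
    guidance_scale=5.0,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=generator,
).frames[0]
export_to_video(video, "img2vid_sketch.mp4", fps=24)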
thanks @a-r-r-o-w
latent_sigma = None
if denoise_strength < 1:
    sigmas, timesteps, num_inference_steps = self.get_timesteps(
Don't we need to call this first to get the timesteps and sigmas, and use those with set_timesteps on the scheduler (retrieve_timesteps)?
Oh right, nice catch. I'll test and fix this tomorrow
I've verified this is the same order as the original inference code and is intended. I think the only thing remaining is moving self._num_timesteps = len(timesteps) below this.
There are some more changes in the original codebase related to removing some boundaries of conditioning latents. I'm not yet sure how effective that is (we should support it anyway), so I'll test more and then merge this PR
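As background for the denoise_strength < 1 branch discussed above, here is a toy illustration of strength-based timestep trimming as done in typical diffusers video-to-video pipelines; the actual get_timesteps in this PR may differ in detail.

import torch

num_inference_steps = 30
denoise_strength = 0.4
# Stand-in sigma schedule; the real one comes from the scheduler after set_timesteps.
sigmas = torch.linspace(1.0, 1.0 / num_inference_steps, num_inference_steps)

# Keep only the last num_inference_steps * denoise_strength steps, so denoising
# starts from a partially noised latent instead of pure noise.
init_steps = min(int(num_inference_steps * denoise_strength), num_inference_steps)
t_start = max(num_inference_steps - init_steps, 0)
trimmed_sigmas = sigmas[t_start:]

print(len(trimmed_sigmas))  # 12 steps are actually run for denoise_strength=0.4
# _num_timesteps should then be set from the trimmed schedule, not the full one.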
I looked at changes related to stripping the latent boundaries and other changes that were added. It seems like these are additional generation quality related improvements and are not required for the basic sampling mechanism in itself. Since we want to keep our pipeline examples as simple as possible, I think it's alright to not support it, and instead we can look at adding more complex logic like this in modular diffusers.
LMK if you'd like me to write a full LTXMultiscalePipeline (similar to the official repo) by combining the normal and upscale LTX pipelines, to wrap the example code logic and the latent-boundary stripping mentioned above.
raise AttributeError("Could not access latents of provided encoder_output")

class LTXLatentUpsamplePipeline(DiffusionPipeline): |
So this is technically just an optional component, I think? It does not have a denoising loop.
OK if it is faster to get the model this way and easier to use; I will leave it up to you to decide.
I think the LTX team may release more models, and handling everything in the same pipeline will make it harder. For example, they have also released a temporal upscaler, but I don't see any inference code for it yet, so I haven't added it here.
Also, the upsampler seems to be usable with other LTX models and not just 0.9.7, so I think it makes sense to keep a separate pipeline; otherwise we'll have to add it to all three pipelines, no?
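As a rough sketch of that standalone usage (the upsampler repo id is a placeholder and the call signature is an assumption based on this PR):

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline

pipe = LTXConditionPipeline.from_pretrained(
    "a-r-r-o-w/LTX-Video-0.9.7-diffusers", torch_dtype=torch.bfloat16
)
# Hypothetical repo id; the upsampler reuses the base pipeline's VAE.
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained(
    "<latent-upsampler-repo>", vae=pipe.vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe_upsample.to("cuda")

# Generate latents with the base pipeline, then upscale them in a single
# latent-to-latent pass; the upsample pipeline has no denoising loop.
latents = pipe(
    prompt="...", width=512, height=288, num_frames=121,
    num_inference_steps=30, guidance_scale=5.0, output_type="latent",
).frames
upscaled_latents = pipe_upsample(latents=latents, output_type="latent").frames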
sounds good -
Using this (instead of pipe.to("cuda")/pipe_upsample.to("cuda")):
in the Full Inference example, I get this error:
That said, the conditional pipeline is a winner! Amazing to be able to load a pipeline supporting txt, img, and video input in one go!
I don't think it will work.
This model is not for me (can't provide code), as it is taking 1 hr for inference (can't complain, as the model is too big for 8 GB VRAM).
@nitinmukesh This code, with bitsandbytes quantizing the transformer, works for txt, img, and vid input and runs on 14 GB VRAM. Maybe sequential_offload will bring it further down (it should properly be added as low_ram handling): https://github.com/tin2tin/Pallaidium/blob/fa1da79faf817227a204da15cdfae4dfb0d5452e/__init__.py#L3170 The error I got was in the "Full Inference" example.
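For reference, quantizing only the transformer with bitsandbytes NF4 in diffusers looks roughly like this (a sketch, not the linked Pallaidium code; the checkpoint id is the unofficial one used earlier in this thread):

import torch
from diffusers import BitsAndBytesConfig, LTXConditionPipeline, LTXVideoTransformer3DModel

repo_id = "a-r-r-o-w/LTX-Video-0.9.7-diffusers"

# Load only the transformer in NF4; the text encoder and VAE stay in bf16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = LTXVideoTransformer3DModel.from_pretrained(
    repo_id, subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.bfloat16
)
pipe = LTXConditionPipeline.from_pretrained(repo_id, transformer=transformer, torch_dtype=torch.bfloat16)
# Model-level offload; sequential offload reportedly conflicts with NF4 (see below).
pipe.enable_model_cpu_offload()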
0.9.6 distilled is no longer needed: I tested with LTXPipeline and LTXImageToVideoPipeline, and the distilled model is working as expected. Time to test 0.9.6-dev.
Unfortunately, sequential offload doesn't work with NF4. I did log an issue, but the response didn't make sense to me. I had quantized the Hunyuan community model to NF4, so it should have worked.
Checkpoints (only for the time being; unofficial):
Standalone latent upscale pipeline test (image):
Standalone latent upscale pipeline test (video):
Full inference:
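A hedged sketch of what a two-stage run might look like (repo ids, the latents round-trip, and the re-denoising kwargs such as denoise_strength are assumptions based on this PR, not its verbatim example):

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.utils import export_to_video

pipe = LTXConditionPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.7-diffusers", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained(
    "<latent-upsampler-repo>", vae=pipe.vae, torch_dtype=torch.bfloat16  # placeholder repo id
)
pipe.to("cuda")
pipe_upsample.to("cuda")
pipe.vae.enable_tiling()

prompt = "A woman in a blue jacket speaks to the camera; real-life footage."  # placeholder prompt
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
generator = torch.Generator().manual_seed(42)

# 1) Generate at a lower resolution and keep the latents.
latents = pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    width=512, height=288, num_frames=121,
    num_inference_steps=30, guidance_scale=5.0,
    generator=generator, output_type="latent",
).frames

# 2) Spatially upscale the latents (single pass, no denoising loop).
upscaled_latents = pipe_upsample(latents=latents, output_type="latent").frames

# 3) Re-denoise the upscaled latents for a few steps at the higher resolution, then decode.
video = pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    latents=upscaled_latents,
    width=1024, height=576, num_frames=121,
    denoise_strength=0.4, num_inference_steps=10,
    guidance_scale=5.0, decode_timestep=0.05, image_cond_noise_scale=0.025,
    generator=generator,
).frames[0]
export_to_video(video, "full_inference.mp4", fps=24)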
output.mp4