diff --git a/README.md b/README.md
index 777beec5..25946f74 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,115 @@
-# LTX-2
+# LTX-2 Optimized (8GB VRAM Edition) + Web UI
+
+This repository contains a **modified and optimized version of the LTX-2 Video Generation Model**, designed specifically to run on consumer hardware with as little as **8GB of VRAM**.
+
+It includes a fully-featured **Gradio Web Interface** to make generating videos, managing presets, and applying LoRAs easy without needing to remember complex command-line arguments.
+## Web UI v2
+
+
+## Web UI v4
+
+
+## CinemaMaker UI
+
+
+* https://youtu.be/eGOq0hUiri4
+* https://youtu.be/HAQqzPdDIj0
+
+## Music to Video UI
+
+
+* https://youtu.be/HzK1nW-OVtQ
+
+
+
+## 🚀 Features
+
+* **8GB VRAM Optimization:** Runs locally on cards like the RTX 3070 / 4060 Ti using FP8 quantization and memory-management tweaks.
+* **Windows 11 Support:** Runs natively on Windows, which the original release does not support.
+* **User-Friendly Web UI:** Control everything from your browser.
+* **Smart "Safe Mode":** The UI automatically limits the frame count based on the selected resolution to prevent Out-Of-Memory (OOM) errors. If you have less than 8GB of free VRAM, try lowering the frame count (see the VRAM check sketch after this list).
+* **Real-time Logging:** View the generation progress and console output directly in the web interface.
+* **Advanced Features:**
+ * **Image Conditioning:** Upload reference images.
+ * **LoRA Support:** Checkbox selection for Camera Control.
+ * **Seed Control:** Reproducible generations.
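+
+As a rough illustration of the kind of check Safe Mode is meant to cover (this is not the actual UI code), you can query free VRAM with plain PyTorch before picking a frame count; the 8GB threshold below mirrors the recommendation above:
+
+```python
+import torch
+
+def free_vram_gb(device_index: int = 0) -> float:
+    """Return free VRAM in GB as reported by the CUDA driver."""
+    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_index)
+    return free_bytes / (1024 ** 3)
+
+if torch.cuda.is_available() and free_vram_gb() < 8.0:
+    print("Less than 8GB of free VRAM: lower the frame count or resolution.")
+```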
+
+## 📥 Model Download & Setup
+
+To run this, you need to download the specific FP8 distilled checkpoints and the Text Encoder.
+
+**1. Create a `models` directory in the root folder:**
+```bash
+mkdir models
+mkdir models/loras
+mkdir models/gemma3
+```
+
+**2. Download the models:**
+* [`ltx-2-19b-distilled-fp8.safetensors`](https://huggingface.co/Lightricks/LTX-2/blob/main/ltx-2-19b-distilled-fp8.safetensors) - [Download](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-distilled-fp8.safetensors)
+* [`ltx-2-spatial-upscaler-x2-1.0.safetensors`](https://huggingface.co/Lightricks/LTX-2/blob/main/ltx-2-spatial-upscaler-x2-1.0.safetensors) - [Download](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-spatial-upscaler-x2-1.0.safetensors)
+* [`Gemma 3`](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized/tree/main)
+```
+./models/
+ ltx-2-19b-distilled-fp8.safetensors
+ ltx-2-spatial-upscaler-x2-1.0.safetensors
+
+./models/gemma3/
+ gemma-3 files
+
+./models/loras/
+ LoRA files here
+```
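+
+If you prefer to script the downloads, here is a minimal sketch using `huggingface_hub` (it assumes the package is installed and, for Gemma 3, that you have accepted the license and logged in with `huggingface-cli login`):
+
+```python
+from huggingface_hub import hf_hub_download, snapshot_download
+
+# Single-file checkpoints from the LTX-2 repo
+for filename in (
+    "ltx-2-19b-distilled-fp8.safetensors",
+    "ltx-2-spatial-upscaler-x2-1.0.safetensors",
+):
+    hf_hub_download(repo_id="Lightricks/LTX-2", filename=filename, local_dir="./models")
+
+# Gemma 3 text encoder (gated repo: accept the license and log in first)
+snapshot_download(
+    repo_id="google/gemma-3-12b-it-qat-q4_0-unquantized",
+    local_dir="./models/gemma3",
+)
+```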
+**3. Install the required packages:**
+```
+# Install the local packages in editable mode:
+pip install -e packages/ltx-pipelines
+pip install -e packages/ltx-core
+
+# Tested environment:
+Python 3.12.8
+accelerate==1.10.1
+torch==2.8.0+cu128
+torchaudio==2.8.0+cu128
+torchvision==0.23.0+cu128
+xformers==0.0.32.post2
+...
+```
+## 🖥️ Usage
+
+Run the web interface with a single command:
+```bash
+python web_ui_v2.py
+# or
+python web_ui_v4.py
+```
+
+## 📊 Performance & Presets (8GB VRAM)
+
+The Web UI includes an "8GB VRAM Safe Mode" checkbox. When enabled, it enforces the following limits so you don't crash your GPU. Estimated inference time on an RTX 3070 Ti laptop GPU is roughly 300-400 sec for all presets.
+
+| Resolution  | Max Frames (i2v) | Max Frames (t2v) | Est. Time (RTX 3070 Ti laptop, 8GB VRAM) |
+| :---------- | :--------------- | :--------------- | :--------------------------------------- |
+| 1280 x 704  | 177              | 257              | ~300-400 sec                             |
+| 1536 x 1024 | 121              | 185              | ~300-400 sec                             |
+| 1920 x 1088 | 81               | 121              | ~300-400 sec                             |
+| 2560 x 1408 | 49               | 65               | ~300-400 sec                             |
+| 3840 x 2176 | 17               | 25               | ~300-400 sec                             |
+
+* Add ~60 sec for prompt processing (if the prompt is not empty and not cached).
+* Time to the stage 1 preview: 80-150 sec.
+* UPD: Optimized the transformer code and increased the maximum frame count for text-to-video by 40%; generation time went from 300-315 sec to 385-415 sec (1280x704 at 11 sec / 24 fps, 1920x1088 at 5 sec / 24 fps).
+* UPD2: Added Web UI v4 with a stage 1 video preview, a task queue, a prompt constructor, and a disable-audio option (10-30% faster inference).
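+
+Purely as an illustration of how the Safe Mode limits from the table above could be applied in code (the names below are made up; the actual enforcement lives in `web_ui_v2.py` / `web_ui_v4.py`):
+
+```python
+# Hypothetical sketch of the "8GB VRAM Safe Mode" limits listed in the README table.
+SAFE_MODE_LIMITS = {
+    # (width, height): (max_frames_i2v, max_frames_t2v)
+    (1280, 704): (177, 257),
+    (1536, 1024): (121, 185),
+    (1920, 1088): (81, 121),
+    (2560, 1408): (49, 65),
+    (3840, 2176): (17, 25),
+}
+
+def clamp_frames(width: int, height: int, requested: int, image_conditioned: bool) -> int:
+    """Clamp the requested frame count to the safe limit for the chosen resolution."""
+    i2v_limit, t2v_limit = SAFE_MODE_LIMITS[(width, height)]
+    limit = i2v_limit if image_conditioned else t2v_limit
+    return min(requested, limit)
+```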
+
+
+## Credits
+
+* Original model: Lightricks (LTX-2)
+* Optimization: nalexand
+* Web UI: created for the community to make this powerful model accessible.
+
+Links to all original model files and LoRAs can be found in the original README section below.
+
+
+## LTX-2
[](https://ltx.io)
[](https://huggingface.co/Lightricks/LTX-2)
diff --git a/film_maker_ui_v4.py b/film_maker_ui_v4.py
new file mode 100644
index 00000000..856548c5
--- /dev/null
+++ b/film_maker_ui_v4.py
@@ -0,0 +1,574 @@
+import gradio as gr
+import subprocess
+import os
+import datetime
+import threading
+import json
+import sys
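+# Gemini API client, used for prompt expansion (see SYSTEM_INSTRUCTION below)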
+import google.generativeai as genai
+from collections import deque
+import cv2 # For frame extraction
+
+# --- Configuration & Defaults ---
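+# Default model paths, matching the directory layout described in the README ("Model Download & Setup")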
+DEFAULT_CHECKPOINT = "./models/ltx-2-19b-distilled-fp8.safetensors"
+DEFAULT_GEMMA = "./models/gemma3"
+DEFAULT_UPSAMPLER = "./models/ltx-2-spatial-upscaler-x2-1.0.safetensors"
+
+# --- Master Prompt ---
+SYSTEM_INSTRUCTION = """
+You are a Creative Assistant. Given a user's raw input prompt describing a scene or concept, expand it into a detailed video generation script split into 5-8 short scenes (5 seconds each).
+Each scene must guide a text-to-video model with specific visuals and integrated audio.
+
+#### Crucial Generation Context
+- We generate scenes in CHRONOLOGICAL ORDER (starting from the first scene and moving towards the last).
+- The FIRST SCENE must be the MOST DETAILED, describing the environment, primary characters, and lighting with high precision to set the standard for the entire chain.
+- Subsequent scenes should maintain this description while focusing on their specific action and ensuring continuity from the previous scene.
+
+#### Continuity & Scene Construction
+- All scenes are connected by shared end/start frames.
+- Environment changes MUST OCCUR INSIDE a scene, not between scenes.
+- Each scene must be a direct continuation of the previous one.
+- Describe explicit CAMERA MOVEMENTS (e.g., "slow dolly in," "pan left," "handheld shake") within each scene.
+- Transitions or scene changes must be described as part of the visual action within the 5-second block.
+
+#### Guidelines
+- Strictly follow all aspects of the user's raw input.
+- If the input is vague, invent concrete details: lighting, textures, materials, scene settings, etc.
+- For characters: describe gender, clothing, hair, expressions. DO NOT invent unrequested characters.
+- NO SPEECH: Characters do not speak (this model produces video and background audio only). Describe reactions, expressions, and physical movements instead.
+- Use active language: present-progressive verbs ("is walking," "is grasping").
+- Maintain chronological flow within scenes: use temporal connectors ("as," "then," "while").
+- Audio layer: Describe the complete soundscape, integrated chronologically. Be specific (e.g., "distant thunder," "rustling leaves," "mechanical hum").
+- Style: Include visual style at the beginning: "Style: