Add tokenizer cache, and offloading for acestep-5Hz-lm-4B to fit on 8GB vram #265
Conversation
Tested on 8 GB VRAM with all models (0.6B, 1.7B, 4B). Quality is not very good: there are loudness problems (sometimes too low, but most of the time too loud, over the limit), and the sound is not clean, not production quality. HeartMula gives better quality but is less controllable, and tags don't work.

If you aren't getting better results than HeartMula, you must be using it incorrectly, or there's an environment issue causing bugs in generation. HeartMula is genuinely terrible.
Pull request overview
Adds faster LLM startup and lower-VRAM operation by caching the tokenizer and supporting LLM unload/reload plus device-map based offloading to help the 4B 5Hz LM fit on 8GB VRAM.
Changes:
- Add a gzip+pickle tokenizer cache to drastically reduce tokenizer load time on subsequent runs (a rough sketch of the pattern follows this list).
- Switch LLM/DiT loading to `device_map="auto"` offloading patterns and move inputs to the model's actual device.
- Add an LLM unload/reload flow and adjust UI/config defaults aimed at 8GB VRAM usage.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| acestep/llm_inference.py | Tokenizer caching + LLM device-map loading, device-aware forward pass, unload/reload lifecycle. |
| acestep/inference.py | Reload LLM on-demand and unload it before DiT/VAE to free VRAM. |
| acestep/handler.py | Add optional Accelerate offloading for the main model; device-aware decoding and autocast tweaks. |
| acestep/gradio_ui/events/results_handlers.py | Change default batch size from 2 to 1 in batch navigation/storage. |
| acestep/gradio_ui/events/generation_handlers.py | Add one more gr.update() output to keep UI outputs aligned. |
| acestep/gpu_config.py | Expand available LM models (incl. 4B) and adjust batch-size defaults per GPU tier. |
| acestep/acestep_v15_pipeline.py | Disable automatic offloading unconditionally. |
```python
with gzip.open(cache_path, "rb") as f:
    tokenizer = pickle.load(f)
load_time = time.time() - start_time
```
Copilot AI · Feb 7, 2026
Loading a tokenizer via pickle.load() from a file on disk is unsafe because pickle is code-executable if the cache file is ever tampered with. Prefer a non-executable cache format (e.g., tokenizer.save_pretrained(cache_dir) + AutoTokenizer.from_pretrained(cache_dir)), or at minimum add strong safeguards (e.g., opt-in flag, verify cache file ownership/permissions, and handle/avoid untrusted paths).
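
A minimal sketch of the non-executable cache format the reviewer suggests, assuming a hypothetical `cache_dir` location and helper name (not code from this PR):

```python
import os

from transformers import AutoTokenizer


def load_tokenizer_safe_cache(model_path: str, cache_dir: str):
    """Cache the tokenizer as the plain files written by save_pretrained().

    Unlike a gzip+pickle blob, these are standard tokenizer files rather than
    a pickle that executes arbitrary code when deserialized.
    """
    if os.path.isdir(cache_dir) and os.listdir(cache_dir):
        # Cache hit: load directly from the previously saved files.
        return AutoTokenizer.from_pretrained(cache_dir, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    tokenizer.save_pretrained(cache_dir)
    return tokenizer
```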
```python
first_device = next(iter(model.hf_device_map.values()))
return torch.device(first_device)
```
Copilot AI · Feb 7, 2026
hf_device_map values are not guaranteed to be directly accepted by torch.device(...) (they may be ints like 0, or strings like 'disk'). This can raise at runtime. Consider normalizing common cases (e.g., int -> f"cuda:{i}", 'disk' -> 'cpu') and falling back safely when the value is not a valid torch.device spec.
Suggested change (replacing the two lines above):

```python
first_device_raw = next(iter(model.hf_device_map.values()))
# Normalize common hf_device_map values so they are acceptable to torch.device
normalized_device: Union[str, int]
if isinstance(first_device_raw, int):
    # Interpret integers as CUDA device indices when available, otherwise fall back to CPU
    if torch.cuda.is_available():
        normalized_device = f"cuda:{first_device_raw}"
    else:
        normalized_device = "cpu"
elif isinstance(first_device_raw, str):
    if first_device_raw == "disk":
        # 'disk' is not a valid torch.device; default to CPU for computations
        normalized_device = "cpu"
    else:
        normalized_device = first_device_raw
else:
    # Fallback: rely on string representation, may still fail below
    normalized_device = str(first_device_raw)
try:
    return torch.device(normalized_device)
except (TypeError, ValueError):
    # If we cannot interpret the hf_device_map entry as a torch.device,
    # fall back to parameter-based device detection below.
    logger.warning(
        f"Could not interpret hf_device_map entry {first_device_raw!r} as a torch.device; "
        "falling back to model parameter device."
    )
```
```python
self.llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    max_memory={0: "7GiB", "cpu": "32GiB"},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
```
Copilot AI · Feb 7, 2026
This hard-codes device_map, max_memory, and torch_dtype=torch.bfloat16. On systems without CUDA (or GPUs without bf16 support), or when the application config expects self.dtype / manual offloading behavior, this can fail or change behavior unexpectedly. Consider deriving device_map/max_memory from runtime capabilities + user config (e.g., only pass GPU memory keys when CUDA is available; respect self.dtype; and keep manual offload mode working when requested).
Suggested change (replacing the hard-coded call above):

```python
# Build loading configuration based on runtime capabilities and optional handler config
load_kwargs: Dict[str, Any] = {
    "trust_remote_code": True,
}
# Respect an explicitly configured dtype on the handler if present
configured_dtype = getattr(self, "dtype", None)
if configured_dtype is not None:
    load_kwargs["torch_dtype"] = configured_dtype
else:
    # Fallback: use bfloat16 only when supported; otherwise default dtype (e.g., float32)
    if torch.cuda.is_available() and getattr(torch.cuda, "is_bf16_supported", lambda: False)():
        load_kwargs["torch_dtype"] = torch.bfloat16
# Only configure device_map / max_memory when CUDA is available and not explicitly on CPU
if torch.cuda.is_available() and str(device).lower() != "cpu":
    load_kwargs["device_map"] = "auto"
    load_kwargs["max_memory"] = {0: "7GiB", "cpu": "32GiB"}
self.llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    **load_kwargs,
```
```python
if 'batch_inputs_tokenized' in locals(): del batch_inputs_tokenized
if 'batch_input_ids' in locals(): del batch_input_ids
if 'outputs' in locals(): del outputs
torch.cuda.empty_cache()
```
Copilot AI · Feb 7, 2026
Calling torch.cuda.empty_cache() unconditionally on every request can significantly degrade performance due to allocator churn; it’s usually better to reserve it for true OOM recovery paths or guarded low-VRAM modes. At minimum, gate this with torch.cuda.is_available() and consider making the aggressive cleanup optional/configurable (especially if this runs in a hot path).
Suggested change:

```python
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```
```python
#if not self.offload_to_cpu:
#    self.model = self.model.to(device).to(self.dtype)
#else:
#    # If offload_to_cpu is True, check if we should keep DiT on GPU
#    if not self.offload_dit_to_cpu:
#        logger.info(f"[initialize_service] Keeping main model on {device} (persistent)")
#        self.model = self.model.to(device).to(self.dtype)
#    else:
#        self.model = self.model.to("cpu").to(self.dtype)
#self.model.eval()
```
Copilot AI · Feb 7, 2026
self.model.eval() is now commented out, which leaves the model in training mode by default after from_pretrained(). That can change inference behavior (e.g., dropout, layernorm stats depending on model) and is a correctness issue. Ensure self.model.eval() is called after loading (and also after any compile/wrapping that might alter mode).
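
A minimal sketch of a fix, mirroring the commented-out block above (whether the device/dtype moves are also needed depends on the offload path chosen; the essential part is calling `eval()` after loading and after any compile/wrapping):

```python
# Restore the placement logic that was commented out, then force inference mode.
if not self.offload_to_cpu:
    self.model = self.model.to(device).to(self.dtype)
elif not self.offload_dit_to_cpu:
    logger.info(f"[initialize_service] Keeping main model on {device} (persistent)")
    self.model = self.model.to(device).to(self.dtype)
else:
    self.model = self.model.to("cpu").to(self.dtype)

# eval() disables dropout and other training-mode behavior for inference.
self.model.eval()
```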
```python
if hasattr(model, "hf_device_map") and model.hf_device_map:
    # Still ensure silence_latent and other ancillary tensors are on device if needed
    if model_name == "model" and hasattr(self, "silence_latent"):
    self.silence_latent = self.silence_latent.to(self.device).to(self.dtype)
    yield
```
Copilot AI · Feb 7, 2026
Line 695 appears mis-indented relative to the if model_name == ... block, which would cause an IndentationError at import/runtime. Fix the indentation so the assignment is correctly nested under the conditional.
| "available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B", "acestep-5Hz-lm-4B"], | ||
| "lm_memory_gb": {"0.6B": 3, "1.7B": 8, "4B": 12}, |
Copilot AI · Feb 7, 2026
Tier3 is labeled 6–8GB, but it now advertises the 4B LM while also stating it needs 12GB (lm_memory_gb['4B']=12). This is internally inconsistent and can confuse any selection/guard logic that relies on these fields. Consider either (a) moving 4B to tiers where the memory requirement matches, or (b) updating lm_memory_gb and/or documenting that offloading enables 4B on 8GB so the config remains coherent.
| "available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B", "acestep-5Hz-lm-4B"], | |
| "lm_memory_gb": {"0.6B": 3, "1.7B": 8, "4B": 12}, | |
| "available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B"], | |
| "lm_memory_gb": {"0.6B": 3, "1.7B": 8}, |
```python
#auto_offload = gpu_memory_gb > 0 and gpu_memory_gb < 16
auto_offload = False
```
Copilot AI · Feb 7, 2026
This change disables auto_offload for all GPUs, including low-VRAM setups, which can directly undermine the PR’s goal of fitting on 8GB. If disabling auto-offload is only for debugging, consider keeping the original heuristic and gating the override behind an explicit CLI flag/env var so production behavior doesn’t regress.
Suggested change (replacing the two lines above):

```python
# Default heuristic: enable auto offload for GPUs with >0 and <16GB VRAM
auto_offload = gpu_memory_gb > 0 and gpu_memory_gb < 16
# Optional override via environment variable (for debugging/experiments)
# ACESTEP_AUTO_OFFLOAD can be set to: 1/true/yes/on or 0/false/no/off
_auto_offload_env = os.getenv("ACESTEP_AUTO_OFFLOAD")
if _auto_offload_env is not None:
    _val = _auto_offload_env.strip().lower()
    if _val in ("1", "true", "yes", "on"):
        auto_offload = True
    elif _val in ("0", "false", "no", "off"):
        auto_offload = False
```