Add tokenizer cache, and offloading for acestep-5Hz-lm-4B to fit on 8GB vram #265
Conversation
Tested on 8 GB VRAM with all models (0.6B, 1.7B, 4B). Quality is not very good: there are loudness problems (sometimes too low, but most of the time too loud, over the limit), and the sound is not clean, not production quality. HeartMula gives better quality but is less controllable, and tags don't work.

If you aren't getting better results than HeartMula, you must be using it incorrectly, or there's an environment issue causing bugs in generation. HeartMula is genuinely terrible.
Pull request overview
Adds faster LLM startup and lower-VRAM operation by caching the tokenizer and supporting LLM unload/reload plus device-map based offloading to help the 4B 5Hz LM fit on 8GB VRAM.
Changes:
- Add a gzip+pickle tokenizer cache to drastically reduce tokenizer load time on subsequent runs (a rough sketch of the pattern follows this list).
- Switch LLM/DiT loading to `device_map="auto"` offloading patterns and move inputs to the model's actual device.
- Add an LLM unload/reload flow and adjust UI/config defaults aimed at 8GB VRAM usage.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| acestep/llm_inference.py | Tokenizer caching + LLM device-map loading, device-aware forward pass, unload/reload lifecycle. |
| acestep/inference.py | Reload LLM on-demand and unload it before DiT/VAE to free VRAM. |
| acestep/handler.py | Add optional Accelerate offloading for the main model; device-aware decoding and autocast tweaks. |
| acestep/gradio_ui/events/results_handlers.py | Change default batch size from 2 to 1 in batch navigation/storage. |
| acestep/gradio_ui/events/generation_handlers.py | Add one more gr.update() output to keep UI outputs aligned. |
| acestep/gpu_config.py | Expand available LM models (incl. 4B) and adjust batch-size defaults per GPU tier. |
| acestep/acestep_v15_pipeline.py | Disable automatic offloading unconditionally. |
```python
with gzip.open(cache_path, "rb") as f:
    tokenizer = pickle.load(f)
load_time = time.time() - start_time
```
Copilot AI · Feb 7, 2026
Loading a tokenizer via pickle.load() from a file on disk is unsafe because pickle is code-executable if the cache file is ever tampered with. Prefer a non-executable cache format (e.g., tokenizer.save_pretrained(cache_dir) + AutoTokenizer.from_pretrained(cache_dir)), or at minimum add strong safeguards (e.g., opt-in flag, verify cache file ownership/permissions, and handle/avoid untrusted paths).
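
A minimal sketch of the non-executable cache format the reviewer suggests, assuming a hypothetical `cache_dir` location and helper name (not code from this PR):

```python
import os

from transformers import AutoTokenizer


def load_tokenizer_safe_cache(model_path: str, cache_dir: str):
    """Cache the tokenizer as the plain files written by save_pretrained().

    Unlike a gzip+pickle blob, these are standard tokenizer files rather than
    a pickle that executes arbitrary code when deserialized.
    """
    if os.path.isdir(cache_dir) and os.listdir(cache_dir):
        # Cache hit: load directly from the previously saved files.
        return AutoTokenizer.from_pretrained(cache_dir, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    tokenizer.save_pretrained(cache_dir)
    return tokenizer
```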
```python
first_device = next(iter(model.hf_device_map.values()))
return torch.device(first_device)
```
Copilot AI · Feb 7, 2026
hf_device_map values are not guaranteed to be directly accepted by torch.device(...) (they may be ints like 0, or strings like 'disk'). This can raise at runtime. Consider normalizing common cases (e.g., int -> f"cuda:{i}", 'disk' -> 'cpu') and falling back safely when the value is not a valid torch.device spec.
Suggested change (replacing the two lines above):

```python
first_device_raw = next(iter(model.hf_device_map.values()))
# Normalize common hf_device_map values so they are acceptable to torch.device
normalized_device: Union[str, int]
if isinstance(first_device_raw, int):
    # Interpret integers as CUDA device indices when available, otherwise fall back to CPU
    if torch.cuda.is_available():
        normalized_device = f"cuda:{first_device_raw}"
    else:
        normalized_device = "cpu"
elif isinstance(first_device_raw, str):
    if first_device_raw == "disk":
        # 'disk' is not a valid torch.device; default to CPU for computations
        normalized_device = "cpu"
    else:
        normalized_device = first_device_raw
else:
    # Fallback: rely on string representation, may still fail below
    normalized_device = str(first_device_raw)
try:
    return torch.device(normalized_device)
except (TypeError, ValueError):
    # If we cannot interpret the hf_device_map entry as a torch.device,
    # fall back to parameter-based device detection below.
    logger.warning(
        f"Could not interpret hf_device_map entry {first_device_raw!r} as a torch.device; "
        "falling back to model parameter device."
    )
```
```python
self.llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    max_memory={0: "7GiB", "cpu": "32GiB"},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
```
Copilot AI · Feb 7, 2026
This hard-codes device_map, max_memory, and torch_dtype=torch.bfloat16. On systems without CUDA (or GPUs without bf16 support), or when the application config expects self.dtype / manual offloading behavior, this can fail or change behavior unexpectedly. Consider deriving device_map/max_memory from runtime capabilities + user config (e.g., only pass GPU memory keys when CUDA is available; respect self.dtype; and keep manual offload mode working when requested).
Suggested change (replacing the hard-coded call above):

```python
# Build loading configuration based on runtime capabilities and optional handler config
load_kwargs: Dict[str, Any] = {
    "trust_remote_code": True,
}
# Respect an explicitly configured dtype on the handler if present
configured_dtype = getattr(self, "dtype", None)
if configured_dtype is not None:
    load_kwargs["torch_dtype"] = configured_dtype
else:
    # Fallback: use bfloat16 only when supported; otherwise default dtype (e.g., float32)
    if torch.cuda.is_available() and getattr(torch.cuda, "is_bf16_supported", lambda: False)():
        load_kwargs["torch_dtype"] = torch.bfloat16
# Only configure device_map / max_memory when CUDA is available and not explicitly on CPU
if torch.cuda.is_available() and str(device).lower() != "cpu":
    load_kwargs["device_map"] = "auto"
    load_kwargs["max_memory"] = {0: "7GiB", "cpu": "32GiB"}
self.llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    **load_kwargs,
```
```python
if 'batch_inputs_tokenized' in locals(): del batch_inputs_tokenized
if 'batch_input_ids' in locals(): del batch_input_ids
if 'outputs' in locals(): del outputs
torch.cuda.empty_cache()
```
Copilot AI · Feb 7, 2026
Calling torch.cuda.empty_cache() unconditionally on every request can significantly degrade performance due to allocator churn; it’s usually better to reserve it for true OOM recovery paths or guarded low-VRAM modes. At minimum, gate this with torch.cuda.is_available() and consider making the aggressive cleanup optional/configurable (especially if this runs in a hot path).
Suggested change:

```python
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```
```python
#if not self.offload_to_cpu:
#    self.model = self.model.to(device).to(self.dtype)
#else:
#    # If offload_to_cpu is True, check if we should keep DiT on GPU
#    if not self.offload_dit_to_cpu:
#        logger.info(f"[initialize_service] Keeping main model on {device} (persistent)")
#        self.model = self.model.to(device).to(self.dtype)
#    else:
#        self.model = self.model.to("cpu").to(self.dtype)
#self.model.eval()
```
Copilot AI · Feb 7, 2026
self.model.eval() is now commented out, which leaves the model in training mode by default after from_pretrained(). That can change inference behavior (e.g., dropout, layernorm stats depending on model) and is a correctness issue. Ensure self.model.eval() is called after loading (and also after any compile/wrapping that might alter mode).
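
A minimal sketch of a fix, mirroring the commented-out block above (whether the device/dtype moves are also needed depends on the offload path chosen; the essential part is calling `eval()` after loading and after any compile/wrapping):

```python
# Restore the placement logic that was commented out, then force inference mode.
if not self.offload_to_cpu:
    self.model = self.model.to(device).to(self.dtype)
elif not self.offload_dit_to_cpu:
    logger.info(f"[initialize_service] Keeping main model on {device} (persistent)")
    self.model = self.model.to(device).to(self.dtype)
else:
    self.model = self.model.to("cpu").to(self.dtype)

# eval() disables dropout and other training-mode behavior for inference.
self.model.eval()
```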
```python
if hasattr(model, "hf_device_map") and model.hf_device_map:
    # Still ensure silence_latent and other ancillary tensors are on device if needed
    if model_name == "model" and hasattr(self, "silence_latent"):
    self.silence_latent = self.silence_latent.to(self.device).to(self.dtype)
    yield
```
Copilot AI · Feb 7, 2026
Line 695 appears mis-indented relative to the if model_name == ... block, which would cause an IndentationError at import/runtime. Fix the indentation so the assignment is correctly nested under the conditional.
| "available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B", "acestep-5Hz-lm-4B"], | ||
| "lm_memory_gb": {"0.6B": 3, "1.7B": 8, "4B": 12}, |
Copilot AI · Feb 7, 2026
Tier3 is labeled 6–8GB, but it now advertises the 4B LM while also stating it needs 12GB (lm_memory_gb['4B']=12). This is internally inconsistent and can confuse any selection/guard logic that relies on these fields. Consider either (a) moving 4B to tiers where the memory requirement matches, or (b) updating lm_memory_gb and/or documenting that offloading enables 4B on 8GB so the config remains coherent.
| "available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B", "acestep-5Hz-lm-4B"], | |
| "lm_memory_gb": {"0.6B": 3, "1.7B": 8, "4B": 12}, | |
| "available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B"], | |
| "lm_memory_gb": {"0.6B": 3, "1.7B": 8}, |
```python
#auto_offload = gpu_memory_gb > 0 and gpu_memory_gb < 16
auto_offload = False
```
Copilot AI · Feb 7, 2026
This change disables auto_offload for all GPUs, including low-VRAM setups, which can directly undermine the PR’s goal of fitting on 8GB. If disabling auto-offload is only for debugging, consider keeping the original heuristic and gating the override behind an explicit CLI flag/env var so production behavior doesn’t regress.
Suggested change (replacing the two lines above):

```python
# Default heuristic: enable auto offload for GPUs with >0 and <16GB VRAM
auto_offload = gpu_memory_gb > 0 and gpu_memory_gb < 16
# Optional override via environment variable (for debugging/experiments)
# ACESTEP_AUTO_OFFLOAD can be set to: 1/true/yes/on or 0/false/no/off
_auto_offload_env = os.getenv("ACESTEP_AUTO_OFFLOAD")
if _auto_offload_env is not None:
    _val = _auto_offload_env.strip().lower()
    if _val in ("1", "true", "yes", "on"):
        auto_offload = True
    elif _val in ("0", "false", "no", "off"):
        auto_offload = False
```