Conversation

nalexand commented Feb 7, 2026

No description provided.

nalexand and others added 4 commits February 7, 2026 04:46
nalexand (Author) commented Feb 7, 2026

Tested on 8 GB VRAM with all models (0.6B, 1.7B, 4B). Quality is not very good: there are loudness problems (sometimes too low, but most of the time too loud, over the limit), and the sound is not clean, not production quality. HeartMula produces better quality but is less controllable, and tags don't work.

ChuxiJ (Contributor) commented Feb 7, 2026

If you aren't getting better results than HeartMula, you must be using it incorrectly, or there’s an environment issue causing bugs in generation. HeartMula is genuinely terrible.

ChuxiJ requested a review from Copilot February 7, 2026 07:11
Copilot AI (Contributor) left a comment

Pull request overview

Adds faster LLM startup and lower-VRAM operation by caching the tokenizer, supporting LLM unload/reload, and using device-map based offloading so the 4B 5Hz LM can fit on 8GB VRAM.

Changes:

  • Add gzip+pickle tokenizer cache to drastically reduce tokenizer load time on subsequent runs.
  • Switch LLM/DiT loading to device_map="auto" offloading patterns and move inputs to the model’s actual device.
  • Add LLM unload/reload flow and adjust UI/config defaults aimed at 8GB VRAM usage.
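For reference, the unload/reload flow in the last bullet follows the usual drop-references-then-free pattern. A minimal sketch of that pattern is shown below; the helper names, the handler object, and the model_path argument are placeholders for illustration, not the repo's actual API:

```python
import gc
import torch
from transformers import AutoModelForCausalLM


def unload_llm(handler) -> None:
    """Drop the LLM and return its VRAM to the allocator before DiT/VAE run."""
    handler.llm = None            # drop the strong reference to the module tree
    gc.collect()                  # let Python actually collect it
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver


def reload_llm(handler, model_path: str) -> None:
    """Re-create the LLM with device-map offloading when it is needed again."""
    handler.llm = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",        # Accelerate splits layers across GPU/CPU as memory allows
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    handler.llm.eval()
```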

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Summary per file:

  • acestep/llm_inference.py: Tokenizer caching + LLM device-map loading, device-aware forward pass, unload/reload lifecycle.
  • acestep/inference.py: Reload LLM on demand and unload it before DiT/VAE to free VRAM.
  • acestep/handler.py: Add optional Accelerate offloading for the main model; device-aware decoding and autocast tweaks.
  • acestep/gradio_ui/events/results_handlers.py: Change default batch size from 2 to 1 in batch navigation/storage.
  • acestep/gradio_ui/events/generation_handlers.py: Add one more gr.update() output to keep UI outputs aligned.
  • acestep/gpu_config.py: Expand available LM models (incl. 4B) and adjust batch-size defaults per GPU tier.
  • acestep/acestep_v15_pipeline.py: Disable automatic offloading unconditionally.


Comment on lines +124 to +126:

```python
with gzip.open(cache_path, "rb") as f:
    tokenizer = pickle.load(f)
load_time = time.time() - start_time
```
Copilot AI commented Feb 7, 2026

Loading a tokenizer via pickle.load() from a file on disk is unsafe because unpickling can execute arbitrary code if the cache file is ever tampered with. Prefer a non-executable cache format (e.g., tokenizer.save_pretrained(cache_dir) + AutoTokenizer.from_pretrained(cache_dir)), or at minimum add strong safeguards (e.g., an opt-in flag, cache file ownership/permission checks, and handling/avoiding untrusted paths).
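A minimal sketch of the non-executable cache the reviewer suggests, assuming a caller-chosen cache_dir and the standard Hugging Face AutoTokenizer API; the helper name is illustrative, and flags such as trust_remote_code depend on the tokenizer actually used in this repo:

```python
import os
from transformers import AutoTokenizer


def load_tokenizer_cached(model_path: str, cache_dir: str):
    """Load a tokenizer, reusing an on-disk cache of plain tokenizer files when present."""
    if os.path.isdir(cache_dir) and os.listdir(cache_dir):
        # Cache hit: the directory holds tokenizer/config files, not a pickled object.
        return AutoTokenizer.from_pretrained(cache_dir, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    os.makedirs(cache_dir, exist_ok=True)
    tokenizer.save_pretrained(cache_dir)  # writes tokenizer.json / vocab / config files
    return tokenizer
```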

Comment on lines +380 to +381:

```python
first_device = next(iter(model.hf_device_map.values()))
return torch.device(first_device)
```
Copilot AI commented Feb 7, 2026

hf_device_map values are not guaranteed to be directly accepted by torch.device(...) (they may be ints like 0, or strings like 'disk'). This can raise at runtime. Consider normalizing common cases (e.g., int -> f"cuda:{i}", 'disk' -> 'cpu') and falling back safely when the value is not a valid torch.device spec.

Suggested change:

```diff
-first_device = next(iter(model.hf_device_map.values()))
-return torch.device(first_device)
+first_device_raw = next(iter(model.hf_device_map.values()))
+# Normalize common hf_device_map values so they are acceptable to torch.device
+normalized_device: Union[str, int]
+if isinstance(first_device_raw, int):
+    # Interpret integers as CUDA device indices when available, otherwise fall back to CPU
+    if torch.cuda.is_available():
+        normalized_device = f"cuda:{first_device_raw}"
+    else:
+        normalized_device = "cpu"
+elif isinstance(first_device_raw, str):
+    if first_device_raw == "disk":
+        # 'disk' is not a valid torch.device; default to CPU for computations
+        normalized_device = "cpu"
+    else:
+        normalized_device = first_device_raw
+else:
+    # Fallback: rely on string representation, may still fail below
+    normalized_device = str(first_device_raw)
+try:
+    return torch.device(normalized_device)
+except (TypeError, ValueError):
+    # If we cannot interpret the hf_device_map entry as a torch.device,
+    # fall back to parameter-based device detection below.
+    logger.warning(
+        f"Could not interpret hf_device_map entry {first_device_raw!r} as a torch.device; "
+        "falling back to model parameter device."
+    )
```

Comment on lines +316 to +321:

```python
self.llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    max_memory={0: "7GiB", "cpu": "32GiB"},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
```
Copilot AI commented Feb 7, 2026

This hard-codes device_map, max_memory, and torch_dtype=torch.bfloat16. On systems without CUDA (or GPUs without bf16 support), or when the application config expects self.dtype / manual offloading behavior, this can fail or change behavior unexpectedly. Consider deriving device_map/max_memory from runtime capabilities + user config (e.g., only pass GPU memory keys when CUDA is available; respect self.dtype; and keep manual offload mode working when requested).

Suggested change:

```diff
-self.llm = AutoModelForCausalLM.from_pretrained(
-    model_path,
-    device_map="auto",
-    max_memory={0: "7GiB", "cpu": "32GiB"},
-    torch_dtype=torch.bfloat16,
-    trust_remote_code=True
+# Build loading configuration based on runtime capabilities and optional handler config
+load_kwargs: Dict[str, Any] = {
+    "trust_remote_code": True,
+}
+# Respect an explicitly configured dtype on the handler if present
+configured_dtype = getattr(self, "dtype", None)
+if configured_dtype is not None:
+    load_kwargs["torch_dtype"] = configured_dtype
+else:
+    # Fallback: use bfloat16 only when supported; otherwise default dtype (e.g., float32)
+    if torch.cuda.is_available() and getattr(torch.cuda, "is_bf16_supported", lambda: False)():
+        load_kwargs["torch_dtype"] = torch.bfloat16
+# Only configure device_map / max_memory when CUDA is available and not explicitly on CPU
+if torch.cuda.is_available() and str(device).lower() != "cpu":
+    load_kwargs["device_map"] = "auto"
+    load_kwargs["max_memory"] = {0: "7GiB", "cpu": "32GiB"}
+self.llm = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    **load_kwargs,
```

```python
if 'batch_inputs_tokenized' in locals(): del batch_inputs_tokenized
if 'batch_input_ids' in locals(): del batch_input_ids
if 'outputs' in locals(): del outputs
torch.cuda.empty_cache()
```
Copilot AI commented Feb 7, 2026

Calling torch.cuda.empty_cache() unconditionally on every request can significantly degrade performance due to allocator churn; it’s usually better to reserve it for true OOM recovery paths or guarded low-VRAM modes. At minimum, gate this with torch.cuda.is_available() and consider making the aggressive cleanup optional/configurable (especially if this runs in a hot path).

Suggested change:

```diff
-torch.cuda.empty_cache()
+if torch.cuda.is_available():
+    torch.cuda.empty_cache()
```

Comment on lines +438 to +447:

```python
#if not self.offload_to_cpu:
#    self.model = self.model.to(device).to(self.dtype)
#else:
#    # If offload_to_cpu is True, check if we should keep DiT on GPU
#    if not self.offload_dit_to_cpu:
#        logger.info(f"[initialize_service] Keeping main model on {device} (persistent)")
#        self.model = self.model.to(device).to(self.dtype)
#    else:
#        self.model = self.model.to("cpu").to(self.dtype)
#self.model.eval()
```
Copilot AI commented Feb 7, 2026

self.model.eval() is now commented out, which leaves the model in training mode by default after from_pretrained(). That can change inference behavior (e.g., dropout, layernorm stats depending on model) and is a correctness issue. Ensure self.model.eval() is called after loading (and also after any compile/wrapping that might alter mode).
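A minimal sketch of the requested fix, written generically since the exact loader in acestep/handler.py is not shown here; finish_loading is a hypothetical helper used only for illustration:

```python
import torch
import torch.nn as nn


def finish_loading(model: nn.Module, device: torch.device, dtype: torch.dtype) -> nn.Module:
    """Place a freshly constructed model and make inference mode explicit."""
    model = model.to(device=device, dtype=dtype)
    # nn.Module starts in training mode; call eval() after any wrapping/compiling
    # so dropout is disabled and norm layers use inference behavior.
    model.eval()
    return model
```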

Comment on lines +692 to +696:

```python
if hasattr(model, "hf_device_map") and model.hf_device_map:
    # Still ensure silence_latent and other ancillary tensors are on device if needed
    if model_name == "model" and hasattr(self, "silence_latent"):
    self.silence_latent = self.silence_latent.to(self.device).to(self.dtype)
    yield
```
Copilot AI commented Feb 7, 2026

Line 695 appears mis-indented relative to the if model_name == ... block, which would cause an IndentationError at import/runtime. Fix the indentation so the assignment is correctly nested under the conditional.
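For reference, the intended nesting would look roughly like this (reconstructed from the snippet above, so the surrounding lines may differ in the actual file):

```python
if hasattr(model, "hf_device_map") and model.hf_device_map:
    # Still ensure silence_latent and other ancillary tensors are on device if needed
    if model_name == "model" and hasattr(self, "silence_latent"):
        self.silence_latent = self.silence_latent.to(self.device).to(self.dtype)
    yield
```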

Comment on lines +71 to +72:

```python
"available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B", "acestep-5Hz-lm-4B"],
"lm_memory_gb": {"0.6B": 3, "1.7B": 8, "4B": 12},
```
Copilot AI commented Feb 7, 2026

Tier3 is labeled 6–8GB, but it now advertises the 4B LM while also stating it needs 12GB (lm_memory_gb['4B']=12). This is internally inconsistent and can confuse any selection/guard logic that relies on these fields. Consider either (a) moving 4B to tiers where the memory requirement matches, or (b) updating lm_memory_gb and/or documenting that offloading enables 4B on 8GB so the config remains coherent.

Suggested change:

```diff
-"available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B", "acestep-5Hz-lm-4B"],
-"lm_memory_gb": {"0.6B": 3, "1.7B": 8, "4B": 12},
+"available_lm_models": ["acestep-5Hz-lm-0.6B", "acestep-5Hz-lm-1.7B"],
+"lm_memory_gb": {"0.6B": 3, "1.7B": 8},
```

Comment on lines +94 to +95:

```python
#auto_offload = gpu_memory_gb > 0 and gpu_memory_gb < 16
auto_offload = False
```
Copilot AI commented Feb 7, 2026

This change disables auto_offload for all GPUs, including low-VRAM setups, which can directly undermine the PR’s goal of fitting on 8GB. If disabling auto-offload is only for debugging, consider keeping the original heuristic and gating the override behind an explicit CLI flag/env var so production behavior doesn’t regress.

Suggested change:

```diff
-#auto_offload = gpu_memory_gb > 0 and gpu_memory_gb < 16
-auto_offload = False
+# Default heuristic: enable auto offload for GPUs with >0 and <16GB VRAM
+auto_offload = gpu_memory_gb > 0 and gpu_memory_gb < 16
+# Optional override via environment variable (for debugging/experiments)
+# ACESTEP_AUTO_OFFLOAD can be set to: 1/true/yes/on or 0/false/no/off
+_auto_offload_env = os.getenv("ACESTEP_AUTO_OFFLOAD")
+if _auto_offload_env is not None:
+    _val = _auto_offload_env.strip().lower()
+    if _val in ("1", "true", "yes", "on"):
+        auto_offload = True
+    elif _val in ("0", "false", "no", "off"):
+        auto_offload = False
```

ChuxiJ closed this Feb 11, 2026