feat: GPU compatibility tier system with boundary testing #417
Conversation
## Summary
Add a comprehensive GPU tier configuration system and boundary testing framework to determine minimum VRAM requirements for different optimization levels.

## GPU Tier System (acestep/gpu_config.py)
- Add GPUConfig dataclass with per-tier settings for quantization, offload, LM models, batch size, and duration limits
- Implement automatic GPU tier detection based on available VRAM
- Support VRAM simulation via MAX_CUDA_VRAM environment variable with hard VRAM cap enforcement using torch memory fraction
- Define tiers: tier1(4GB), tier2(6GB), tier3(8GB), tier4(12GB), tier5(14GB), tier6a(16GB), tier6b(24GB), unlimited(48GB+)

## Boundary Testing (profile_inference.py)
- Add --tier-boundary flag to tier-test mode for automated boundary analysis across all VRAM tiers
- Refactor tier test logic into reusable _run_single_tier_test()
- Test three variants per tier: default, no-quant, no-offload
- Smart skipping when tier already disables the tested optimization
- Add _print_boundary_summary() with clear results table

## Boundary Test Results
- No INT8 Quantization: minimum tier2 (6GB), peak 4.91GB
- No CPU Offload: minimum tier3 (8GB), peak 7.30GB

## Handler & UI Updates
- Enhanced model offload/load context management in handler.py
- Updated Gradio UI to expose GPU tier settings
- Updated API server for tier-aware configuration
- Improved nano-vllm model runner compatibility

## Documentation
- Updated GPU_COMPATIBILITY docs (en/zh/ja/ko)
- Updated BENCHMARK docs (en/zh) with tier-boundary CLI reference
- Updated INFERENCE, INSTALL, GRADIO_GUIDE docs across all languages
- Updated README with GPU tier information
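For reviewers who have not opened `acestep/gpu_config.py` yet, the tier selection described above can be pictured roughly as follows. This is a minimal sketch built only from the tier list and the `MAX_CUDA_VRAM` behaviour stated in this summary; the helper names `detect_vram_gb` and `pick_tier`, and the simplified boundary handling, are illustrative and not the module's actual API.

```python
import os

import torch

# Tier ceilings from the PR description: tier1(4GB) ... tier6b(24GB), unlimited(48GB+).
# Exact boundary handling (inclusive vs. exclusive) in gpu_config.py may differ.
TIER_CEILINGS = [
    ("tier1", 4), ("tier2", 6), ("tier3", 8), ("tier4", 12),
    ("tier5", 14), ("tier6a", 16), ("tier6b", 24),
]


def detect_vram_gb() -> float:
    """MAX_CUDA_VRAM (in GB) overrides the physical device size for simulation."""
    override = os.environ.get("MAX_CUDA_VRAM")
    if override:
        return float(override)
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def pick_tier(vram_gb: float) -> str:
    """Return the smallest tier whose ceiling covers the detected VRAM."""
    for name, ceiling in TIER_CEILINGS:
        if vram_gb <= ceiling:
            return name
    return "unlimited"
```

The real GPUConfig then attaches the per-tier defaults (quantization, offload, LM model, batch and duration limits) on top of such a mapping.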
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit review settings.
## Walkthrough
Adds extensive VRAM-aware GPU tiering, adaptive GPUConfig defaults (LM, backend, offload, quantization), LM selection/downgrade rules, VAE/decoding VRAM fallbacks, VRAM simulation and profiling tools (tier-test, profile_vram), Gradio UI tier-aware defaults/limits, and broad docs updates. No model algorithm changes.
### Sequence Diagram(s)
sequenceDiagram
participant UI as Gradio UI
participant API as API Server / Pipeline
participant GPU as gpu_config
participant Disk as Disk models
participant LM as LM Loader
participant VAE as VAE/DiT runtime
UI->>API: init request (init_params)
API->>GPU: probe get_gpu_memory_gb(), get_gpu_tier()
GPU-->>API: GPUConfig (recommended_lm, backend, offload, quantization, limits)
API->>Disk: find_best_lm_model_on_disk(recommended_lm)
Disk-->>API: chosen_model or none
API->>LM: attempt LM init (selected model, backend)
alt LM too large or backend restricted
LM-->>API: fail / downgrade -> API disables LM or selects smaller model
end
API->>UI: return tier-derived UI updates (duration, batch, warnings)
UI->>API: start generation
API->>GPU: estimate_inference_vram()
alt estimated > available
API->>API: _vram_guard_reduce_batch -> adjust batch/duration
API->>VAE: use adaptive chunk size or perform _decode_on_cpu on OOM
end
API-->>UI: generation results or OOM diagnostics
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py (1)
247-260: ⚠️ Potential issue | 🟠 Major
Reserve can be bypassed when `available_for_kv_cache <= 0`.
With MAX_CUDA_VRAM simulation or high `current` usage, the fallback (`free * 0.5`) plus `max(1, …)` still allocates KV cache, defeating the 1 GB reserve and risking immediate OOM. Consider clamping to zero and letting the existing guard raise when there's no headroom.
🛠️ Proposed fix to respect the reserve and fail fast
-        if available_for_kv_cache <= 0:
-            available_for_kv_cache = free * 0.5  # Fallback to 50% of free memory
-
-        config.num_kvcache_blocks = max(1, int(available_for_kv_cache) // block_bytes)
-        if config.num_kvcache_blocks <= 0:
+        if available_for_kv_cache <= 0:
+            available_for_kv_cache = 0
+
+        config.num_kvcache_blocks = int(available_for_kv_cache) // block_bytes
+        if config.num_kvcache_blocks <= 0:
             raise RuntimeError(
                 f"Insufficient GPU memory for KV cache. "
                 f"Free: {free / 1024**3:.2f} GB, Current: {current / 1024**3:.2f} GB, "
                 f"Available for KV: {available_for_kv_cache / 1024**3:.2f} GB, "
                 f"Block size: {block_bytes / 1024**2:.2f} MB"
             )
🤖 Fix all issues with AI agents
In `@acestep/handler.py`:
- Around line 1608-1617: The VRAM guard in _vram_guard_reduce_batch is checking
self.config_path which initialize_service never sets, so base-model detection
never triggers; update the check to use the existing config object (e.g.,
self.config) instead—inspect self.config.is_turbo or other fields on self.config
to determine base vs turbo and multiply per_sample_gb by 2.0 when appropriate;
ensure this logic is applied where per_sample_gb is computed in
_vram_guard_reduce_batch and remove or stop relying on self.config_path, or set
self.config_path during initialize_service if you prefer that pattern.
- Around line 3709-3718: The VRAM auto-check erroneously runs on non-CUDA
backends (MPS/XPU) because get_effective_free_vram_gb() returns 0 when
torch.cuda.is_available() is false, forcing VAE decode to CPU; change the logic
in the generate_music VAE decision block to only call
get_effective_free_vram_gb() and apply the _effective_free < 0.5 gate when
torch.cuda.is_available() is true (i.e., wrap the effective-free-VRAM check in a
cuda-available conditional), while preserving the ACESTEP_VAE_ON_CPU env
override and the _vae_cpu variable behavior so only CUDA devices can auto-enable
CPU VAE decode.
In `@acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py`:
- Around line 269-282: The f-string log in model_runner.py uses a Unicode
multiplication sign (×) which triggers RUF001 and can cause copy/paste/terminal
issues; update the print statement that formats KV cache info (the one
referencing config.num_kvcache_blocks, self.block_size, max_tokens_capacity,
kv_cache_size_gb, free, current, target_total_usage, block_bytes, post_kv_free)
to replace the Unicode "×" with a plain ASCII "x" character so the message
becomes e.g. "{config.num_kvcache_blocks} blocks x {self.block_size} tokens =
..." while keeping the rest of the formatting unchanged.
In `@docs/en/ace_step_musicians_guide.md`:
- Around line 157-160: Update the enthusiast tier entry so its batch-size range
follows the tier progression: locate the line containing "16-20 GB (enthusiast)"
and the phrase "1-4 songs at a time" and change it to "2-4 songs at a time"
(keeping the rest of the text, e.g., "Songs up to 10 minutes" and "Larger
Songwriter brain (1.7B)" unchanged) so the lower bound is consistent with the
8-12GB and 12-16GB tiers.
In `@docs/en/BENCHMARK.md`:
- Around line 160-223: The sample output code fence under the "tier-test"
section (the "TIER TEST RESULTS" block) lacks a language tag; update the opening
fence from ``` to ```text to satisfy MD040 and Markdown linting, leaving the
fence contents and closing ``` unchanged so the block is explicitly marked as
plain text.
In `@docs/zh/BENCHMARK.md`:
- Around line 187-199: The fenced code block that begins with "TIER TEST
RESULTS" is missing a language specifier; update the opening fence from ``` to
```text (or another appropriate spec like ```console) so syntax highlighters and
accessibility tools recognize it—modify the code block in the
docs/zh/BENCHMARK.md content around the "TIER TEST RESULTS" section to include
the language tag on the opening backticks.
In `@profile_inference.py`:
- Line 1072: The print call using an unnecessary f-string should be changed to a
regular string: locate the statement print(f"\n --- Variant: default ---") in
profile_inference.py and remove the leading f so it becomes print("\n ---
Variant: default ---"); no other behavior changes are needed.
- Around line 1092-1109: Fix the unnecessary f-string prefixes on print
statements that have no interpolations: replace print(f"...") with print("...")
for the messages around the "no-quant" and "no-offload" variants in the block
that calls _run_single_tier_test; specifically update the print calls that
reference the no-quant and no-offload messages which use
gpu_config.quantization_default and gpu_config.offload_to_cpu_default to
determine skips so they are ordinary string literals instead of f-strings.
- Around line 850-853: Remove the now-unnecessary noqa by deleting the "# noqa:
F401" comment on the import flash_attn line in the try block (the import and the
subsequent use of use_flash_attention already satisfy linter rules), i.e.,
update the import flash_attn statement so it no longer includes the noqa
directive.
- Around line 1276-1286: The two print calls that use f-strings without
placeholders should be regular strings: in the block that references
capability_name, failing, and passing (the one that prints the boundary summary
and returns passing[0] if any), replace print(f" ❌ No tier passed this test.
All tested tiers failed.") and print(f" ⚠️ No test results available for this
capability.") with print(" ❌ No tier passed this test. All tested tiers
failed.") and print(" ⚠️ No test results available for this capability.")
respectively so the unnecessary f-string prefixes are removed.
In `@scripts/profile_vram.py`:
- Around line 318-321: The code builds encoder_path =
os.path.join(checkpoint_dir, "text_encoder") which doesn't match the runtime
checkpoint name (e.g., "Qwen3-Embedding-0.6B"), so profiling can skip the
encoder; update the logic in scripts/profile_vram.py around encoder_path to try
the runtime checkpoint name as a fallback (check for
os.path.exists(os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")) if the
"text_encoder" path is missing) and only return {} after both attempts fail, or
prefer the runtime-named directory when present; ensure references to
encoder_path, checkpoint_dir and the literal names ("text_encoder",
"Qwen3-Embedding-0.6B") are used so the handler and this script align.
- Around line 165-183: The current DiT profiling only allocates and deletes
dummy tensors (noise, text_hidden, text_mask) and never executes the model, so
peak memory misses activation usage; replace the no-op block with a minimal
forward pass by calling the DiT model (e.g., model(noise, text_hidden,
text_mask) or model.forward(...)) inside the torch.inference_mode() context so
activations are allocated and measured, and when has_cfg is true duplicate the
inputs (noise_cfg, text_hidden_cfg, text_mask_cfg) and pass the doubled batch to
the model to simulate classifier-free guidance; alternatively, if you
intentionally only want to measure input allocation, rename peak_inference_gb to
peak_input_allocation_gb to reflect the narrower measurement.
🧹 Nitpick comments (7)
acestep/acestep_v15_pipeline.py (1)
212-223: String replacement for model downgrade is brittle. The model path downgrade using `replace("4B", "1.7B")` assumes a specific naming pattern. If a model path contains "4B" elsewhere (e.g., in a directory name or version suffix), this could produce unexpected results. Consider using a more robust approach that validates the replacement actually targets the model size portion of the path:
🛡️ Suggested improvement
 if args.lm_model_path and 0 < gpu_memory_gb < VRAM_AUTO_OFFLOAD_THRESHOLD_GB:
     if "4B" in args.lm_model_path:
-        # Downgrade to 1.7B if available
-        fallback = args.lm_model_path.replace("4B", "1.7B")
+        # Downgrade to 1.7B if available - only replace in model name portion
+        import re
+        # Match "4B" that appears to be a model size (preceded by - or lm-)
+        fallback = re.sub(r'(lm-|-)4B\b', r'\g<1>1.7B', args.lm_model_path)
+        if fallback == args.lm_model_path:
+            # Fallback didn't change anything meaningful, skip downgrade warning
+            fallback = None

acestep/gradio_ui/events/generation_handlers.py (1)
517-548: Minor: Duplicate `get_global_gpu_config()` call.
`get_global_gpu_config()` is called at line 450 and again at line 518. Since the GPU config is a singleton that doesn't change during initialization, you could reuse the earlier reference. This is a very minor optimization and doesn't affect correctness.
docs/zh/GPU_COMPATIBILITY.md (1)
141-149: Add language specifier to fenced code block. The code block showing the boundary analysis output is missing a language specifier. Since this is plain text output, use `text` or `plaintext` to satisfy the markdown linter.
📝 Proposed fix
-```
+```text
 BOUNDARY ANALYSIS
 =================

acestep/gradio_ui/interfaces/generation.py (1)
174-181: Consider disabling the LM checkbox for unsupported tiers. The info text warns that LM is unavailable for low-VRAM tiers, but the checkbox remains interactive. Consider setting `interactive=False` when `gpu_config.available_lm_models` is empty to prevent users from enabling a non-functional feature.
♻️ Proposed enhancement
+    lm_interactive = bool(gpu_config.available_lm_models)
     init_llm_checkbox = gr.Checkbox(
         label=t("service.init_llm_label"),
         value=init_llm_value,
         info=lm_info_text,
+        interactive=lm_interactive,
     )

profile_inference.py (1)
1126-1152: Consider logging exceptions during handler cleanup instead of silent pass. The `try-except-pass` pattern silently swallows all exceptions during cleanup. While cleanup should be resilient, logging at DEBUG level helps diagnose issues during development without cluttering normal output.
♻️ Proposed enhancement
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _cleanup_handlers(dit_handler, llm_handler):
     """Clean up handlers and free GPU memory."""
     try:
         if dit_handler is not None:
             if hasattr(dit_handler, 'model') and dit_handler.model is not None:
                 dit_handler.model = None
             if hasattr(dit_handler, 'vae') and dit_handler.vae is not None:
                 dit_handler.vae = None
             if hasattr(dit_handler, 'text_encoder') and dit_handler.text_encoder is not None:
                 dit_handler.text_encoder = None
             del dit_handler
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug("DiT handler cleanup error (non-fatal): %s", e)

     try:
         if llm_handler is not None:
             if hasattr(llm_handler, 'llm') and llm_handler.llm is not None:
                 llm_handler.llm = None
             del llm_handler
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug("LLM handler cleanup error (non-fatal): %s", e)

acestep/handler.py (1)
1581-1586: Remove or use unused `use_lm` parameter.
`use_lm` is unused and triggers lint warnings. Either wire it into the estimate (LM overhead) or drop it from the signature.
🛠️ Proposed fix (remove if unused)
-        audio_duration: Optional[float] = None,
-        use_lm: bool = False,
+        audio_duration: Optional[float] = None,

acestep/gpu_config.py (1)
792-805: Ensure adaptive recommended LM is actually available.
`compute_adaptive_config` picks `recommended_lm_model` from tier defaults even when the VRAM-budgeted `available_lm_models` list is smaller. That can recommend a model that doesn't fit the computed budget. Consider clamping to the largest available model when the tier default isn't in `available_lm_models`.
🛠️ Proposed fix
-    return GPUConfig(
+    recommended_model = tier_config.get("recommended_lm_model", "")
+    if recommended_model not in available_lm_models:
+        recommended_model = available_lm_models[-1] if available_lm_models else ""
+    return GPUConfig(
         tier=tier,
         gpu_memory_gb=total_vram_gb,
         max_duration_with_lm=max_dur_lm,
         max_duration_without_lm=max_dur_no_lm,
         max_batch_size_with_lm=max_batch_with_lm,
         max_batch_size_without_lm=max_batch_no_lm,
         init_lm_default=bool(available_lm_models),
         available_lm_models=available_lm_models,
-        recommended_lm_model=tier_config.get("recommended_lm_model", available_lm_models[0] if available_lm_models else ""),
+        recommended_lm_model=recommended_model,
         lm_backend_restriction=tier_config.get("lm_backend_restriction", "all"),
         recommended_backend=tier_config.get("recommended_backend", "vllm"),
         offload_to_cpu_default=tier_config.get("offload_to_cpu_default", True),
         offload_dit_to_cpu_default=tier_config.get("offload_dit_to_cpu_default", True),
         quantization_default=tier_config.get("quantization_default", True),
         compile_model_default=tier_config.get("compile_model_default", True),
         lm_memory_gb=lm_memory_gb,
     )
        # Estimate per-sample activation cost for DiT
        duration_sec = float(audio_duration) if audio_duration and float(audio_duration) > 0 else 60.0
        # Empirical: ~0.8 GB per sample at 60s, linear scaling
        per_sample_gb = 0.8 * (duration_sec / 60.0)
        # If using cfg (base model), double the per-sample cost
        if hasattr(self, 'model') and self.model is not None:
            model_name = getattr(self, 'config_path', '') or ''
            if 'base' in model_name.lower():
                per_sample_gb *= 2.0
Base-model detection in VRAM guard never triggers.
_vram_guard_reduce_batch checks self.config_path, but initialize_service never sets it. That means base models won’t double the per-sample estimate, so the guard can allow oversized batches and still OOM. Consider using self.config.is_turbo (or storing config_path during init) instead.
🛠️ Proposed fix (use config instead of config_path)
- if hasattr(self, 'model') and self.model is not None:
- model_name = getattr(self, 'config_path', '') or ''
- if 'base' in model_name.lower():
- per_sample_gb *= 2.0
+ if self.model is not None and self.config is not None:
+ if not getattr(self.config, "is_turbo", False):
+ per_sample_gb *= 2.0
        # Check effective free VRAM and auto-enable CPU decode if extremely tight
        import os as _os
        _vae_cpu = _os.environ.get("ACESTEP_VAE_ON_CPU", "0").lower() in ("1", "true", "yes")
        if not _vae_cpu:
            _effective_free = get_effective_free_vram_gb()
            logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
            # If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
            if _effective_free < 0.5:
                logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
                _vae_cpu = True
🧩 Analysis chain
🏁 Script executed:
# Find the get_effective_free_vram_gb function definition
rg "def get_effective_free_vram_gb" --type py -A 15
Repository: ace-step/ACE-Step-1.5
Length of output: 1003

🏁 Script executed:
# Also search for where this function is imported or used
rg "get_effective_free_vram_gb" --type py | head -20
Repository: ace-step/ACE-Step-1.5
Length of output: 733

🏁 Script executed:
# Check if there are device-specific memory utilities (e.g., for MPS)
rg "mps|MPS" acestep/handler.py -B 2 -A 2 | head -50
Repository: ace-step/ACE-Step-1.5
Length of output: 3029

🏁 Script executed:
# Inspect the context around lines 3709-3718 to understand self.device
sed -n '3680,3730p' acestep/handler.py
Repository: ace-step/ACE-Step-1.5
Length of output: 3186

🏁 Script executed:
# Get the full implementation of get_effective_free_vram_gb
rg "def get_effective_free_vram_gb" --type py -A 50
Repository: ace-step/ACE-Step-1.5
Length of output: 3404

🏁 Script executed:
# Also check if there are any MPS memory utilities in the codebase
rg "torch.backends.mps|torch.mps" --type py | grep -i memory
Repository: ace-step/ACE-Step-1.5
Length of output: 257
Gate VRAM check to CUDA devices only—otherwise MPS/XPU are forced to CPU decode.
get_effective_free_vram_gb() immediately returns 0 when torch.cuda.is_available() is False, so on MPS and XPU devices, _effective_free < 0.5 always evaluates true and unconditionally forces VAE decode to CPU. This degrades performance on systems with only MPS or XPU acceleration.
Gate the memory check to CUDA devices since the function is CUDA-specific:
🛠️ Proposed fix (gate by CUDA)
- if not _vae_cpu:
- _effective_free = get_effective_free_vram_gb()
- logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
- # If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
- if _effective_free < 0.5:
- logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
- _vae_cpu = True
+ is_cuda = self.device == "cuda" or (isinstance(self.device, str) and self.device.startswith("cuda"))
+ if not _vae_cpu and is_cuda:
+ _effective_free = get_effective_free_vram_gb()
+ logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
+ if _effective_free < 0.5:
+ logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
+ _vae_cpu = True
        # If KV cache would leave less than 1 GB free, warn and suggest reducing max_model_len
        post_kv_free = (free - config.num_kvcache_blocks * block_bytes) / 1024**3
        if post_kv_free < 1.0:
            print(
                f"[nanovllm] WARNING: After KV cache allocation, only {post_kv_free:.2f} GB free. "
                f"DiT inference may OOM. Consider reducing max_model_len or using CPU offload."
            )

        print(
            f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
            f"{max_tokens_capacity} tokens capacity, {kv_cache_size_gb:.2f} GB "
            f"(free: {free / 1024**3:.2f} GB, used: {current / 1024**3:.2f} GB, "
            f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB)"
            f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB, "
            f"post_kv_free: {post_kv_free:.2f} GB)"
Replace the Unicode multiplication sign in the log line.
It triggers RUF001 and can cause copy/paste issues in terminals—use plain x.
✏️ Suggested tweak
- f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
+ f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks x {self.block_size} tokens = "
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 278-278: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
16-20 GB (enthusiast)    Songs up to 10 minutes
▓▓▓▓▓▓▓▓▓░░░░░░░░░░░     1-4 songs at a time
                         Larger Songwriter brain (1.7B)
Batch size lower bound inconsistent with tier progression.
The enthusiast tier (16-20 GB) shows "1-4 songs at a time", but lower tiers (mainstream at 8-12GB and sweet spot at 12-16GB) already support "2-4 songs at a time". The lower bound of 1 for a higher tier doesn't follow a logical progression.
📝 Suggested correction
- 16-20 GB (enthusiast) Songs up to 10 minutes
- ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 1-4 songs at a time
+ 16-20 GB (enthusiast) Songs up to 10 minutes
+ ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 2-4 songs at a time
### 6. `tier-test` — Automated GPU Tier Testing

Automatically simulates different GPU VRAM sizes using `MAX_CUDA_VRAM` and runs a generation test at each tier. This is the recommended way to validate that all GPU tiers work correctly after modifying `acestep/gpu_config.py`.

```bash
# Test all tiers (4, 6, 8, 12, 16, 20, 24 GB)
python profile_inference.py --mode tier-test

# Test specific VRAM sizes
python profile_inference.py --mode tier-test --tiers 6 8 16

# Test with LM enabled (where the tier supports it)
python profile_inference.py --mode tier-test --tier-with-lm

# Quick test: skip torch.compile for non-quantized tiers
python profile_inference.py --mode tier-test --tier-skip-compile
```

**What it validates per tier:**
- Correct tier detection and `GPUConfig` construction
- Model initialization (DiT, VAE, Text Encoder, optionally LM)
- A short generation run (30s duration, batch=1) completes without OOM
- Adaptive VAE decode fallback (GPU → CPU offload → full CPU)
- VRAM usage stays within the simulated limit

**Output example:**

```
TIER TEST RESULTS
====================================================================================================
VRAM   Tier       LM     Duration   Status   Peak VRAM   Notes
──────────────────────────────────────────────────────────────────────────────
4GB    tier1      —      30s        ✅ OK     3.8GB       VAE decoded on CPU
6GB    tier2      —      30s        ✅ OK     5.4GB       Tiled VAE chunk=256
8GB    tier4      0.6B   30s        ✅ OK     7.2GB       vllm backend
12GB   tier5      1.7B   30s        ✅ OK     10.8GB      vllm backend
16GB   tier6a     1.7B   30s        ✅ OK     14.5GB      offload enabled
20GB   tier6b     1.7B   30s        ✅ OK     17.2GB      no offload
24GB   unlimited  4B     30s        ✅ OK     21.3GB      full models on GPU
```

> **Note**: `tier-test` mode uses `torch.cuda.set_per_process_memory_fraction()` to enforce a hard VRAM cap, making simulations realistic even on high-end GPUs (e.g., A100 80GB).

#### Boundary Testing

Use `--tier-boundary` to find the minimum VRAM tier at which INT8 quantization and CPU offload can be safely disabled. For each tier, up to three configurations are tested:

1. **default** — tier's standard settings
2. **no-quant** — quantization disabled, offload unchanged
3. **no-offload** — no quantization AND no CPU offload

```bash
# Run boundary tests across all tiers
python profile_inference.py --mode tier-test --tier-boundary

# Boundary test with LM enabled
python profile_inference.py --mode tier-test --tier-boundary --tier-with-lm

# Save boundary results to JSON
python profile_inference.py --mode tier-test --tier-boundary --benchmark-output boundary_results.json
```

The output includes a **Boundary Analysis** summary showing the minimum tier for each capability.
Add a language to the tier-test output code fence.
This fixes MD040 and keeps Markdown lint clean.
✏️ Suggested fix
-```
+```text
TIER TEST RESULTS
====================================================================================================
...
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 187-187: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
        print(f"  max_batch_without_lm: {gpu_config.max_batch_size_without_lm}")

        # ---- Test 1: Default configuration ----
        print(f"\n  --- Variant: default ---")
Remove extraneous f-string prefix.
This string has no placeholders but uses an f-string prefix.
🔧 Proposed fix
- print(f"\n --- Variant: default ---")
+ print("\n --- Variant: default ---")
🧰 Tools
🪛 Ruff (0.14.14)
[error] 1072-1072: f-string without any placeholders
Remove extraneous f prefix
(F541)
        else:
            print(f"\n  --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")

        # ---- Test 3: No quantization AND no offload ----
        # Skip if the tier already has both disabled
        if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
            print(f"\n  --- Variant: no-offload (quant=None, offload=False) ---")
            result_no_offload = _run_single_tier_test(
                sim_gb, gpu_config, args, example_data,
                checkpoint_dir, disk_lm_models,
                offload_override=False,
                offload_dit_override=False,
                quantization_override=None,
                test_variant="no-offload",
            )
            all_results.append(result_no_offload)
        else:
            print(f"\n  --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
Remove extraneous f-string prefixes.
Several print statements use f-strings without any placeholders.
🔧 Proposed fix
else:
- print(f"\n --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
+ print("\n --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
# ---- Test 3: No quantization AND no offload ----
# Skip if the tier already has both disabled
if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
- print(f"\n --- Variant: no-offload (quant=None, offload=False) ---")
+ print("\n --- Variant: no-offload (quant=None, offload=False) ---")
result_no_offload = _run_single_tier_test(
...
)
all_results.append(result_no_offload)
else:
- print(f"\n --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
+ print("\n --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
🧰 Tools
🪛 Ruff (0.14.14)
[error] 1093-1093: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 1098-1098: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 1109-1109: f-string without any placeholders
Remove extraneous f prefix
(F541)
        if failing:
            print(f"  {capability_name}:")
            print(f"    ❌ No tier passed this test. All tested tiers failed.")
            for r in failing:
                err = (r.get("error") or "unknown")[:50]
                print(f"      {r['tier_gb']}GB ({r['tier']}): {err}")
        else:
            print(f"  {capability_name}:")
            print(f"    ⚠️ No test results available for this capability.")
    print()
    return passing[0] if passing else None
Remove extraneous f-string prefixes in boundary summary.
Two print statements use f-strings without placeholders.
🔧 Proposed fix
if failing:
- print(f" {capability_name}:")
- print(f" ❌ No tier passed this test. All tested tiers failed.")
+ print(f" {capability_name}:") # This one is fine, has placeholder
+ print(" ❌ No tier passed this test. All tested tiers failed.")
for r in failing:
err = (r.get("error") or "unknown")[:50]
print(f" {r['tier_gb']}GB ({r['tier']}): {err}")
else:
print(f" {capability_name}:")
- print(f" ⚠️ No test results available for this capability.")
+ print(" ⚠️ No test results available for this capability.")
🧰 Tools
🪛 Ruff (0.14.14)
[error] 1278-1278: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 1284-1284: f-string without any placeholders
Remove extraneous f prefix
(F541)
    try:
        with torch.inference_mode():
            # Simulate DiT inference inputs
            # Create dummy latent noise
            noise = torch.randn(batch_size, latent_length, latent_dim, device="cuda", dtype=torch.bfloat16)

            # Simulate text encoder output
            text_hidden = torch.randn(batch_size, 512, 768, device="cuda", dtype=torch.bfloat16)
            text_mask = torch.ones(batch_size, 512, device="cuda", dtype=torch.long)

            # If has CFG, double the batch for classifier-free guidance
            if has_cfg:
                noise_cfg = torch.cat([noise, noise], dim=0)
                text_hidden_cfg = torch.cat([text_hidden, text_hidden], dim=0)
                text_mask_cfg = torch.cat([text_mask, text_mask], dim=0)
                del noise_cfg, text_hidden_cfg, text_mask_cfg

            del noise, text_hidden, text_mask
            torch.cuda.synchronize()
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the file and understand its context
wc -l scripts/profile_vram.py
Repository: ace-step/ACE-Step-1.5
Length of output: 92

🏁 Script executed:
# Let's read the file to understand the full context around lines 165-183
cat -n scripts/profile_vram.py | sed -n '150,200p'
Repository: ace-step/ACE-Step-1.5
Length of output: 2780

🏁 Script executed:
# Let's also see the beginning of the function that contains this code
cat -n scripts/profile_vram.py | sed -n '1,50p'
Repository: ace-step/ACE-Step-1.5
Length of output: 2044

🏁 Script executed:
# Let's see the function that contains this code - find where it starts
cat -n scripts/profile_vram.py | sed -n '100,165p'
Repository: ace-step/ACE-Step-1.5
Length of output: 2778

🏁 Script executed:
# Search for model.forward or model( calls in the file
rg -n "model\.|\.forward\(|\.generate\(" scripts/profile_vram.py
Repository: ace-step/ACE-Step-1.5
Length of output: 347

🏁 Script executed:
# Let's see the complete try-except block to verify if model is called anywhere
cat -n scripts/profile_vram.py | sed -n '165,210p'
Repository: ace-step/ACE-Step-1.5
Length of output: 2706
DiT "inference" profiling doesn't run the model.
The loop only allocates dummy tensors (lines 169–182) and deletes them; it never executes a forward pass. The peak memory measurement (line 186) captures only input tensor allocation, excluding all activation memory from the actual model computation. This will severely under-report peak VRAM usage and mis-calibrate GPU tier selection. Either run a minimal forward pass with model(noise, text_hidden, text_mask) to include activations, or rename peak_inference_gb to peak_input_allocation_gb to clarify the scope.
    encoder_path = os.path.join(checkpoint_dir, "text_encoder")
    if not os.path.exists(encoder_path):
        print(f"  Text encoder not found: {encoder_path}")
        return {}
Text encoder path doesn’t match runtime checkpoints.
The handler loads the encoder from Qwen3-Embedding-0.6B, but this script looks for text_encoder, so profiling will likely skip it. Align the path or add a fallback.
🛠️ Proposed fix (use runtime path with fallback)
- encoder_path = os.path.join(checkpoint_dir, "text_encoder")
- if not os.path.exists(encoder_path):
+ encoder_path = os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")
+ if not os.path.exists(encoder_path):
+ encoder_path = os.path.join(checkpoint_dir, "text_encoder")
+ if not os.path.exists(encoder_path):
print(f" Text encoder not found: {encoder_path}")
return {}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| encoder_path = os.path.join(checkpoint_dir, "text_encoder") | |
| if not os.path.exists(encoder_path): | |
| print(f" Text encoder not found: {encoder_path}") | |
| return {} | |
| encoder_path = os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B") | |
| if not os.path.exists(encoder_path): | |
| encoder_path = os.path.join(checkpoint_dir, "text_encoder") | |
| if not os.path.exists(encoder_path): | |
| print(f" Text encoder not found: {encoder_path}") | |
| return {} |
…//github.com/ace-step/ACE-Step-1.5 into feat/gpu-compatibility-tier-boundary-testing
Root cause: tier6a (16-20GB) had max_batch_size_with_lm=1, which was overly conservative. Empirical testing on 16GB (simulated) showed:
- Without LM: batch=4 uses 13.3GB, batch=7 uses 13.4GB (all fit in 16GB)
- With LM (1.7B): batch=2 uses 11.9GB, batch=4 fits within 16GB budget

Changes:
- tier6a: max_batch_size_with_lm 1→4, max_batch_size_without_lm 4→8
- tier6b: max_batch_size_with_lm 2→4 (20-24GB has ample headroom)
- Added --tier-batch-boundary flag to profile_inference.py for automated batch size boundary testing (escalates 1,2,4,8 with LM and without LM)
- Added GPU tier config patching during batch tests to bypass inference.py batch clamping
- Updated GPU_COMPATIBILITY docs (en/zh/ja/ko) and BENCHMARK docs (en/zh) with corrected batch limits and new batch boundary testing instructions
- Updated tests to match new batch size expectations
…bled
Root cause: _vram_guard_reduce_batch checks free VRAM *before* DiT runs, but at that point the vllm LM model (weights + KV cache) is still on GPU. On a 16GB GPU with the 1.7B LM loaded, only ~7.6GB appears free, causing the guard to slash batch_size from 4 to 1 — even though the LM will be offloaded before DiT actually needs the memory.

Fix: When offload_to_cpu=True, trust the static GPU tier config limits (which were empirically validated with offload enabled) instead of the misleading instantaneous free VRAM reading. If batch_size <= the tier's max_batch_size_with_lm, skip the dynamic VRAM check entirely.

This fixes the bug where users with 16GB GPUs saw the LM generate 4 audio codes but the DiT only produced 1 output.
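Read as pseudocode, the guard change described in this commit amounts to something like the sketch below. The helper name `clamp_batch_size` and its parameters are placeholders for the logic inside `_vram_guard_reduce_batch`, not the actual signature.

```python
def clamp_batch_size(requested: int, free_vram_gb: float, per_sample_gb: float,
                     offload_to_cpu: bool, tier_max_batch_with_lm: int) -> int:
    """Sketch: with CPU offload enabled, trust the empirically validated tier limit,
    because the instantaneous free-VRAM reading still counts the LM that will be
    offloaded before the DiT actually runs."""
    if offload_to_cpu and requested <= tier_max_batch_with_lm:
        return requested  # static tier config wins over the misleading snapshot
    fits = max(1, int(free_vram_gb // per_sample_gb)) if per_sample_gb > 0 else requested
    return min(requested, fits)
```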
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@docs/en/GPU_COMPATIBILITY.md`:
- Around line 141-149: The fenced code block showing the "BOUNDARY ANALYSIS"
table lacks a language specifier; update the opening fence that precedes the
BOUNDARY ANALYSIS block from ``` to ```text so the block is explicitly marked as
plain text (look for the lines containing the literal "BOUNDARY ANALYSIS" and
the surrounding triple-backtick fence and change the opening fence accordingly).
- Around line 79-83: Update the tier labels for the GPU simulation examples:
change the comment "Simulate an 8GB GPU (Tier 4)" to "Simulate an 8GB GPU (Tier
3)" and change "Simulate a 12GB GPU (Tier 5)" to "Simulate a 12GB GPU (Tier 4)"
for the lines with MAX_CUDA_VRAM=8 uv run acestep and MAX_CUDA_VRAM=12 uv run
acestep so the examples match the mapping (≤8GB → tier3, ≤12GB → tier4).
In `@docs/ja/GPU_COMPATIBILITY.md`:
- Around line 79-83: Update the comment labels for the GPU simulation examples:
change the "8GB GPU (Tier 4) をシミュレート" comment to "8GB GPU (Tier 3) をシミュレート" and
change the "12GB GPU (Tier 5) をシミュレート" comment to "12GB GPU (Tier 4) をシミュレート" so
the comments match the tier mapping for the MAX_CUDA_VRAM examples (the lines
using MAX_CUDA_VRAM=8 and MAX_CUDA_VRAM=12 before running "uv run acestep").
In `@docs/ko/GPU_COMPATIBILITY.md`:
- Around line 79-83: The tier labels are incorrect for the 8GB/12GB examples;
update the headings for the examples shown (the commented lines above the
commands using MAX_CUDA_VRAM and uv run acestep) so the 8GB example reads "8GB
GPU 시뮬레이션 (티어 3)" and the 12GB example reads "12GB GPU 시뮬레이션 (티어 4)" to match
the mapping ≤8GB → tier3 and ≤12GB → tier4.
In `@docs/zh/BENCHMARK.md`:
- Around line 165-166: Update the "测试所有等级" section so the listed tiers match the
actual default tiers used by profile_inference.py; replace the current list
(which shows 20GB and omits 48GB) with the real defaults (include 48GB and
remove 20GB), or add a short note stating that 20GB is not a default and must be
provided via the --tiers flag; refer to the "python profile_inference.py --mode
tier-test" invocation and the default tier list in the script when making the
change.
In `@docs/zh/GPU_COMPATIBILITY.md`:
- Around line 141-149: The fenced code block that starts with the "BOUNDARY
ANALYSIS" header should include a language specifier to ensure proper rendering;
update the opening triple-backtick for the block containing "BOUNDARY ANALYSIS"
and the table (the block that currently begins with ``` and the header line
"BOUNDARY ANALYSIS") to use a language tag (e.g., change ``` to ```text) so the
table renders as plain text in documentation.
- Around line 79-83: Update the tier labels in the examples: change the heading
"模拟 8GB GPU (Tier 4)" to "模拟 8GB GPU (Tier 3)" and change "模拟 12GB GPU (Tier 5)"
to "模拟 12GB GPU (Tier 4)"; keep the example commands (MAX_CUDA_VRAM=8 uv run
acestep and MAX_CUDA_VRAM=12 uv run acestep) unchanged and ensure the
documentation reflects the mapping ≤8GB → tier3 and ≤12GB → tier4.
In `@profile_inference.py`:
- Around line 789-798: The code may pick a disk-only LM that is too large for
the current tier because disk_lm_models are not filtered by tier size; before
calling find_best_lm_model_on_disk, filter disk_lm_models to only include models
whose size is compatible with the current tier (use the tier variable and
gpu_config-recommended sizing rules) and then pass that filtered list to
find_best_lm_model_on_disk (keep references to lm_model, use_lm,
find_best_lm_model_on_disk, disk_lm_models, gpu_config.recommended_lm_model and
tier so the change is easy to locate).
- Around line 815-823: The current CUDA memory-fraction logic only sets a
reduced fraction when sim_gb < physical VRAM, but doesn't reset the per-process
cap when sim_gb >= physical VRAM, leaving a prior smaller cap in place; update
the block in profile_inference.py that checks torch.cuda.is_available() (the
code using torch.cuda.get_device_properties, total_gb, sim_gb and
torch.cuda.set_per_process_memory_fraction) so that when sim_gb >= total_gb you
explicitly call torch.cuda.set_per_process_memory_fraction(1.0) to clear any
previous cap; retain the existing reduced-fraction calculation path for sim_gb <
total_gb.
```bash
# Simulate an 8GB GPU (Tier 4)
MAX_CUDA_VRAM=8 uv run acestep

# Simulate a 12GB GPU (Tier 5)
MAX_CUDA_VRAM=12 uv run acestep
```
Correct tier labels for 8GB/12GB simulation examples.
Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.
📝 Suggested update

```diff
-# Simulate an 8GB GPU (Tier 4)
+# Simulate an 8GB GPU (Tier 3)
-# Simulate a 12GB GPU (Tier 5)
+# Simulate a 12GB GPU (Tier 4)
```
🤖 Prompt for AI Agents
In `@docs/en/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the tier labels
for the GPU simulation examples: change the comment "Simulate an 8GB GPU (Tier
4)" to "Simulate an 8GB GPU (Tier 3)" and change "Simulate a 12GB GPU (Tier 5)"
to "Simulate a 12GB GPU (Tier 4)" for the lines with MAX_CUDA_VRAM=8 uv run
acestep and MAX_CUDA_VRAM=12 uv run acestep so the examples match the mapping
(≤8GB → tier3, ≤12GB → tier4).
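For readers following the tier-label corrections in this thread, here is a minimal sketch of the VRAM→tier mapping they assume, using the tier ranges summarized later in this PR (tier1 ≤4GB … unlimited ≥24GB). The function name and the inclusive boundary handling are illustrative; this is not the actual `acestep/gpu_config.py` implementation.

```python
def pick_tier(vram_gb: float) -> str:
    """Illustrative VRAM -> tier mapping based on the tier summary in this PR."""
    if vram_gb <= 4:
        return "tier1"
    if vram_gb <= 6:
        return "tier2"
    if vram_gb <= 8:
        return "tier3"   # MAX_CUDA_VRAM=8 lands here, not tier4
    if vram_gb <= 12:
        return "tier4"   # MAX_CUDA_VRAM=12 lands here, not tier5
    if vram_gb <= 16:
        return "tier5"
    if vram_gb <= 20:
        return "tier6a"
    if vram_gb <= 24:
        return "tier6b"
    return "unlimited"

assert pick_tier(8) == "tier3" and pick_tier(12) == "tier4"
```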
````text
```
BOUNDARY ANALYSIS
=================
Capability                             Min Tier    VRAM
------------------------------------------------------------
No INT8 Quantization                   tier6b      20GB
No CPU Offload (all models on GPU)     tier6b      20GB
------------------------------------------------------------
```
````
Add a language specifier to the boundary-analysis output block.
📝 Suggested update

````diff
-```
+```text
 BOUNDARY ANALYSIS
 =================
 Capability                             Min Tier    VRAM
````

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 141-141: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@docs/en/GPU_COMPATIBILITY.md` around lines 141 - 149, The fenced code block
showing the "BOUNDARY ANALYSIS" table lacks a language specifier; update the
opening fence that precedes the BOUNDARY ANALYSIS block from ``` to ```text so
the block is explicitly marked as plain text (look for the lines containing the
literal "BOUNDARY ANALYSIS" and the surrounding triple-backtick fence and change
the opening fence accordingly).
```bash
# 8GB GPU (Tier 4) をシミュレート
MAX_CUDA_VRAM=8 uv run acestep

# 12GB GPU (Tier 5) をシミュレート
MAX_CUDA_VRAM=12 uv run acestep
```
Correct tier labels for 8GB/12GB simulation examples.
Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.
📝 Suggested update

```diff
-# 8GB GPU (Tier 4) をシミュレート
+# 8GB GPU (Tier 3) をシミュレート
-# 12GB GPU (Tier 5) をシミュレート
+# 12GB GPU (Tier 4) をシミュレート
```
🤖 Prompt for AI Agents
In `@docs/ja/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the comment labels
for the GPU simulation examples: change the "8GB GPU (Tier 4) をシミュレート" comment
to "8GB GPU (Tier 3) をシミュレート" and change the "12GB GPU (Tier 5) をシミュレート" comment
to "12GB GPU (Tier 4) をシミュレート" so the comments match the tier mapping for the
MAX_CUDA_VRAM examples (the lines using MAX_CUDA_VRAM=8 and MAX_CUDA_VRAM=12
before running "uv run acestep").
```bash
# 8GB GPU 시뮬레이션 (티어 4)
MAX_CUDA_VRAM=8 uv run acestep

# 12GB GPU 시뮬레이션 (티어 5)
MAX_CUDA_VRAM=12 uv run acestep
```
Correct tier labels for 8GB/12GB simulation examples.
Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.
📝 Suggested update

```diff
-# 8GB GPU 시뮬레이션 (티어 4)
+# 8GB GPU 시뮬레이션 (티어 3)
-# 12GB GPU 시뮬레이션 (티어 5)
+# 12GB GPU 시뮬레이션 (티어 4)
```
🤖 Prompt for AI Agents
In `@docs/ko/GPU_COMPATIBILITY.md` around lines 79 - 83, The tier labels are
incorrect for the 8GB/12GB examples; update the headings for the examples shown
(the commented lines above the commands using MAX_CUDA_VRAM and uv run acestep)
so the 8GB example reads "8GB GPU 시뮬레이션 (티어 3)" and the 12GB example reads "12GB
GPU 시뮬레이션 (티어 4)" to match the mapping ≤8GB → tier3 and ≤12GB → tier4.
```bash
# 测试所有等级 (4, 6, 8, 12, 16, 20, 24 GB)
python profile_inference.py --mode tier-test
```
Align the “test all tiers” list with actual tier-test defaults.
The default tier list in code includes 48GB and does not include 20GB; the doc should match or explicitly note that 20GB is only included via --tiers.
📝 Suggested update

```diff
-# 测试所有等级 (4, 6, 8, 12, 16, 20, 24 GB)
+# 测试所有等级 (4, 6, 8, 12, 16, 24, 48 GB)
```
🤖 Prompt for AI Agents
In `@docs/zh/BENCHMARK.md` around lines 165 - 166, Update the "测试所有等级" section so
the listed tiers match the actual default tiers used by profile_inference.py;
replace the current list (which shows 20GB and omits 48GB) with the real
defaults (include 48GB and remove 20GB), or add a short note stating that 20GB
is not a default and must be provided via the --tiers flag; refer to the "python
profile_inference.py --mode tier-test" invocation and the default tier list in
the script when making the change.
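As an aside, a hedged sketch of how such a default tier list could be exposed through a `--tiers` flag. The actual argument parsing in profile_inference.py is not shown in this thread, so the parser wiring below is an assumption; only the default values (4, 6, 8, 12, 16, 24, 48 GB, with 20 GB opt-in) come from the review comment above.

```python
import argparse

parser = argparse.ArgumentParser(description="Tier boundary testing (sketch)")
parser.add_argument(
    "--tiers",
    type=int,
    nargs="+",
    default=[4, 6, 8, 12, 16, 24, 48],  # assumed defaults: 20 GB is not included
    help="Simulated VRAM sizes in GB to test in tier-test mode",
)

# 20 GB is only tested when passed explicitly:
args = parser.parse_args(["--tiers", "20", "24"])
print(args.tiers)  # [20, 24]
```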
```bash
# 模拟 8GB GPU (Tier 4)
MAX_CUDA_VRAM=8 uv run acestep

# 模拟 12GB GPU (Tier 5)
MAX_CUDA_VRAM=12 uv run acestep
```
Correct tier labels for 8GB/12GB simulation examples.
Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.
📝 Suggested update

```diff
-# 模拟 8GB GPU (Tier 4)
+# 模拟 8GB GPU (Tier 3)
-# 模拟 12GB GPU (Tier 5)
+# 模拟 12GB GPU (Tier 4)
```
🤖 Prompt for AI Agents
In `@docs/zh/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the tier labels in
the examples: change the heading "模拟 8GB GPU (Tier 4)" to "模拟 8GB GPU (Tier 3)"
and change "模拟 12GB GPU (Tier 5)" to "模拟 12GB GPU (Tier 4)"; keep the example
commands (MAX_CUDA_VRAM=8 uv run acestep and MAX_CUDA_VRAM=12 uv run acestep)
unchanged and ensure the documentation reflects the mapping ≤8GB → tier3 and
≤12GB → tier4.
````text
```
BOUNDARY ANALYSIS
=================
Capability                             Min Tier    VRAM
------------------------------------------------------------
No INT8 Quantization                   tier6b      20GB
No CPU Offload (all models on GPU)     tier6b      20GB
------------------------------------------------------------
```
````
Add a language specifier to the boundary-analysis output block.
📝 Suggested update

````diff
-```
+```text
 BOUNDARY ANALYSIS
 =================
 Capability                             Min Tier    VRAM
````

🤖 Prompt for AI Agents
In `@docs/zh/GPU_COMPATIBILITY.md` around lines 141 - 149, The fenced code block
that starts with the "BOUNDARY ANALYSIS" header should include a language
specifier to ensure proper rendering; update the opening triple-backtick for the
block containing "BOUNDARY ANALYSIS" and the table (the block that currently
begins with ``` and the header line "BOUNDARY ANALYSIS") to use a language tag
(e.g., change ``` to ```text) so the table renders as plain text in
documentation.
```python
        # Find LM model on disk
        lm_model = None
        lm_backend = gpu_config.recommended_backend
        if use_lm:
            lm_model = find_best_lm_model_on_disk(
                gpu_config.recommended_lm_model, disk_lm_models
            )
            if not lm_model:
                print(f"  ⚠️ No compatible LM model on disk for tier {tier}, skipping LM")
                use_lm = False
```
Filter disk LM models by tier-compatible sizes before selection.
Without size filtering, a larger disk-only LM can be chosen and cause false failures on low tiers.
🛠️ Suggested fix

```diff
         if use_lm:
-            lm_model = find_best_lm_model_on_disk(
-                gpu_config.recommended_lm_model, disk_lm_models
-            )
+            compatible_models = [
+                m for m in disk_lm_models
+                if is_lm_model_size_allowed(m, gpu_config.available_lm_models)
+            ]
+            lm_model = find_best_lm_model_on_disk(
+                gpu_config.recommended_lm_model, compatible_models
+            )
             if not lm_model:
                 print(f"  ⚠️ No compatible LM model on disk for tier {tier}, skipping LM")
                 use_lm = False
```
In `@profile_inference.py` around lines 789 - 798, The code may pick a disk-only
LM that is too large for the current tier because disk_lm_models are not
filtered by tier size; before calling find_best_lm_model_on_disk, filter
disk_lm_models to only include models whose size is compatible with the current
tier (use the tier variable and gpu_config-recommended sizing rules) and then
pass that filtered list to find_best_lm_model_on_disk (keep references to
lm_model, use_lm, find_best_lm_model_on_disk, disk_lm_models,
gpu_config.recommended_lm_model and tier so the change is easy to locate).
```python
    if torch.cuda.is_available():
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        total_gb = total_bytes / (1024 ** 3)
        if sim_gb < total_gb:
            reference_context_gb = 0.5
            allocator_budget_gb = max(0.5, sim_gb - reference_context_gb)
            fraction = max(0.01, min(1.0, allocator_budget_gb / total_gb))
            torch.cuda.set_per_process_memory_fraction(fraction)
```
🧩 Analysis chain
🏁 Scripts executed (against ace-step/ACE-Step-1.5; per-script output omitted here):

```bash
find . -name "profile_inference.py" -type f

# Get file size first
wc -l profile_inference.py 2>/dev/null || echo "File not found, searching..."
find . -name "*.py" -path "*profile*" -type f

# Read the specific lines with broader context (lines 800-835)
sed -n '800,835p' profile_inference.py | cat -n

# Search for tier-test related code and loop structures
rg -n "tier" profile_inference.py -i -A 2 -B 2

# Look for where this function is called
rg -n "torch.cuda.set_per_process_memory_fraction" . --type py -B 5 -A 5

# Check PyTorch documentation or source for set_per_process_memory_fraction behavior
rg -n "set_per_process_memory_fraction\|memory_fraction" . --type py -B 2 -A 2

# Look at the loop structure more carefully - check if tiers are tested in order
# and if multiple variants are called per tier
sed -n '1075,1190p' profile_inference.py | cat -n

# Check if there's any initialization or reset code before/after the tier loop
sed -n '1019,1100p' profile_inference.py | cat -n

# Search for any documentation or comments about the memory fraction behavior
rg -n "memory.fraction\|cuda.*fraction\|1\.0\|reset" profile_inference.py -i | head -30

# Check if there's any documentation or code comments about memory fraction in the codebase
rg -n "1\.0.*fraction\|fraction.*1\.0\|unrestricted\|reset.*cuda\|cuda.*reset" . --type py -i

# Look for any tests or examples of set_per_process_memory_fraction to understand typical usage
rg -n "set_per_process_memory_fraction" . --type py -B 3 -A 3
```
Reset CUDA memory fraction to 1.0 when sim_gb ≥ physical GPU VRAM.
Within each tier, multiple test variants (default, no-quant, no-offload, batch-N) are run sequentially. A smaller fraction set by an earlier variant persists and caps later variants, causing false failures or skewed VRAM measurements. Always reset to 1.0 when the simulated VRAM is at or above physical capacity to clear the prior cap.
🛠️ Suggested fix

```diff
         if sim_gb < total_gb:
             reference_context_gb = 0.5
             allocator_budget_gb = max(0.5, sim_gb - reference_context_gb)
             fraction = max(0.01, min(1.0, allocator_budget_gb / total_gb))
             torch.cuda.set_per_process_memory_fraction(fraction)
+        else:
+            # Ensure we don't keep a tighter cap from a previous tier or variant
+            torch.cuda.set_per_process_memory_fraction(1.0)
```
In `@profile_inference.py` around lines 815 - 823, The current CUDA
memory-fraction logic only sets a reduced fraction when sim_gb < physical VRAM,
but doesn't reset the per-process cap when sim_gb >= physical VRAM, leaving a
prior smaller cap in place; update the block in profile_inference.py that checks
torch.cuda.is_available() (the code using torch.cuda.get_device_properties,
total_gb, sim_gb and torch.cuda.set_per_process_memory_fraction) so that when
sim_gb >= total_gb you explicitly call
torch.cuda.set_per_process_memory_fraction(1.0) to clear any previous cap;
retain the existing reduced-fraction calculation path for sim_gb < total_gb.
8GB GPUs with 0.6B LM + offload have enough headroom for batch=2. DiT(4.46) + context(0.5) ≈ 5.0GB leaves ~3GB free, sufficient for 2 samples of DiT activations (~0.8GB each at 60s). Updated gpu_config.py and docs (en/zh/ja).
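As a quick sanity check of that arithmetic (all figures are the approximate measurements quoted in the commit message, not guarantees):

```python
# Approximate tier3 (8 GB) budget with the 0.6B LM offloaded to CPU.
total_vram_gb = 8.0
dit_weights_gb = 4.46              # DiT resident on GPU
cuda_context_gb = 0.5              # CUDA context overhead
per_sample_activation_gb = 0.8     # DiT activations per sample at 60s

baseline_gb = dit_weights_gb + cuda_context_gb                      # ~4.96 GB
free_gb = total_vram_gb - baseline_gb                               # ~3.0 GB headroom
needed_for_batch2_gb = baseline_gb + 2 * per_sample_activation_gb   # ~6.6 GB

print(f"free ≈ {free_gb:.2f} GB, batch=2 needs ≈ {needed_for_batch2_gb:.2f} GB")
assert needed_for_batch2_gb < total_vram_gb   # ~1.4 GB margin remains
```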
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@acestep/handler.py`:
- Around line 1605-1607: The device check currently compares the device variable
to strings and can miss non-CUDA devices (e.g., torch.device objects like
"xpu"), causing CUDA-only VRAM logic to run incorrectly; update the check in the
handler where device is read (the block using device = self.device and the
similar block at the later occurrence) to reliably detect CUDA by converting to
string or using device.type (e.g., if str(device) in ("cpu","mps") or if
getattr(device,"type",str(device)) in ("cpu","mps")), and keep the subsequent
call to get_effective_free_vram_gb() only for CUDA devices so batch_size isn't
forced to 1 on non-CUDA devices.
🧹 Nitpick comments (1)
acestep/handler.py (1)
1583-1586: `use_lm` is currently unused in the guard. Either drop it or use it to pick the appropriate tier limit so callers can control whether LM-specific caps apply.
♻️ Possible refinement
```diff
-        tier_max = gpu_config.max_batch_size_with_lm
+        tier_max = (
+            gpu_config.max_batch_size_with_lm
+            if use_lm
+            else gpu_config.max_batch_size_without_lm
+        )
```

Also applies to: 1615-1622
```python
        device = self.device
        if device == "cpu" or device == "mps":
            return batch_size  # No CUDA VRAM to guard
```
Skip CUDA-only free-VRAM checks on non-CUDA devices to avoid forced batch=1.
The current guard uses string equality comparison on what is likely a torch.device object, which fails for non-CUDA devices. Devices like "xpu" bypass this check and proceed to call get_effective_free_vram_gb(), which may report 0 VRAM and collapse batch size. Convert the device to string before comparison or use .type attribute to reliably detect CUDA devices.
🛠️ Suggested fix
```diff
-        device = self.device
-        if device == "cpu" or device == "mps":
-            return batch_size  # No CUDA VRAM to guard
+        device = self.device
+        device_str = str(device)
+        is_cuda = device_str == "cuda" or device_str.startswith("cuda")
+        if not is_cuda:
+            return batch_size  # No CUDA VRAM to guard
```

Also applies to: 1631-1633
🤖 Prompt for AI Agents
In `@acestep/handler.py` around lines 1605 - 1607, The device check currently
compares the device variable to strings and can miss non-CUDA devices (e.g.,
torch.device objects like "xpu"), causing CUDA-only VRAM logic to run
incorrectly; update the check in the handler where device is read (the block
using device = self.device and the similar block at the later occurrence) to
reliably detect CUDA by converting to string or using device.type (e.g., if
str(device) in ("cpu","mps") or if getattr(device,"type",str(device)) in
("cpu","mps")), and keep the subsequent call to get_effective_free_vram_gb()
only for CUDA devices so batch_size isn't forced to 1 on non-CUDA devices.
When batch_size > 1, VAE decode VRAM scales linearly with batch size. On 8GB GPUs (tier3, batch=2), decoding 2 samples at once exceeds VRAM. Fix: In _tiled_decode_inner, when B > 1, decode each sample individually and move results to CPU immediately after each decode. This keeps peak VRAM constant regardless of batch size. Also updated tier3 max_batch_size_with_lm from 1 to 2 (8GB GPUs with 0.6B LM + offload have sufficient headroom for batch=2).
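A minimal sketch of the batch-sequential decode pattern described above, assuming the VAE exposes a `decode()` method that returns an audio tensor; the real `_tiled_decode_inner` additionally tiles along the time axis, and its exact signature is not shown here.

```python
import torch

def decode_batch_sequentially(vae, latents: torch.Tensor) -> torch.Tensor:
    """Decode one sample at a time so peak VRAM stays constant regardless
    of batch size; each decoded sample is moved to CPU immediately."""
    outputs = []
    for i in range(latents.shape[0]):            # iterate over the B samples
        with torch.no_grad():
            audio = vae.decode(latents[i:i + 1])  # keep a batch dim of 1
        outputs.append(audio.cpu())               # free GPU memory right away
        del audio
        torch.cuda.empty_cache()
    return torch.cat(outputs, dim=0)
```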
Now that VAE decode is batch-sequential (no extra VRAM per sample), the bottleneck is only DiT activations, which scale modestly. Updated batch limits:
- tier5 (12-16GB): with_lm 2→4, without_lm stays 4
- tier6b (20-24GB): with_lm 4→8, without_lm stays 8

Summary of all tiers (LM / No LM):
- tier1 ≤4GB: 1/1
- tier2 4-6GB: 1/1
- tier3 6-8GB: 2/2
- tier4 8-12GB: 2/4
- tier5 12-16GB: 4/4
- tier6a 16-20GB: 4/8
- tier6b 20-24GB: 8/8
- unlimited ≥24GB: 8/8

Updated docs (en/zh/ja).
- test_time_scaling.py: Add _load_scoring_model_context() that moves the HF scoring model to GPU only during the forward pass and offloads it back to CPU afterwards (for vllm/mlx backends). Move output logits to CPU to avoid keeping large vocab tensors on GPU.
- llm_inference.py: When offload_to_cpu=True, keep the HF scoring model on CPU after initial loading (vllm/mlx backends). The context manager in test_time_scaling.py handles GPU placement on demand.
- dit_alignment_score.py: Force MusicLyricScorer.calculate_score() to always compute on CPU. The scoring matrices are small and do not benefit from GPU acceleration, while occupying VRAM that DiT/VAE/LM need on low-VRAM GPUs.
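A hedged sketch of the on-demand placement idea for the scoring model; the actual _load_scoring_model_context in test_time_scaling.py may differ in naming and in how it detects the backend.

```python
from contextlib import contextmanager
import torch

@contextmanager
def load_scoring_model_on_gpu(model: torch.nn.Module, device: str = "cuda"):
    """Move the HF scoring model to GPU only for the forward pass,
    then offload it back to CPU and release cached VRAM."""
    model.to(device)
    try:
        yield model
    finally:
        model.to("cpu")
        torch.cuda.empty_cache()

# Usage sketch: move logits to CPU right away so the large vocab tensor
# does not linger in VRAM.
# with load_scoring_model_on_gpu(scorer) as m:
#     logits = m(input_ids.to("cuda")).logits.float().cpu()
```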
…//github.com/ace-step/ACE-Step-1.5 into feat/gpu-compatibility-tier-boundary-testing
Keep _vram_guard_reduce_batch (our feature). Remove _start_diffusion_progress_estimator (now provided by ProgressMixin from acestep/core/generation/handler/progress.py).