
Conversation

@ChuxiJ (Contributor) commented Feb 12, 2026

Summary by CodeRabbit

  • New Features

    • Added "Cover Strength" parameter to control noise initialization in audio generation (range: 0 = pure noise, 1 = closest to source audio).
    • Enhanced generation mode selection to intelligently reflect model capabilities.
    • Implemented automatic model code synchronization on startup.
  • Improvements

    • Generation mode now persists correctly across model initialization.
    • UI state propagation expanded for better consistency across generation workflows.
  • Localization

    • Added multilingual support for new "Cover Strength" control in English, Hebrew, Japanese, and Chinese.
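
For illustration, a minimal sketch of how such a noise-initialization blend could work, assuming the strength linearly interpolates between Gaussian noise and the source-audio latent; the parameter name and the exact formula used in the PR are not shown in this summary, so treat this as an assumption:

import torch

def init_cover_latent(source_latent: torch.Tensor, cover_strength: float) -> torch.Tensor:
    """Hypothetical blend: 0.0 -> pure noise, 1.0 -> closest to the source audio."""
    noise = torch.randn_like(source_latent)
    return cover_strength * source_latent + (1.0 - cover_strength) * noise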

@coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

This PR introduces cover noise strength parameter support throughout the generation pipeline, implements automatic model code synchronization between source and checkpoint directories, adds pure base model detection logic to distinguish model variants, preserves generation mode across initialization, and contributes comprehensive ACEStep v15 model implementations with guidance mechanisms.

Changes

Cohort / File(s) | Summary
Constants & Metadata
acestep/constants.py, acestep/models/__init__.py
Updated descriptive comments for generation mode declarations; added header marking ACEStep models directory as authoritative source with auto-sync on startup.
Model Code Synchronization
acestep/model_downloader.py, acestep/handler.py
Introduced model code mismatch detection and auto-sync subsystem: checkpoint-to-variant mappings, file hashing, mismatch detection via _check_code_mismatch, sync via _sync_model_code_files. Integrated into download flows and service initialization to synchronize Python files between source models and checkpoints.
Cover Noise Strength Parameter Threading
acestep/handler.py, acestep/inference.py, acestep/gradio_ui/events/results_handlers.py
Added cover_noise_strength parameter (default 0.0) with propagation through batch preparation, generation kwargs, and GenerationParams; exposed in public function signatures for generate_music, service_generate, and _prepare_batch.
Pure Base Model Detection & Mode Preservation
acestep/gradio_ui/events/generation_handlers.py, acestep/gradio_ui/events/__init__.py
Introduced _is_pure_base_model helper to classify models; updated get_generation_mode_choices, get_model_type_ui_settings, and update_model_type_settings to accept is_pure_base and current_mode parameters; modified init button and config path handlers to propagate generation mode and base model context across UI reconfigurations.
Generation UI & i18n
acestep/gradio_ui/interfaces/generation.py, acestep/gradio_ui/i18n/{en,he,ja,zh}.json
Added cover_noise_strength slider to generation UI (visible in Remix mode); integrated _is_pure_base_model for initial mode choices computation; added translated labels and info text for cover noise strength control across four languages. A slider sketch follows this table.
Model Configurations (Base/SFT/Turbo)
acestep/models/base/configuration_acestep_v15.py, acestep/models/sft/configuration_acestep_v15.py, acestep/models/turbo/configuration_acestep_v15.py
Added AceStepConfig classes extending PretrainedConfig with comprehensive model hyperparameters (RoPE, attention, quantization, encoder/decoder settings, timbre/audio parameters); includes rope_scaling normalization, layer type validation, and backward compatibility handling.
Guidance Mechanisms
acestep/models/base/apg_guidance.py, acestep/models/sft/apg_guidance.py
Introduced MomentumBuffer and guidance functions (APG, CFG, ADG variants) with utilities for projection, cosine similarity, and perpendicular component decomposition; supports angle-based dynamic guidance with optional normalization and clipping.
Model Implementations (Base/SFT/Turbo)
acestep/models/base/modeling_acestep_v15_base.py, acestep/models/sft/modeling_acestep_v15_base.py, acestep/models/turbo/modeling_acestep_v15_turbo.py
Comprehensive diffusion-transformer architectures with mask utilities, embeddings, multi-head attention, encoder/decoder layers, audio tokenization/detokenization, conditioning encoders, and end-to-end generation models; includes support for flow-matching training, CFG, and cross-attention conditioning across text/lyric/timbre inputs.
Documentation
docs/lora-scale-deterministic-registry-note-2026-02-10.md
Removed LoRA scale deterministic registry design documentation note.
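
A hedged sketch of the slider wiring described in the Generation UI row above, using Gradio's standard Slider component; the label text and the exact visibility rule are assumptions, not the PR's code:

import gradio as gr

def build_cover_strength_slider() -> gr.Slider:
    # Hidden by default; shown only when the generation mode is Remix.
    return gr.Slider(
        minimum=0.0,
        maximum=1.0,
        value=0.0,
        step=0.01,
        label="Cover Strength",
        info="0 = pure noise, 1 = closest to the source audio",
        visible=False,
    )

def on_mode_change(mode: str):
    # Toggle visibility when the user switches generation modes.
    return gr.update(visible=(mode == "Remix"))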

Sequence Diagram(s)

sequenceDiagram
    participant UI as UI Handler
    participant Config as Config Path Handler
    participant ModelInit as Init Service
    participant GenHandlers as Generation Handlers
    
    UI->>Config: config_path changed (with generation_mode)
    Config->>GenHandlers: update_model_type_settings(config_path, current_mode)
    GenHandlers->>GenHandlers: _is_pure_base_model(config_path)
    GenHandlers->>GenHandlers: get_generation_mode_choices(is_pure_base)
    GenHandlers-->>Config: mode choices updated (preserving current_mode if valid)
    
    UI->>UI: user clicks init button
    UI->>ModelInit: init_service_wrapper(..., current_mode)
    ModelInit->>GenHandlers: get_model_type_ui_settings(..., current_mode, is_pure_base)
    GenHandlers-->>ModelInit: UI settings with preserved mode
    ModelInit-->>UI: generation_mode (preserved across init)
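
A rough sketch of the mode-preservation logic this diagram describes. The helper name _is_pure_base_model and the idea of filtering mode choices come from the PR summary; the classification heuristic and mode names below are assumptions for illustration only:

from typing import List, Optional

def _is_pure_base_model(config_path: str) -> bool:
    # Assumed heuristic: a base checkpoint that is not an SFT or turbo variant.
    name = config_path.lower()
    return "base" in name and not any(tag in name for tag in ("sft", "turbo"))

def get_generation_mode_choices(is_pure_base: bool) -> List[str]:
    # Assumed mode names; a pure base model may expose fewer modes.
    return ["text2music"] if is_pure_base else ["text2music", "remix", "cover"]

def resolve_mode(choices: List[str], current_mode: Optional[str]) -> str:
    # Preserve the user's current mode across re-initialization when it is still valid.
    return current_mode if current_mode in choices else choices[0]
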
sequenceDiagram
    participant Download as Model Downloader
    participant Source as Models Source Dir
    participant Checkpoint as Checkpoint Dir
    participant Handler as Service Handler
    
    Download->>Download: download_main_model()
    Download->>Download: _sync_model_code_files(variant_name, checkpoint_dir)
    Download->>Checkpoint: copy Python files from source variant
    Checkpoint-->>Download: files synced & logged
    
    Handler->>Handler: init_service_wrapper()
    Handler->>Download: _check_code_mismatch(config_path, checkpoint)
    alt Mismatch Detected
        Handler->>Download: _sync_model_code_files()
        Download->>Source: locate authoritative source
        Download->>Checkpoint: sync/overwrite Python files
    end
    Checkpoint-->>Handler: code synchronized
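
A simplified sketch of the hash-compare-and-copy flow in this diagram. The names mirror _check_code_mismatch and _sync_model_code_files from the summary, but the signatures, checkpoint layout, and choice of hash are assumptions:

import hashlib
import shutil
from pathlib import Path
from typing import List

def _file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_code_mismatch(source_dir: Path, checkpoint_dir: Path) -> List[str]:
    """Return names of .py files whose checkpoint copy differs from the source."""
    mismatched = []
    for src in source_dir.glob("*.py"):
        dst = checkpoint_dir / src.name
        if not dst.exists() or _file_hash(dst) != _file_hash(src):
            mismatched.append(src.name)
    return mismatched

def sync_model_code_files(source_dir: Path, checkpoint_dir: Path) -> List[str]:
    """Overwrite mismatched checkpoint .py files from the authoritative source directory."""
    synced = []
    for name in check_code_mismatch(source_dir, checkpoint_dir):
        shutil.copy2(source_dir / name, checkpoint_dir / name)
        synced.append(name)
    return synced
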
sequenceDiagram
    participant UI as UI/Handler
    participant ResultsHandler as Results Handler
    participant GenerationParams as Generation Params
    participant Inference as Inference Engine
    
    UI->>ResultsHandler: generate_with_progress(..., cover_noise_strength)
    ResultsHandler->>ResultsHandler: capture_current_params(cover_noise_strength)
    ResultsHandler->>GenerationParams: create GenerationParams(cover_noise_strength=value)
    GenerationParams-->>ResultsHandler: params ready
    ResultsHandler->>Inference: call with params.cover_noise_strength
    Inference->>Inference: DiT forward(..., cover_noise_strength)
    Inference-->>ResultsHandler: generated audio with noise characteristics applied
    ResultsHandler->>UI: batch/state updated with cover_noise_strength preserved
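
A condensed sketch of the threading shown above, assuming GenerationParams is a dataclass-style container and the value travels as a keyword argument; every name here other than cover_noise_strength is illustrative:

from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class GenerationParamsSketch:
    caption: str = ""
    audio_duration: float = 30.0
    cover_noise_strength: float = 0.0  # 0.0 = pure-noise initialization (default)

def build_generate_kwargs(params: GenerationParamsSketch) -> Dict[str, Any]:
    # The UI value is captured into params, forwarded to inference, and recorded in
    # the batch/state so it can be restored from history later.
    return {"cover_noise_strength": params.cover_noise_strength}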

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: PR introduces substantial new model code (three 2000+ line modeling files with complex transformer architectures), new model code synchronization subsystem with file hashing and mismatch detection, consistent but wide parameter threading across handlers/UI/inference, new helper functions for base model detection, multiple configuration classes, and guidance mechanism utilities. While many edits follow repetitive patterns (same parameter added across functions), the modeling modules contain dense logic with multiple interdependent components requiring careful review.

Poem

🐰 A rabbit hops through models grand,
where base and turbo take their stand,
with cover strength now in the mix,
and code that syncs—no more conflict!
Mode preserved through init's dance,
transformers bloom with newfound chance! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 37.78%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'support cover (exp)' is vague and does not clearly convey what changes are being made. While the word 'cover' relates to some files, the title uses the non-descriptive abbreviation '(exp)' and lacks specificity about the actual feature or modification. Resolution: use a more descriptive title that clarifies the main change, such as 'Add cover noise strength parameter and base model support' or 'Implement cover audio conditioning with noise strength control'.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Merge Conflict Detection (✅ Passed): No merge conflicts detected when merging into main.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch support_cover

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai bot left a comment


Actionable comments posted: 2

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
acestep/constants.py (1)

1-4: 🛠️ Refactor suggestion | 🟠 Major

Module exceeds 200 LOC; split or add a follow‑up plan.

This file is over the hard cap; please split by responsibility or document a concrete follow‑up split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/inference.py (1)

1-12: 🛠️ Refactor suggestion | 🟠 Major

File exceeds 200 LOC hard cap — split or document a concrete follow‑up plan.
This module is well over the hard cap; please split by responsibility or add PR notes with a concrete split plan.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/gradio_ui/events/__init__.py (2)

1-7: 🛠️ Refactor suggestion | 🟠 Major

File exceeds 200 LOC hard cap — split or document a concrete follow‑up plan.
This event wiring module is far over the hard cap; please split by responsibility or add a concrete split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".


972-1003: ⚠️ Potential issue | 🟠 Major

restore_params_btn doesn't restore cover_noise_strength.

The slider is saved in batch parameters but won't be restored, breaking the "reuse historical settings" feature. Add it to both the handler extraction and the outputs list.

🧩 Suggested fix

In results_handlers.py around line 2328, add extraction:

     latent_shift = params.get("latent_shift", 0.0)
     latent_rescale = params.get("latent_rescale", 1.0)
+    cover_noise_strength = params.get("cover_noise_strength", 0.0)

Update return statement (lines 2345-2353):

     return (
         codes_main, captions, lyrics, bpm, key_scale, time_signature,
         vocal_language, audio_duration, batch_size_input, inference_steps,
         lm_temperature, lm_cfg_scale, lm_top_k, lm_top_p, think_checkbox,
         use_cot_caption, use_cot_language, allow_lm_batch,
         track_name, complete_track_classes,
-        enable_normalization, normalization_db,
-        latent_shift, latent_rescale
+        enable_normalization, normalization_db,
+        latent_shift, latent_rescale, cover_noise_strength
     )

In __init__.py add to outputs list:

         generation_section["latent_shift"],
         generation_section["latent_rescale"],
+        generation_section["cover_noise_strength"],
     ]
acestep/model_downloader.py (1)

1-20: 🛠️ Refactor suggestion | 🟠 Major

File exceeds 200 LOC hard cap — split or document a concrete follow‑up plan.
This module is well past the hard cap; please split by responsibility or add a concrete split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/handler.py (3)

1966-1981: ⚠️ Potential issue | 🟡 Minor

cover_noise_strength is unused in _prepare_batch.

It’s accepted but never used or recorded, and the docstring doesn’t mention it. Either remove it (and update callers) or include it in the batch + docstring so the data flow is explicit. As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant."
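
One hedged way to make the data flow explicit, assuming _prepare_batch assembles a dict-like batch (the real batch structure in handler.py may differ):

def _prepare_batch_sketch(cover_noise_strength: float = 0.0, **fields) -> dict:
    """Illustrative only: record cover_noise_strength on the prepared batch."""
    batch = dict(fields)
    batch["cover_noise_strength"] = cover_noise_strength
    return batch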


3372-3417: ⚠️ Potential issue | 🟡 Minor

Document cover_noise_strength in generate_music docstring.

This is a modified public function and the new parameter isn’t described. Please add it as a key input. As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant."


2728-2835: ⚠️ Potential issue | 🟠 Major

Add cover_noise_strength parameter to MLX diffusion path for API compatibility.

The PyTorch path receives cover_noise_strength via generate_kwargs, but the MLX fast-path ignores it entirely. To maintain consistent API signatures between backends, add the parameter to _mlx_run_diffusion() and thread it through to the call site.

🔧 Suggested update
     def _mlx_run_diffusion(
         self,
         encoder_hidden_states,
         encoder_attention_mask,
         context_latents,
         src_latents,
         seed,
         infer_method: str = "ode",
         shift: float = 3.0,
         timesteps=None,
         audio_cover_strength: float = 1.0,
+        cover_noise_strength: float = 0.0,
         encoder_hidden_states_non_cover=None,
         encoder_attention_mask_non_cover=None,
         context_latents_non_cover=None,

And at the call site in handler.py:

                         outputs = self._mlx_run_diffusion(
                             encoder_hidden_states=encoder_hidden_states,
                             encoder_attention_mask=encoder_attention_mask,
                             context_latents=context_latents,
                             src_latents=src_latents,
                             seed=seed_param,
                             infer_method=infer_method,
                             shift=shift,
                             timesteps=ts_arg,
                             audio_cover_strength=audio_cover_strength,
+                            cover_noise_strength=cover_noise_strength,
                             encoder_hidden_states_non_cover=enc_hs_nc,
                             encoder_attention_mask_non_cover=enc_am_nc,
                             context_latents_non_cover=ctx_nc,
                         )
🤖 Fix all issues with AI agents
In `@acestep/models/sft/modeling_acestep_v15_base.py`:
- Around lines 1642-1645: the code treats self.tokenize as an object with a quantizer attribute, which raises AttributeError because self.tokenize is a bound method. Access the quantizer from self.tokenizer instead (replace self.tokenize.quantizer.get_output_from_indices(...) with self.tokenizer.quantizer.get_output_from_indices(...)), and keep the method call self.tokenize(hidden_states, silence_latent, attention_mask) for obtaining lm_hints_5Hz, indices, llm_mask. Symbols touched: self.tokenize, self.tokenizer, get_output_from_indices, lm_hints_5Hz, indices, llm_mask.

In `@acestep/models/turbo/modeling_acestep_v15_turbo.py`:
- Around lines 1639-1641: the code treats self.tokenize as an object with a .quantizer attribute, but self.tokenize is a method, so the access raises AttributeError. In the block that checks audio_codes, obtain the quantizer from the correct object instead: most likely self.tokenizer.quantizer.get_output_from_indices(audio_codes), mirroring the fix in the base/SFT models, or self.quantizer.get_output_from_indices(audio_codes) if the quantizer lives directly on the instance.
🟠 Major comments (26)
acestep/models/__init__.py-1-3 (1)

1-3: 🛠️ Refactor suggestion | 🟠 Major

Add a module docstring (required by guidelines).

These header comments should be a proper module docstring describing purpose plus key behaviors (authoritative source and auto-sync).
As per coding guidelines: “Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant.”
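
One possible shape for the requested module docstring, with wording drawn from the behaviors named in this PR (illustrative, not the authors' text):

"""ACE-Step model variants (base / sft / turbo).

This directory is the authoritative source for model code; matching Python files
in downloaded checkpoints are auto-synced from these sources on startup.
"""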

acestep/models/sft/apg_guidance.py-1-220 (1)

1-220: ⚠️ Potential issue | 🟠 Major

Code duplication: this file is identical to acestep/models/base/apg_guidance.py.

Both files are 220 LOC and contain identical implementations. This creates maintenance burden—any bug fix or enhancement must be applied in two places.

Consider one of these approaches:

  1. Import and re-export: If SFT needs the same API, simply re-export from base:
    from acestep.models.base.apg_guidance import (
        MomentumBuffer, project, apg_forward, cfg_forward,
        call_cos_tensor, compute_perpendicular_component,
        adg_forward, adg_w_norm_forward, adg_wo_clip_forward,
    )
  2. Inheritance/composition: If SFT variants will diverge, extend the base rather than copy.

Additionally, at 220 LOC this file exceeds the 200 LOC hard cap. Per coding guidelines: "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan."

acestep/models/base/apg_guidance.py-1-220 (1)

1-220: ⚠️ Potential issue | 🟠 Major

Module exceeds the 200 LOC hard cap — split by responsibility or add a follow‑up plan.

This file is ~220 LOC; please split (e.g., projection/momentum helpers vs ADG wrappers) or add a concrete follow-up split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/models/base/apg_guidance.py-1-3 (1)

1-3: ⚠️ Potential issue | 🟠 Major

Add a module-level docstring.

All new or modified Python modules must include a concise docstring with purpose, key inputs/outputs, and relevant exceptions per the coding guidelines.

📝 Proposed module docstring
+"""APG/ADG guidance utilities for ACE-Step diffusion models.
+
+Inputs: conditional/unconditional prediction tensors and guidance parameters.
+Outputs: guided prediction tensors.
+Raises: ValueError/TypeError for invalid shapes or sigma types.
+"""
 import torch
 import torch.nn.functional as F
acestep/models/base/apg_guidance.py-16-30 (1)

16-30: ⚠️ Potential issue | 🟠 Major

Fix device preservation and mutable default in project; also update duplicate in sft module.

Using v0.device.type and returning via .to(device_type) loses the device index (e.g., cuda:1 becomes cuda), which can move tensors to the wrong GPU. Also, dims=[-1] is a mutable default argument, a Python anti-pattern. The function also lacks the docstring and return type hint required by the coding guidelines.

The same bug exists identically in acestep/models/sft/apg_guidance.py (lines 16-30).

🐛 Proposed fix
 def project(
     v0: torch.Tensor,  # [B, C, T]
     v1: torch.Tensor,  # [B, C, T]
-    dims=[-1],
-):
-    dtype = v0.dtype
-    device_type = v0.device.type
-    if device_type == "mps":
+    dims=None,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    """Project v0 into components parallel and orthogonal to v1."""
+    if dims is None:
+        dims = (-1,)
+    dtype = v0.dtype
+    device = v0.device
+    if device.type == "mps":
         v0, v1 = v0.cpu(), v1.cpu()
 
     v0, v1 = v0.double(), v1.double()
     v1 = torch.nn.functional.normalize(v1, dim=dims)
     v0_parallel = (v0 * v1).sum(dim=dims, keepdim=True) * v1
     v0_orthogonal = v0 - v0_parallel
-    return v0_parallel.to(dtype).to(device_type), v0_orthogonal.to(dtype).to(device_type)
+    return (
+        v0_parallel.to(device=device, dtype=dtype),
+        v0_orthogonal.to(device=device, dtype=dtype),
+    )
acestep/models/base/apg_guidance.py-5-13 (1)

5-13: ⚠️ Potential issue | 🟠 Major

Add class and method docstrings and return type hints for MomentumBuffer.

Docstrings and return type hints are required for new/modified Python code per the coding guidelines.

📝 Proposed update
 class MomentumBuffer:
+    """Momentum accumulator for APG updates.
+
+    Args:
+        momentum: Exponential decay factor for the running average.
+    """

-    def __init__(self, momentum: float = -0.75):
+    def __init__(self, momentum: float = -0.75) -> None:
         self.momentum = momentum
         self.running_average = 0

-    def update(self, update_value: torch.Tensor):
+    def update(self, update_value: torch.Tensor) -> None:
+        """Update the running average in-place.
+
+        Args:
+            update_value: Latest update tensor.
+        """
         new_average = self.momentum * self.running_average
         self.running_average = update_value + new_average
acestep/models/base/apg_guidance.py-33-56 (1)

33-56: ⚠️ Potential issue | 🟠 Major

Add docstring, return type hint, and fix mutable default argument in apg_forward.

This function lacks mandatory docstring and return type annotations required by coding guidelines. The dims=[-1] mutable default argument should be None with runtime initialization.

📝 Proposed update
 def apg_forward(
     pred_cond: torch.Tensor,  # [B, C, T]
     pred_uncond: torch.Tensor,  # [B, C, T]
     guidance_scale: float,
     momentum_buffer: MomentumBuffer = None,
     eta: float = 0.0,
     norm_threshold: float = 2.5,
-    dims=[-1],
-):
+    dims=None,
+) -> torch.Tensor:
+    """APG guidance forward pass.
+
+    Args:
+        pred_cond: Conditional prediction tensor.
+        pred_uncond: Unconditional prediction tensor.
+        guidance_scale: Guidance strength.
+        momentum_buffer: Optional momentum accumulator.
+        eta: Parallel component mix factor.
+        norm_threshold: Norm clipping threshold.
+        dims: Dimensions to reduce over for norms/projections.
+    Returns:
+        Guided prediction tensor.
+    """
+    if dims is None:
+        dims = (-1,)
acestep/models/base/apg_guidance.py-107-176 (1)

107-176: ⚠️ Potential issue | 🟠 Major

Fix numerical stability issue when sin(theta) is small and mark unused variable.

The division by sin(theta) occurs before the mask is applied (lines 171-172), so NaN/inf values can be produced and propagate downstream; use a safe denominator with torch.where to avoid this. Also, proj is assigned but never used (line 168); rename it to _proj.

Proposed fix
-    proj, perp = compute_perpendicular_component(latent_diff, latent_hat_uncond)
+    _proj, perp = compute_perpendicular_component(latent_diff, latent_hat_uncond)
     latent_v_new = torch.cos(latent_theta_new) * latent_hat_text
 
-    latent_p_new = perp * torch.sin(latent_theta_new) / torch.sin(latent_theta) * (
-        torch.sin(latent_theta) > 1e-3) + perp * weight * (torch.sin(latent_theta) <= 1e-3)
+    sin_theta = torch.sin(latent_theta)
+    sin_theta_new = torch.sin(latent_theta_new)
+    safe_denom = torch.where(sin_theta.abs() > 1e-3, sin_theta, torch.ones_like(sin_theta))
+    latent_p_new = perp * sin_theta_new / safe_denom
+    latent_p_new = torch.where(sin_theta.abs() > 1e-3, latent_p_new, perp * weight)
acestep/models/base/modeling_acestep_v15_base.py-1083-1091 (1)

1083-1091: ⚠️ Potential issue | 🟠 Major

CLS/special token is never prepended, yet its slot is used for pooling.

The special-token concat is commented out, but you later return hidden_states[:, 0, :], which now corresponds to the first input frame. Either prepend the CLS token (and pad the attention mask) or switch to a different pooling strategy.

🛠️ Proposed fix (prepend CLS and pad mask)
@@
-        # Prepend special token for timbre aggregation (CLS-like token)
-        # inputs_embeds = torch.cat([self.special_token.expand(inputs_embeds.shape[0], 1, -1), inputs_embeds], dim=1)
+        # Prepend special token for timbre aggregation (CLS-like token)
+        inputs_embeds = torch.cat(
+            [self.special_token.expand(inputs_embeds.shape[0], 1, -1), inputs_embeds], dim=1
+        )
+        if attention_mask is not None:
+            attention_mask = F.pad(attention_mask, (1, 0), value=1)

Also applies to: 1173-1177

acestep/models/base/modeling_acestep_v15_base.py-1-54 (1)

1-54: 🛠️ Refactor suggestion | 🟠 Major

Split this module; it exceeds the 200‑LOC hard cap.

This file is ~2k LOC and bundles utilities, encoders, diffusion, generation, and test harness logic. Please split by responsibility (or document a concrete split plan in the PR notes) before merge.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/models/base/modeling_acestep_v15_base.py-289-372 (1)

289-372: ⚠️ Potential issue | 🟠 Major

Add missing docstrings (and type hints where practical) for public callables.

The module lacks a top-level docstring, and several new/modified functions (e.g., AceStepAttention.forward, AceStepEncoderLayer.forward, AceStepDiTLayer.forward, AceStepDiTModel.forward, generate_audio, test_forward) have no docstrings. sample_t_r also lacks type hints. Please add concise docstrings covering purpose, key inputs/outputs, and exceptions, plus type hints where practical.

As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant" and "Add type hints for new/modified functions when practical in Python".
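
As an illustration of the requested style, a hedged docstring for sample_t_r; the signature matches the one quoted later in this review, while the parameter descriptions are inferred from the names and may not match the authors' intent:

import torch

def sample_t_r(
    batch_size: int,
    device: torch.device,
    dtype: torch.dtype,
    data_proportion: float = 0.0,
    timestep_mu: float = -0.4,
    timestep_sigma: float = 1.0,
    use_meanflow: bool = True,
):
    """Sample diffusion timesteps (t, r) for flow-matching training.

    Args:
        batch_size: Number of timestep pairs to draw.
        device / dtype: Placement and dtype of the returned tensors.
        data_proportion: Assumed fraction of samples drawn near the data end of the schedule.
        timestep_mu / timestep_sigma: Assumed parameters of the timestep sampling distribution.
        use_meanflow: Whether to draw a distinct second timestep r.

    Returns:
        Tuple (t, r) of [batch_size] tensors.
    """
    ...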

acestep/models/base/modeling_acestep_v15_base.py-1636-1646 (1)

1636-1646: ⚠️ Potential issue | 🟠 Major

self.tokenize.quantizer will crash; replace with self.tokenizer.quantizer and drop print.

self.tokenize is a bound method, so the attribute access will raise when audio_codes is provided. Also avoid print in library code—use the module logger.

🛠️ Proposed fix
@@
-        if precomputed_lm_hints_25Hz is not None:
-            print("Using precomputed LM hints")
+        if precomputed_lm_hints_25Hz is not None:
+            logger.debug("Using precomputed LM hints")
             lm_hints_25Hz = precomputed_lm_hints_25Hz[:, :src_latents.shape[1], :]
         else:
             if audio_codes is not None:
-                lm_hints_5Hz = self.tokenize.quantizer.get_output_from_indices(audio_codes)
+                lm_hints_5Hz = self.tokenizer.quantizer.get_output_from_indices(audio_codes)
             else:
                 lm_hints_5Hz, indices, llm_mask = self.tokenize(hidden_states, silence_latent, attention_mask)
As per coding guidelines, "Keep logging actionable in Python; avoid noisy logs and `print` debugging in committed code".
acestep/models/base/modeling_acestep_v15_base.py-1381-1428 (1)

1381-1428: ⚠️ Potential issue | 🟠 Major

Provided padding masks are discarded; encoder mask uses the wrong source.

attention_mask and encoder_attention_mask are overwritten with None, so padding is never applied. In the non‑Flash path the encoder mask is also built from attention_mask instead of encoder_attention_mask, letting decoder tokens attend to padded encoder positions.

🛠️ Proposed fix (preserve caller masks)
@@
-        full_attn_mask = None
-        sliding_attn_mask = None
-        encoder_attention_mask = None
-        attention_mask = None
+        full_attn_mask = None
+        sliding_attn_mask = None
+        encoder_attn_mask = None
@@
-            full_attn_mask = attention_mask
+            full_attn_mask = attention_mask
@@
-            sliding_attn_mask = attention_mask if self.config.use_sliding_window else None
+            sliding_attn_mask = attention_mask if self.config.use_sliding_window else None
+            encoder_attn_mask = encoder_attention_mask
@@
-            full_attn_mask = create_4d_mask(
+            full_attn_mask = create_4d_mask(
                 seq_len=seq_len,
                 dtype=dtype,
                 device=device,
                 attention_mask=attention_mask,     # [B, L]
@@
-            encoder_attention_mask = create_4d_mask(
-                seq_len=max_len,
-                dtype=dtype,
-                device=device,
-                attention_mask=attention_mask,     # [B, L]
-                sliding_window=None,
-                is_sliding_window=False,
                is_causal=False                    # <--- key: bidirectional attention
-            )
-            encoder_attention_mask = encoder_attention_mask[:, :, :seq_len, :encoder_seq_len]
+            if encoder_attention_mask is not None:
+                max_len = max(seq_len, encoder_seq_len)
+                enc_mask = encoder_attention_mask
+                if enc_mask.shape[1] != max_len:
+                    enc_mask = F.pad(enc_mask, (0, max_len - enc_mask.shape[1]), value=0)
+                encoder_attn_mask = create_4d_mask(
+                    seq_len=max_len,
+                    dtype=dtype,
+                    device=device,
+                    attention_mask=enc_mask,
+                    sliding_window=None,
+                    is_sliding_window=False,
+                    is_causal=False
+                )
+                encoder_attn_mask = encoder_attn_mask[:, :, :seq_len, :encoder_seq_len]
@@
         self_attn_mask_mapping = {
             "full_attention": full_attn_mask,
             "sliding_attention": sliding_attn_mask,
-            "encoder_attention_mask": encoder_attention_mask,
+            "encoder_attention_mask": encoder_attn_mask,
         }
acestep/models/base/configuration_acestep_v15.py-1-15 (1)

1-15: 🛠️ Refactor suggestion | 🟠 Major

Module exceeds 200 LOC; split or add a follow‑up plan.

This configuration module is well beyond the hard cap; please split by responsibility or document a concrete follow‑up split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/models/turbo/modeling_acestep_v15_turbo.py-1-20 (1)

1-20: 🛠️ Refactor suggestion | 🟠 Major

Module exceeds 200 LOC; split by responsibility or add a follow‑up plan.

This file is far beyond the hard cap; please split into focused modules or document a concrete follow‑up split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/models/sft/configuration_acestep_v15.py-1-15 (1)

1-15: 🛠️ Refactor suggestion | 🟠 Major

Module exceeds 200 LOC; split or add a follow‑up plan.

This configuration module is beyond the hard cap; please split by responsibility or document a concrete follow‑up split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/models/turbo/configuration_acestep_v15.py-1-15 (1)

1-15: 🛠️ Refactor suggestion | 🟠 Major

Module exceeds 200 LOC; split or add a follow‑up plan.

This configuration module is beyond the hard cap; please split by responsibility or document a concrete follow‑up split plan in PR notes.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/models/turbo/modeling_acestep_v15_turbo.py-333-346 (1)

333-346: ⚠️ Potential issue | 🟠 Major

Guard cache updates when position_embeddings is None.

If position_embeddings is omitted and cache is enabled, sin/cos are undefined and will raise at runtime.

🐛 Suggested guard
-            if past_key_value is not None:
-                # Sin and cos are specific to RoPE models; cache_position needed for the static cache
-                cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
-                key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+            if past_key_value is not None:
+                if position_embeddings is None:
+                    raise ValueError("position_embeddings must be provided when using cache with RoPE.")
+                # Sin and cos are specific to RoPE models; cache_position needed for the static cache
+                cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+                key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
acestep/models/turbo/modeling_acestep_v15_turbo.py-1705-1716 (1)

1705-1716: ⚠️ Potential issue | 🟠 Major

Use the sampled r for timestep_r.

sample_t_r returns both t and r, but t is passed twice, which defeats the flow‑matching setup.

🐛 Suggested fix
-        decoder_outputs = self.decoder(
+        decoder_outputs = self.decoder(
             hidden_states=xt,
             timestep=t,
-            timestep_r=t,
+            timestep_r=r,
             attention_mask=attention_mask,
             encoder_hidden_states=encoder_hidden_states,
             encoder_attention_mask=encoder_attention_mask,
             context_latents=context_latents,
         )
acestep/models/turbo/modeling_acestep_v15_turbo.py-510-535 (1)

510-535: ⚠️ Potential issue | 🟠 Major

Avoid UnboundLocalError when cross‑attention is disabled.

cross_attn_weights is only defined when use_cross_attention=True, but you still append it when output_attentions=True.

🐛 Suggested fix
-        if self.use_cross_attention:
+        cross_attn_weights = None
+        if self.use_cross_attention:
             norm_hidden_states = self.cross_attn_norm(hidden_states).type_as(hidden_states)
             attn_output, cross_attn_weights = self.cross_attn(
@@
-        if output_attentions:
-            outputs += (self_attn_weights, cross_attn_weights)
+        if output_attentions:
+            outputs += (self_attn_weights, cross_attn_weights)
acestep/models/turbo/modeling_acestep_v15_turbo.py-1378-1425 (1)

1378-1425: ⚠️ Potential issue | 🟠 Major

Don’t drop provided attention masks.

attention_mask/encoder_attention_mask are overwritten to None, which discards padding info and can leak attention into padded tokens.

🐛 Suggested fix
-        encoder_attention_mask = None
-        attention_mask = None
+        # keep provided masks
acestep/models/sft/modeling_acestep_v15_base.py-1-20 (1)

1-20: 🛠️ Refactor suggestion | 🟠 Major

File exceeds 200 LOC hard cap — split by responsibility or document a concrete follow‑up plan.
This new module is ~2k+ LOC and bundles attention utilities, encoders, tokenization, diffusion, generation, and test harness. Please split into focused modules (or add a PR note with a concrete split plan) before merge.

As per coding guidelines, "Target module size: optimal <= 150 LOC, hard cap <= 200 LOC; if exceeded, split by responsibility before merging or justify in PR notes with a concrete follow-up split plan".

acestep/models/sft/modeling_acestep_v15_base.py-1837-1857 (1)

1837-1857: ⚠️ Potential issue | 🟠 Major

Non‑cover branch can crash when non_cover_text_* isn’t provided.
If audio_cover_strength < 1.0 and non_cover_text_hidden_states is None (default), text_projector will error.

🐛 Suggested fix
         if audio_cover_strength < 1.0:
             non_is_covers = torch.zeros_like(is_covers, device=is_covers.device, dtype=is_covers.dtype)
+            if non_cover_text_hidden_states is None:
+                non_cover_text_hidden_states = text_hidden_states
+                non_cover_text_attention_mask = text_attention_mask
             # Use silence_latent for non-cover condition to simulate text2music mode (no reference audio)
             silence_latent_expanded = silence_latent[:, :src_latents.shape[1], :].expand(src_latents.shape[0], -1, -1)
acestep/models/sft/modeling_acestep_v15_base.py-1708-1718 (1)

1708-1718: ⚠️ Potential issue | 🟠 Major

Flow‑matching samples r but the decoder always receives t.
r is computed yet ignored; this likely alters the intended objective.

🐛 Suggested fix
-        t, r = sample_t_r(bsz, device, dtype, self.config.data_proportion, self.config.timestep_mu, self.config.timestep_sigma, use_meanflow=False)
+        t, r = sample_t_r(bsz, device, dtype, self.config.data_proportion, self.config.timestep_mu, self.config.timestep_sigma, use_meanflow=False)
@@
-            timestep_r=t,
+            timestep_r=r,
acestep/models/sft/modeling_acestep_v15_base.py-513-538 (1)

513-538: ⚠️ Potential issue | 🟠 Major

cross_attn_weights can be undefined when use_cross_attention=False.
When output_attentions=True and cross-attention is disabled, this raises UnboundLocalError.

🐛 Suggested fix
-        # Step 2: Cross-attention (if enabled) for conditioning on encoder outputs
-        if self.use_cross_attention:
+        cross_attn_weights = None
+        # Step 2: Cross-attention (if enabled) for conditioning on encoder outputs
+        if self.use_cross_attention:
             norm_hidden_states = self.cross_attn_norm(hidden_states).type_as(hidden_states)
             attn_output, cross_attn_weights = self.cross_attn(
                 hidden_states=norm_hidden_states,
                 encoder_hidden_states=encoder_hidden_states,
                 attention_mask=encoder_attention_mask,
                 past_key_value=past_key_value,
                 output_attentions=output_attentions,
                 use_cache=use_cache,
                 **kwargs,
             )
             # Standard residual connection for cross-attention
             hidden_states = hidden_states + attn_output
@@
-        if output_attentions:
-            outputs += (self_attn_weights, cross_attn_weights)
+        if output_attentions:
+            outputs += (self_attn_weights,) if cross_attn_weights is None else (self_attn_weights, cross_attn_weights)
acestep/models/sft/modeling_acestep_v15_base.py-1381-1448 (1)

1381-1448: ⚠️ Potential issue | 🟠 Major

attention_mask / encoder_attention_mask are overwritten, dropping padding.
The incoming masks are set to None, so padding is ignored and cross‑attention masks are built from None.

🐛 Suggested fix
-        encoder_attention_mask = None
-        attention_mask = None
+        # Preserve incoming attention_mask/encoder_attention_mask for padding-aware masks
+        encoder_attention_mask_4d = encoder_attention_mask
         if is_flash_attn:
@@
-            full_attn_mask = attention_mask
+            full_attn_mask = attention_mask
@@
-            sliding_attn_mask = attention_mask if self.config.use_sliding_window else None
+            sliding_attn_mask = attention_mask if self.config.use_sliding_window else None
+            encoder_attention_mask_4d = encoder_attention_mask
@@
-            encoder_attention_mask = create_4d_mask(
+            encoder_attention_mask_4d = None
+            if encoder_attention_mask is not None:
+                encoder_attention_mask_4d = create_4d_mask(
                 seq_len=max_len,
                 dtype=dtype,
                 device=device,
-                attention_mask=attention_mask,     # [B, L]
+                attention_mask=encoder_attention_mask,     # [B, L]
                 sliding_window=None,
                 is_sliding_window=False,
                is_causal=False                    # <--- key: bidirectional attention
-            )
-            encoder_attention_mask = encoder_attention_mask[:, :, :seq_len, :encoder_seq_len]
+                )
+                encoder_attention_mask_4d = encoder_attention_mask_4d[:, :, :seq_len, :encoder_seq_len]
@@
-        self_attn_mask_mapping = {
+        self_attn_mask_mapping = {
             "full_attention": full_attn_mask,
             "sliding_attention": sliding_attn_mask,
-            "encoder_attention_mask": encoder_attention_mask,
+            "encoder_attention_mask": encoder_attention_mask_4d,
         }
🟡 Minor comments (26)
acestep/models/sft/apg_guidance.py-168-168 (1)

168-168: ⚠️ Potential issue | 🟡 Minor

Unused variable proj.

The proj variable from compute_perpendicular_component is unpacked but never used. Use underscore prefix to indicate intentional discard.

Proposed fix
-    proj, perp = compute_perpendicular_component(latent_diff, latent_hat_uncond)
+    _proj, perp = compute_perpendicular_component(latent_diff, latent_hat_uncond)
acestep/models/sft/apg_guidance.py-74-76 (1)

74-76: ⚠️ Potential issue | 🟡 Minor

Potential division by zero when normalizing.

If either tensor has zero norm, division by zero will produce nan/inf. Consider adding epsilon for numerical stability.

Proposed fix
-    tensor1 = tensor1 / torch.linalg.norm(tensor1, dim=1, keepdim=True)
-    tensor2 = tensor2 / torch.linalg.norm(tensor2, dim=1, keepdim=True)
+    eps = 1e-8
+    tensor1 = tensor1 / (torch.linalg.norm(tensor1, dim=1, keepdim=True) + eps)
+    tensor2 = tensor2 / (torch.linalg.norm(tensor2, dim=1, keepdim=True) + eps)
acestep/models/base/apg_guidance.py-183-204 (1)

183-204: ⚠️ Potential issue | 🟡 Minor

Add return type hint and correct sigma type annotation for adg_w_norm_forward.

The wrapped function adg_forward accepts both float and torch.Tensor for sigma (lines 142–153), yet the wrapper's type hint incorrectly declares sigma: float. Add the return type hint -> torch.Tensor and broaden the sigma type to torch.Tensor | float to match the actual behavior.

Suggested update
 def adg_w_norm_forward(
     latents: torch.Tensor,
     noise_pred_cond: torch.Tensor,
     noise_pred_uncond: torch.Tensor,
-    sigma: float,
+    sigma: torch.Tensor | float,
     guidance_scale: float,
     angle_clip: float = 3.14 / 3,
-):
+) -> torch.Tensor:
acestep/models/base/apg_guidance.py-207-220 (1)

207-220: ⚠️ Potential issue | 🟡 Minor

Add return type hint and broaden sigma type for adg_wo_clip_forward to match adg_forward's actual accepted types.

The adg_forward function accepts both torch.Tensor and numeric types for sigma (lines 142–153), while this wrapper only declares sigma: float. Add the return type hint and broaden the parameter type for consistency.

🧩 Proposed update
 def adg_wo_clip_forward(
     latents: torch.Tensor,
     noise_pred_cond: torch.Tensor,
     noise_pred_uncond: torch.Tensor,
-    sigma: float,
+    sigma: torch.Tensor | float,
     guidance_scale: float,
-):
+) -> torch.Tensor:
acestep/models/base/apg_guidance.py-80-104 (1)

80-104: ⚠️ Potential issue | 🟡 Minor

Add type hints, document the ValueError in docstring, and use reshape instead of view for safer tensor operations.

The function is missing type annotations on parameters and return type, and the docstring doesn't document the raised ValueError. Additionally, .view() can fail on non-contiguous tensors; use .reshape() consistently for all tensor reshaping operations.

🧩 Proposed update
-def compute_perpendicular_component(latent_diff, latent_hat_uncond):
+def compute_perpendicular_component(
+    latent_diff: torch.Tensor,
+    latent_hat_uncond: torch.Tensor,
+) -> tuple[torch.Tensor, torch.Tensor]:
     """
     Decompose latent_diff into parallel and perpendicular components relative to latent_hat_uncond.
@@
     Returns:
         projection: Parallel component
         perpendicular_component: Perpendicular component
+    Raises:
+        ValueError: If input tensors have mismatched shapes.
     """
     n, t, c = latent_diff.shape
-    latent_diff = latent_diff.view(n * t, c).float()
-    latent_hat_uncond = latent_hat_uncond.view(n * t, c).float()
+    latent_diff = latent_diff.reshape(n * t, c).float()
+    latent_hat_uncond = latent_hat_uncond.reshape(n * t, c).float()
acestep/models/base/apg_guidance.py-63-77 (1)

63-77: ⚠️ Potential issue | 🟡 Minor

Add type hints and use F.cosine_similarity for numerical stability in call_cos_tensor.

The function lacks type hints, which the coding guidelines require for new/modified functions. Additionally, manually dividing by norm creates a divide-by-zero vulnerability that produces NaNs for zero-norm inputs. Using F.cosine_similarity(..., eps=...) is clearer, handles this edge case safely, and maintains the same output shape and values.

🧩 Proposed update
-def call_cos_tensor(tensor1, tensor2):
+def call_cos_tensor(
+    tensor1: torch.Tensor,
+    tensor2: torch.Tensor,
+    eps: float = 1e-8,
+) -> torch.Tensor:
     """
     Calculate cosine similarity between two normalized tensors.
@@
-    tensor1 = tensor1 / torch.linalg.norm(tensor1, dim=1, keepdim=True)
-    tensor2 = tensor2 / torch.linalg.norm(tensor2, dim=1, keepdim=True)
-    cosvalue = torch.sum(tensor1 * tensor2, dim=1, keepdim=True)
-    return cosvalue
+    return F.cosine_similarity(tensor1, tensor2, dim=1, eps=eps).unsqueeze(1)
acestep/models/base/modeling_acestep_v15_base.py-2066-2069 (1)

2066-2069: ⚠️ Potential issue | 🟡 Minor

Ruff F841/F811 in test harness (unused vars, duplicate import).

cur_hidden_states/cur_attention_mask are assigned but never used, and os is re-imported in __main__. These will fail Ruff if enforced; remove them or prefix with _, and drop the duplicate import.

🛠️ Proposed fix
@@
-        cur_hidden_states = torch.randn(2, seq_len, 64, dtype=model_dtype, device=device)
-        cur_attention_mask = torch.ones(2, seq_len, dtype=model_dtype, device=device)
+        # cur_hidden_states / cur_attention_mask removed (unused)
@@
-    import os

Also applies to: 2126-2134

acestep/models/base/modeling_acestep_v15_base.py-1931-1941 (1)

1931-1941: ⚠️ Potential issue | 🟡 Minor

use_cache parameter is ignored in the decoder call.

You always pass use_cache=True, so callers cannot disable caching. Honor the argument or drop it from the API.

🛠️ Proposed fix
@@
-                    use_cache=True,
+                    use_cache=use_cache,
acestep/models/base/modeling_acestep_v15_base.py-513-538 (1)

513-538: ⚠️ Potential issue | 🟡 Minor

Guard cross_attn_weights when cross‑attention is disabled.

If use_cross_attention=False and output_attentions=True, cross_attn_weights is undefined and will raise. Initialize it to None and/or only append when cross‑attn is enabled.

🛠️ Proposed fix
@@
-        # Step 2: Cross-attention (if enabled) for conditioning on encoder outputs
-        if self.use_cross_attention:
+        cross_attn_weights = None
+        # Step 2: Cross-attention (if enabled) for conditioning on encoder outputs
+        if self.use_cross_attention:
             norm_hidden_states = self.cross_attn_norm(hidden_states).type_as(hidden_states)
             attn_output, cross_attn_weights = self.cross_attn(
                 hidden_states=norm_hidden_states,
@@
         outputs = (hidden_states,)
         if output_attentions:
             outputs += (self_attn_weights, cross_attn_weights)
acestep/models/base/modeling_acestep_v15_base.py-1454-1463 (1)

1454-1463: ⚠️ Potential issue | 🟡 Minor

enable_early_exit currently has no effect.

max_needed_layer is computed but never used to truncate the loop, so early exit is silently ignored. Implement the break or remove the flag/config.

🛠️ Proposed fix
@@
-        max_needed_layer = float('inf')
-        if custom_layers_config is not None and enable_early_exit:
-            max_needed_layer = max(custom_layers_config.keys())
+        max_needed_layer = None
+        if custom_layers_config is not None and enable_early_exit:
+            max_needed_layer = max(custom_layers_config.keys())
@@
-        for index_block, layer_module in enumerate(self.layers):
+        for index_block, layer_module in enumerate(self.layers):
+            if max_needed_layer is not None and index_block > max_needed_layer:
+                break
acestep/models/turbo/configuration_acestep_v15.py-148-189 (1)

148-189: ⚠️ Potential issue | 🟡 Minor

Add a function docstring and type hints for __init__.

AceStepConfig.__init__ is new but lacks a concise function docstring and typing for its parameters/return.

As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant" and "Add type hints for new/modified functions when practical in Python".
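
A hedged sketch of the requested style rather than the actual AceStepConfig; the parameter names below are illustrative (only fsq_input_levels and the PretrainedConfig base come from this review):

from typing import List, Optional
from transformers import PretrainedConfig

class AceStepConfigSketch(PretrainedConfig):
    """Illustrative config showing a documented, type-hinted __init__."""

    model_type = "acestep_sketch"

    def __init__(
        self,
        hidden_size: int = 1024,
        num_hidden_layers: int = 24,
        fsq_input_levels: Optional[List[int]] = None,
        **kwargs,
    ) -> None:
        """Initialize the config.

        Args:
            hidden_size: Width of the transformer hidden states.
            num_hidden_layers: Number of decoder layers.
            fsq_input_levels: FSQ quantizer levels; defaults to [8, 8, 8, 5, 5, 5]
                when None, avoiding a mutable default argument.
            **kwargs: Forwarded to PretrainedConfig.
        """
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.fsq_input_levels = list(fsq_input_levels) if fsq_input_levels is not None else [8, 8, 8, 5, 5, 5]
        super().__init__(**kwargs)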

acestep/models/base/configuration_acestep_v15.py-148-189 (1)

148-189: ⚠️ Potential issue | 🟡 Minor

Add a function docstring and type hints for __init__.

AceStepConfig.__init__ is new but lacks a concise function docstring and typing for its parameters/return.

As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant" and "Add type hints for new/modified functions when practical in Python".

acestep/models/sft/configuration_acestep_v15.py-148-189 (1)

148-189: ⚠️ Potential issue | 🟡 Minor

Add a function docstring and type hints for __init__.

AceStepConfig.__init__ is new but lacks a concise function docstring and typing for its parameters/return.

As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant" and "Add type hints for new/modified functions when practical in Python".

acestep/models/turbo/modeling_acestep_v15_turbo.py-615-617 (1)

615-617: ⚠️ Potential issue | 🟡 Minor

Fix the lyric‑encoder assertion message.

The assertion checks input_ids is None but the message says only input_ids is supported.

✏️ Suggested fix
-        assert input_ids is None, "Only `input_ids` is supported for the lyric encoder."
+        assert input_ids is None, "Only `inputs_embeds` is supported for the lyric encoder."
acestep/models/turbo/modeling_acestep_v15_turbo.py-1635-1637 (1)

1635-1637: ⚠️ Potential issue | 🟡 Minor

Replace print with structured logging.

Avoid print debugging in committed code; use the module logger instead.

🔧 Suggested fix
-            print("Using precomputed LM hints")
+            logger.debug("Using precomputed LM hints")

As per coding guidelines, "Keep logging actionable in Python; avoid noisy logs and print debugging in committed code".

acestep/models/base/configuration_acestep_v15.py-150-216 (1)

150-216: ⚠️ Potential issue | 🟡 Minor

Avoid a mutable default for fsq_input_levels.

Using a list as a default argument can create shared state across instances.

🛠️ Suggested fix
-        fsq_input_levels=[8, 8, 8, 5, 5, 5],
+        fsq_input_levels=None,
@@
-        self.fsq_input_levels = fsq_input_levels
+        if fsq_input_levels is None:
+            fsq_input_levels = [8, 8, 8, 5, 5, 5]
+        self.fsq_input_levels = list(fsq_input_levels)

As per coding guidelines, "Avoid hidden state and unintended side effects in Python code".

acestep/models/sft/configuration_acestep_v15.py-150-216 (1)

150-216: ⚠️ Potential issue | 🟡 Minor

Avoid a mutable default for fsq_input_levels.

Using a list as a default argument can create shared state across instances.

🛠️ Suggested fix
-        fsq_input_levels=[8, 8, 8, 5, 5, 5],
+        fsq_input_levels=None,
@@
-        self.fsq_input_levels = fsq_input_levels
+        if fsq_input_levels is None:
+            fsq_input_levels = [8, 8, 8, 5, 5, 5]
+        self.fsq_input_levels = list(fsq_input_levels)

As per coding guidelines, "Avoid hidden state and unintended side effects in Python code".

acestep/models/turbo/configuration_acestep_v15.py-150-216 (1)

150-216: ⚠️ Potential issue | 🟡 Minor

Avoid a mutable default for fsq_input_levels.

Using a list as a default argument can create shared state across instances.

🛠️ Suggested fix
-        fsq_input_levels=[8, 8, 8, 5, 5, 5],
+        fsq_input_levels=None,
@@
-        self.fsq_input_levels = fsq_input_levels
+        if fsq_input_levels is None:
+            fsq_input_levels = [8, 8, 8, 5, 5, 5]
+        self.fsq_input_levels = list(fsq_input_levels)

As per coding guidelines, "Avoid hidden state and unintended side effects in Python code".

acestep/models/turbo/modeling_acestep_v15_turbo.py-1577-1806 (1)

1577-1806: ⚠️ Potential issue | 🟡 Minor

Add docstrings and type hints to public methods.

tokenize, prepare_condition, get_x0_from_noise, renoise, and generate_audio lack docstrings. Additionally, tokenize, get_x0_from_noise, and renoise omit parameter type hints and return types; generate_audio and prepare_condition lack return type annotations.

Per coding guidelines, docstrings are mandatory for all new or modified functions (including purpose, key inputs/outputs, and exceptions), and type hints should be added for new/modified functions when practical.

acestep/models/sft/modeling_acestep_v15_base.py-1527-1554 (1)

1527-1554: ⚠️ Potential issue | 🟡 Minor

Add a docstring for AceStepConditionEncoder.forward (and other new public methods).
Several new public methods lack docstrings; please add concise purpose/inputs/outputs/exceptions.

As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant".

acestep/models/sft/modeling_acestep_v15_base.py-1-13 (1)

1-13: ⚠️ Potential issue | 🟡 Minor

Add a module docstring for this new file.
A concise module docstring is required for new modules.

✍️ Suggested module docstring
+# NOTE: keep the license header above.
+"""AceStep v1.5 base modeling components (attention, encoders, diffusion, tokenization)."""
As per coding guidelines, "Docstrings are mandatory for all new or modified Python modules, classes, and functions; must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant".
acestep/models/sft/modeling_acestep_v15_base.py-172-197 (1)

172-197: ⚠️ Potential issue | 🟡 Minor

Add type hints + return type for sample_t_r.
This function is new/modified and lacks practical type hints.

✍️ Suggested fix
-def sample_t_r(batch_size, device, dtype, data_proportion=0.0, timestep_mu=-0.4, timestep_sigma=1.0, use_meanflow=True):
+def sample_t_r(
+    batch_size: int,
+    device: torch.device,
+    dtype: torch.dtype,
+    data_proportion: float = 0.0,
+    timestep_mu: float = -0.4,
+    timestep_sigma: float = 1.0,
+    use_meanflow: bool = True,
+) -> tuple[torch.Tensor, torch.Tensor]:
As per coding guidelines, "Add type hints for new/modified functions when practical in Python".
acestep/models/sft/modeling_acestep_v15_base.py-1454-1460 (1)

1454-1460: ⚠️ Potential issue | 🟡 Minor

max_needed_layer is unused; early‑exit never happens (ruff F841).
Either remove it or use it to stop the layer loop.

🧹 Suggested fix
-        for index_block, layer_module in enumerate(self.layers):
+        for index_block, layer_module in enumerate(self.layers):
+            if index_block > max_needed_layer:
+                break
acestep/model_downloader.py-56-115 (1)

56-115: ⚠️ Potential issue | 🟡 Minor

Add type hints for checkpoints_dir in the new sync helpers.
These functions are new and can be annotated cleanly.

✍️ Suggested fix
-from typing import Optional, List, Dict, Tuple
+from typing import Optional, List, Dict, Tuple, Union
@@
-def _check_code_mismatch(model_name: str, checkpoints_dir) -> List[str]:
+def _check_code_mismatch(model_name: str, checkpoints_dir: Union[Path, str]) -> List[str]:
@@
-def _sync_model_code_files(model_name: str, checkpoints_dir) -> List[str]:
+def _sync_model_code_files(model_name: str, checkpoints_dir: Union[Path, str]) -> List[str]:
As per coding guidelines, "Add type hints for new/modified functions when practical in Python".
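
If helpful, a hedged sketch of what the annotated helper could look like once it normalizes the argument. The directory layout and the hash-based comparison below are assumptions drawn from the PR description, not the actual implementation.

import hashlib
from pathlib import Path
from typing import List, Union


def _file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def _check_code_mismatch(model_name: str, checkpoints_dir: Union[Path, str]) -> List[str]:
    """Return names of .py files that differ between the source models dir and the checkpoint copy."""
    checkpoints_dir = Path(checkpoints_dir)  # accept both str and Path callers
    src_dir = Path(__file__).parent / "models" / model_name   # assumed layout
    dst_dir = checkpoints_dir / model_name                    # assumed layout
    mismatched: List[str] = []
    for src_file in sorted(src_dir.glob("*.py")):
        dst_file = dst_dir / src_file.name
        if not dst_file.exists() or _file_sha256(src_file) != _file_sha256(dst_file):
            mismatched.append(src_file.name)
    return mismatched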
acestep/models/sft/modeling_acestep_v15_base.py-1638-1640 (1)

1638-1640: ⚠️ Potential issue | 🟡 Minor

Replace print with logger to avoid debug noise.
print() in committed code is noisy and hard to control in production.

🧼 Suggested fix
-            print("Using precomputed LM hints")
+            logger.info("Using precomputed LM hints")
As per coding guidelines, "Keep logging actionable in Python; avoid noisy logs and `print` debugging in committed code".
acestep/handler.py-616-625 (1)

616-625: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f prefix in log string.

The f-string has no placeholders, so Ruff is right to warn.

🔧 Suggested fix
-                logger.info(f"[initialize_service] Model code files synced successfully.")
+                logger.info("[initialize_service] Model code files synced successfully.")
🧹 Nitpick comments (11)
acestep/models/sft/apg_guidance.py (6)

1-14: Unused import and missing class docstring.

  • Line 2: torch.nn.functional as F is imported but never used.
  • MomentumBuffer class is missing a docstring describing its purpose and usage.
Proposed fix
 import torch
-import torch.nn.functional as F


 class MomentumBuffer:
+    """
+    Accumulates momentum-weighted running average of update values for APG guidance.
+
+    Args:
+        momentum: Momentum coefficient (default: -0.75, negative for dampening).
+    """

     def __init__(self, momentum: float = -0.75):

16-30: Mutable default argument and missing docstring.

dims=[-1] is a mutable default argument. This particular usage is read-only, so it won't cause a bug here, but mutable defaults are a Python anti-pattern that static analyzers flag (B006). Replace it with None and set the default inside the function for consistency; a standalone demo of the pitfall follows the proposed fix below.

The function also lacks a docstring explaining its purpose (projection decomposition).

Proposed fix
 def project(
     v0: torch.Tensor,  # [B, C, T]
     v1: torch.Tensor,  # [B, C, T]
-    dims=[-1],
-):
+    dims: list[int] | None = None,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    """
+    Project v0 onto v1, returning parallel and orthogonal components.
+
+    Args:
+        v0: Source tensor [B, C, T]
+        v1: Direction tensor [B, C, T]
+        dims: Dimensions for normalization (default: [-1])
+
+    Returns:
+        Tuple of (parallel component, orthogonal component)
+    """
+    if dims is None:
+        dims = [-1]
     dtype = v0.dtype
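
For context, a standalone demonstration of why B006 flags mutable defaults even when the current call sites are read-only; this snippet is unrelated to the APG code itself.

def append_item(item, bucket=[]):  # anti-pattern: the default list is created once and shared
    bucket.append(item)
    return bucket


print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]  <- state leaked from the first call


def append_item_safe(item, bucket=None):
    if bucket is None:
        bucket = []  # a fresh list on every call
    bucket.append(item)
    return bucket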

33-56: Missing docstring and mutable default.

Same mutable default issue on dims=[-1]. More importantly, this function implements the APG (Adaptive Projected Guidance) algorithm but lacks documentation explaining the algorithm, its parameters, and expected behavior; a reference sketch of the standard formulation follows the proposed fix below.

Proposed fix
 def apg_forward(
     pred_cond: torch.Tensor,  # [B, C, T]
     pred_uncond: torch.Tensor,  # [B, C, T]
     guidance_scale: float,
     momentum_buffer: MomentumBuffer = None,
     eta: float = 0.0,
     norm_threshold: float = 2.5,
-    dims=[-1],
-):
+    dims: list[int] | None = None,
+) -> torch.Tensor:
+    """
+    APG (Adaptive Projected Guidance) forward pass.
+
+    Args:
+        pred_cond: Conditional prediction [B, C, T]
+        pred_uncond: Unconditional prediction [B, C, T]
+        guidance_scale: Guidance strength multiplier
+        momentum_buffer: Optional momentum buffer for smoothing
+        eta: Parallel component weight (default: 0.0)
+        norm_threshold: Maximum norm for diff scaling (default: 2.5)
+        dims: Dimensions for projection (default: [-1])
+
+    Returns:
+        Guided prediction tensor [B, C, T]
+    """
+    if dims is None:
+        dims = [-1]
     diff = pred_cond - pred_uncond
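
For readers unfamiliar with the algorithm, below is a self-contained sketch of the standard APG formulation (momentum smoothing, norm clamping, and re-projection of the guidance update). It is a reference for the docstring wording above, not this repository's exact implementation, which may differ in dtype handling and defaults.

import torch


class MomentumBuffer:
    """Running, momentum-weighted average of the guidance difference."""

    def __init__(self, momentum: float = -0.75):
        self.momentum = momentum
        self.running_average = 0.0

    def update(self, update_value: torch.Tensor) -> None:
        self.running_average = update_value + self.momentum * self.running_average


def project(v0: torch.Tensor, v1: torch.Tensor, dims=None):
    """Split v0 into components parallel and orthogonal to v1."""
    if dims is None:
        dims = [-1]
    v1 = v1 / v1.norm(p=2, dim=dims, keepdim=True).clamp_min(1e-12)
    v0_parallel = (v0 * v1).sum(dim=dims, keepdim=True) * v1
    return v0_parallel, v0 - v0_parallel


def apg_forward(pred_cond, pred_uncond, guidance_scale, momentum_buffer=None,
                eta=0.0, norm_threshold=2.5, dims=None):
    """Adaptive Projected Guidance: clamp and re-project the CFG update direction."""
    if dims is None:
        dims = [-1]
    diff = pred_cond - pred_uncond
    if momentum_buffer is not None:
        momentum_buffer.update(diff)
        diff = momentum_buffer.running_average
    if norm_threshold > 0:
        diff_norm = diff.norm(p=2, dim=dims, keepdim=True)
        diff = diff * torch.minimum(torch.ones_like(diff_norm), norm_threshold / diff_norm)
    diff_parallel, diff_orthogonal = project(diff, pred_cond, dims)
    return pred_cond + (guidance_scale - 1.0) * (diff_orthogonal + eta * diff_parallel)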

59-60: Missing type hints and docstring.

This function lacks type annotations and a docstring. While simple, it should document what CFG (Classifier-Free Guidance) does.

Proposed fix
-def cfg_forward(cond_output, uncond_output, cfg_strength):
+def cfg_forward(
+    cond_output: torch.Tensor,
+    uncond_output: torch.Tensor,
+    cfg_strength: float,
+) -> torch.Tensor:
+    """Apply Classifier-Free Guidance: lerp from uncond toward cond."""
     return uncond_output + cfg_strength * (cond_output - uncond_output)

111-111: Type hint doesn't reflect actual accepted types.

The signature declares sigma: torch.Tensor but the function body handles int, float, and torch.Tensor (lines 142-153). Update the type hint to reflect this.

Proposed fix
-    sigma: torch.Tensor,
+    sigma: torch.Tensor | float | int,
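
A hedged illustration of the coercion that would back the broader hint; this helper is hypothetical and only mirrors what lines 142-153 reportedly do.

from typing import Union

import torch


def _as_sigma_tensor(sigma: Union[torch.Tensor, float, int], like: torch.Tensor) -> torch.Tensor:
    """Coerce a scalar or tensor sigma onto the device/dtype of `like`."""
    if isinstance(sigma, (int, float)):
        sigma = torch.tensor(sigma)
    return sigma.to(device=like.device, dtype=like.dtype)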

183-220: Type hints for sigma inconsistent with adg_forward.

Both wrapper functions declare sigma: float but delegate to adg_forward which accepts torch.Tensor | float | int. For API consistency, consider matching the broader type.

acestep/models/base/apg_guidance.py (1)

59-60: Consider adding docstring and type hints to cfg_forward if modifying this function.

Based on coding guidelines, docstrings and type hints are mandatory for new or modified functions. The current code lacks both. If you're touching this function, update it as follows:

Suggested update
-def cfg_forward(cond_output, uncond_output, cfg_strength):
+def cfg_forward(
+    cond_output: torch.Tensor,
+    uncond_output: torch.Tensor,
+    cfg_strength: float,
+) -> torch.Tensor:
+    """Classic CFG blend of conditional and unconditional outputs."""
     return uncond_output + cfg_strength * (cond_output - uncond_output)
acestep/models/turbo/modeling_acestep_v15_turbo.py (2)

169-194: Add type hints for sample_t_r.

This helper is new; typing improves readability and tooling support.

🔧 Example typing
-def sample_t_r(batch_size, device, dtype, data_proportion=0.0, timestep_mu=-0.4, timestep_sigma=1.0, use_meanflow=True):
+def sample_t_r(
+    batch_size: int,
+    device: torch.device,
+    dtype: torch.dtype,
+    data_proportion: float = 0.0,
+    timestep_mu: float = -0.4,
+    timestep_sigma: float = 1.0,
+    use_meanflow: bool = True,
+) -> tuple[torch.Tensor, torch.Tensor]:

As per coding guidelines, "Add type hints for new/modified functions when practical in Python".


2004-2156: Consider moving the test harness to tests/examples.

Keeping heavyweight demo code out of the model module improves maintainability and keeps runtime entrypoints clean.
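
One hedged way to relocate it, assuming a pytest-based layout; the module path comes from this PR, but the test file name and the slow marker are placeholders that would need to match the project's actual test conventions.

# tests/test_turbo_model_smoke.py (illustrative placement)
import pytest

torch = pytest.importorskip("torch")
turbo = pytest.importorskip("acestep.models.turbo.modeling_acestep_v15_turbo")
config_mod = pytest.importorskip("acestep.models.turbo.configuration_acestep_v15")


@pytest.mark.slow
def test_turbo_module_exposes_config():
    """Smoke test: the turbo modeling module imports and its config class is reachable."""
    assert hasattr(config_mod, "AceStepConfig")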

acestep/inference.py (1)

72-81: Document cover_noise_strength in GenerationParams docstring.
This new parameter is user‑facing; please add it to the attribute list.

✍️ Suggested docstring update
         # Task-Specific Parameters
         task_type: Type of generation task. One of: "text2music", "cover", "repaint", "lego", "extract", "complete".
         reference_audio: Path to a reference audio file for style transfer or cover tasks.
         src_audio: Path to a source audio file for audio-to-audio tasks.
         audio_codes: Audio semantic codes as a string (advanced use, for code-control generation).
         repainting_start: For repaint/lego tasks: start time in seconds for region to repaint.
         repainting_end: For repaint/lego tasks: end time in seconds for region to repaint (-1 for until end).
         audio_cover_strength: Strength of reference audio/codes influence (range 0.0–1.0). set smaller (0.2) for style transfer tasks.
+        cover_noise_strength: Strength of source-audio influence on the initial noise (range 0.0–1.0; 0 = pure noise / no cover, 1 = closest to source).
         instruction: Optional task instruction prompt. If empty, auto-generated by system.
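
Once the attribute is documented, a call-site sketch might look like the following. Field names follow the docstring above; the file path is a placeholder, and any required prompt/lyrics fields are omitted for brevity, so treat this as illustrative rather than a complete request.

from acestep.inference import GenerationParams

# Illustrative cover-style request: start denoising from noise that is
# halfway between pure noise and the source audio.
params = GenerationParams(
    task_type="cover",
    src_audio="input/song.wav",   # hypothetical path
    audio_cover_strength=0.6,
    cover_noise_strength=0.5,     # 0 = pure noise / no cover, 1 = closest to source
)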
acestep/gradio_ui/events/generation_handlers.py (1)

723-723: Use explicit Optional type for nullable parameter.

The current_mode parameter can be None but the type hint only specifies str. Per PEP 484, this should use an explicit Optional or union type to accurately describe the expected types.

🔧 Proposed fix
-def get_model_type_ui_settings(is_turbo: bool, current_mode: str = None, is_pure_base: bool = False):
+def get_model_type_ui_settings(is_turbo: bool, current_mode: Optional[str] = None, is_pure_base: bool = False):

Note: Optional is already imported from typing at the top of the file.

Comment on lines +1642 to +1645
if audio_codes is not None:
lm_hints_5Hz = self.tokenize.quantizer.get_output_from_indices(audio_codes)
else:
lm_hints_5Hz, indices, llm_mask = self.tokenize(hidden_states, silence_latent, attention_mask)

⚠️ Potential issue | 🔴 Critical

self.tokenize is a method — this will raise AttributeError.
The quantizer lives on self.tokenizer, not on the tokenize method.

🐛 Suggested fix
-                lm_hints_5Hz = self.tokenize.quantizer.get_output_from_indices(audio_codes)
+                lm_hints_5Hz = self.tokenizer.quantizer.get_output_from_indices(audio_codes)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if audio_codes is not None:
lm_hints_5Hz = self.tokenize.quantizer.get_output_from_indices(audio_codes)
else:
lm_hints_5Hz, indices, llm_mask = self.tokenize(hidden_states, silence_latent, attention_mask)
if audio_codes is not None:
lm_hints_5Hz = self.tokenizer.quantizer.get_output_from_indices(audio_codes)
else:
lm_hints_5Hz, indices, llm_mask = self.tokenize(hidden_states, silence_latent, attention_mask)
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 1645-1645: Unpacked variable indices is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 1645-1645: Unpacked variable llm_mask is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)
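
While fixing the attribute access, the two RUF059 warnings above can be resolved in the same edit by underscoring the unused names:

lm_hints_5Hz, _indices, _llm_mask = self.tokenize(hidden_states, silence_latent, attention_mask)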

🤖 Prompt for AI Agents
In `@acestep/models/sft/modeling_acestep_v15_base.py` around lines 1642 - 1645,
The code mistakenly treats self.tokenize as an object with a quantizer
attribute, causing AttributeError; change references so the quantizer is
accessed from self.tokenizer (e.g., replace
self.tokenize.quantizer.get_output_from_indices(...) with
self.tokenizer.quantizer.get_output_from_indices(...)) and ensure the tokenize
method call remains as self.tokenize(hidden_states, silence_latent,
attention_mask) to obtain lm_hints_5Hz, indices, llm_mask; verify symbols
touched: self.tokenize, self.tokenizer, get_output_from_indices, lm_hints_5Hz,
indices, llm_mask.

Comment on lines +1639 to +1641
if audio_codes is not None:
lm_hints_5Hz = self.tokenize.quantizer.get_output_from_indices(audio_codes)
else:

⚠️ Potential issue | 🔴 Critical

audio_codes path calls self.tokenize.quantizer (AttributeError).

self.tokenize is a method, so .quantizer doesn’t exist. This will crash when audio_codes is provided.

🐛 Suggested fix
-                lm_hints_5Hz = self.tokenize.quantizer.get_output_from_indices(audio_codes)
+                lm_hints_5Hz = self.tokenizer.quantizer.get_output_from_indices(audio_codes)
🤖 Prompt for AI Agents
In `@acestep/models/turbo/modeling_acestep_v15_turbo.py` around lines 1639 - 1641,
The code treats self.tokenize as an object with a .quantizer attribute, but self.tokenize is a method, so this access raises AttributeError whenever audio_codes is provided. In the block that checks audio_codes, access the quantizer through the tokenizer instance instead: replace
self.tokenize.quantizer.get_output_from_indices(audio_codes) with
self.tokenizer.quantizer.get_output_from_indices(audio_codes), and leave the else
branch's call to self.tokenize(hidden_states, silence_latent, attention_mask) unchanged.
