Feat: Zyphra/ZAYA1-8B #2529

Draft
nreHieW wants to merge 27 commits into PrimeIntellect-ai:main from nreHieW:feat/zaya

nreHieW commented May 17, 2026

Note: This PR is a work in progress. It depends on non-official versions of transformers and vLLM, because neither officially supports Zyphra/ZAYA1-8B yet. The changes to uv.lock and pyproject.toml are included for demonstration purposes only and will be removed once upstream officially supports the model. The code in this PR will also need cleanup once model support lands in both.

[WIP until both vLLM and Transformers merge support for ZAYA1-8B upstream]

Adds PrimeRL custom model support for Zaya1-8B.

Summary

  • Adds ZayaConfig and ZayaForCausalLM custom model implementation.
  • Adds vLLM weight postprocessing for Zaya’s original alternating-layer layout.
  • Adds ZayaMoE, router, and expert-parallel support.
  • Adds Zaya context-parallel support for CCA attention.
  • Adds unit/parity tests for:
    • HF vs PrimeRL forward/backward parity
    • expert-parallel MoE parity
    • context-parallel CCA attention parity
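As a rough illustration of what an expert-parallel parity check verifies, here is a toy, pure-Python sketch. It is not the test in this PR: `expert_fn`, `route`, and both `moe_*` functions are hypothetical stand-ins, and the "ranks" are simulated in a loop rather than run across devices. The idea is only that sharding experts across ranks and combining the results should reproduce the single-device output.

```python
# Toy expert-parallel (EP) parity check. All names are illustrative
# stand-ins, not the PrimeRL or HF implementation.

def expert_fn(expert_id: int, x: float) -> float:
    # Stand-in for an expert MLP: each expert applies a different affine map.
    return (expert_id + 1) * x + expert_id


def route(x: float, num_experts: int) -> int:
    # Stand-in for a top-1 router: a deterministic function of the token value.
    return int(abs(x) * 10) % num_experts


def moe_reference(tokens, num_experts):
    # Single-device reference: every expert is local.
    return [expert_fn(route(x, num_experts), x) for x in tokens]


def moe_expert_parallel(tokens, num_experts, ep_size):
    # Each simulated rank owns a contiguous block of experts.
    assert num_experts % ep_size == 0, "experts must divide evenly across ranks"
    per_rank = num_experts // ep_size
    out = [None] * len(tokens)
    for rank in range(ep_size):
        owned = range(rank * per_rank, (rank + 1) * per_rank)
        for i, x in enumerate(tokens):
            expert_id = route(x, num_experts)
            # "Dispatch": a rank only processes tokens routed to its experts.
            if expert_id in owned:
                # "Combine": write the result back at the original position.
                out[i] = expert_fn(expert_id, x)
    return out


tokens = [0.1, 0.25, 0.4, 0.75, 1.3, 2.0]
ref = moe_reference(tokens, num_experts=4)
ep = moe_expert_parallel(tokens, num_experts=4, ep_size=2)
assert ep == ref  # parity: sharded execution matches the reference
```

A real parity test would compare tensors within a tolerance and also compare gradients; exact equality works here only because both paths run the same scalar arithmetic.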

Sanity checks

  • Formatting checks: uv run ruff format
  • Linting checks: uv run ruff check --fix

I ran RL sanity checks on Zaya with:

  • reverse-text
  • hendrycks-math

Reverse Text

[Two images: reverse-text sanity-check results]

Hendrycks Math

[Two images: hendrycks-math sanity-check results]

Tests

uv run scripts/mini_moe.py --arch zaya --output-dir ./mini_zaya
[Image: output of the command above]

uv run pytest tests/unit/train/models/test_zaya.py
[Image: output of the command above]

Notes

  • Once vLLM officially ports the logic from the HF PR, much of the code in this PR can be simplified. Right now the vLLM weight broadcast step is slow because the custom conversion it relies on is expensive. That conversion can be removed once vLLM and HF use the same implementation. For example, the changes in the following files can all be removed:
    • src/prime_rl/trainer/models/zaya/vllm_postprocessing.py
    • src/prime_rl/trainer/rl/broadcast/filesystem.py
    • src/prime_rl/trainer/rl/broadcast/nccl.py
    • Environment changes to revert: uv.lock, pyproject.toml
  • This PR supports the ZAYA1-8B model specifically, so there is currently no support for SWA.
  • tests/unit/train/models/test_qwen3_5_moe.py::test_qwen3_5_moe_cp_patching now resets the FlashAttention method after it runs. Previously it was the last model test to run, so the leftover patch did not affect anything. Zaya calls FlashAttention, so without this change Zaya would be calling the patched FlashAttention. (Resolved)
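The patch-leak issue described above can be sketched as follows. This is a minimal illustration, not the change made in the test file: `attn_module` and `patched_attr` are hypothetical names, and a `SimpleNamespace` stands in for the real module that exposes the attention function. The point is that a patch applied in one test should be undone in a `finally` block so that later tests see the original function.

```python
import contextlib
import types

# Hypothetical stand-in for a module that exposes an attention function.
attn_module = types.SimpleNamespace(flash_attn_func=lambda q: ("original", q))


@contextlib.contextmanager
def patched_attr(obj, name, replacement):
    """Temporarily replace obj.<name>, restoring it even if the body raises."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        # Without this restore, every later caller would see the patch.
        setattr(obj, name, original)


with patched_attr(attn_module, "flash_attn_func", lambda q: ("patched", q)):
    inside = attn_module.flash_attn_func(1)  # patched behaviour inside the block

after = attn_module.flash_attn_func(1)  # original behaviour is back
```

In a pytest suite, the built-in `monkeypatch` fixture gives the same guarantee, since it undoes `setattr` calls at test teardown.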
  • The changes to src/prime_rl/inference/patches.py track vLLM v0.21, which implements dual-phase pausing; this PR implements it as well. ([codex] chore: bump vllm to 0.21.0 #2519 Resolved)
  • tests/unit/train/models/test_zaya.py::test_zaya runs full round-trip forward and backward tests against the actual HF model. This is slow, and since other models have no such round-trip test, it can be removed if necessary.
  • In the CCA module, we cannot simply use all-to-all for value_delayed. In this config, value_delayed has num_key_value_heads / 2 = 1 head, so it cannot be sharded over cp_size.
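The constraint in the last note can be sketched as a divisibility check. This assumes the all-to-all would scatter the head dimension across context-parallel ranks, which is an assumption about the scheme rather than something taken from the PR code; `heads_per_rank` is a hypothetical helper. The value `num_key_value_heads = 2` follows from the note's `num_key_value_heads / 2 = 1`.

```python
def heads_per_rank(num_heads: int, cp_size: int) -> int:
    """Heads each context-parallel rank would own if heads were scattered evenly."""
    if num_heads % cp_size != 0:
        raise ValueError(
            f"cannot shard {num_heads} head(s) evenly over cp_size={cp_size}"
        )
    return num_heads // cp_size


num_key_value_heads = 2  # implied by the note: num_key_value_heads / 2 = 1
value_delayed_heads = num_key_value_heads // 2  # a single head

# The full set of KV heads can be split over two ranks...
kv_split = heads_per_rank(num_key_value_heads, cp_size=2)

# ...but the single value_delayed head cannot.
try:
    heads_per_rank(value_delayed_heads, cp_size=2)
    value_delayed_shardable = True
except ValueError:
    value_delayed_shardable = False
```

With a single head, any cp_size greater than 1 fails the check, so that tensor needs a different communication pattern than a head-scattering all-to-all.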

nreHieW and others added 2 commits May 19, 2026 09:36
Signed-off-by: nrehiew <81154837+nreHieW@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>