
Fix/causal conv1d cuda128 #1354

Open
dzorlu wants to merge 872 commits into NovaSky-AI:main from fleet-ai:fix/causal-conv1d-cuda128

Conversation


@dzorlu dzorlu commented Mar 20, 2026

Deniz and others added 30 commits January 27, 2026 15:38
* feat(fleet): add Tinker backend for Fleet task training

Add support for training on Fleet environments using Tinker (hosted) as
the training and inference backend. This provides an alternative to the
existing PyTorch/Ray/vLLM setup.

New files:
- main_fleet_tinker.py: Training entrypoint using Tinker API
  - Uses existing FleetTaskEnv for environment interaction
  - GRPO advantage estimation
  - Checkpoint management
  - WandB logging

- openenv-fleet-train-tinker.yaml: CI workflow
  - Much simpler than SkyPilot version (no GPU provisioning)
  - Tinker handles compute allocation
  - Same inputs (modality, env_key, max_tasks, etc.)

Required secrets:
- TINKER_API_KEY: Tinker hosted service authentication
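The GRPO advantage estimation listed above can be sketched as a minimal, dependency-free function (the actual implementation in main_fleet_tinker.py may differ; this only illustrates the group-relative normalization):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sample's reward by the
    mean and std of its rollout group (the GRPO estimator)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # eps guards against division by zero when all rewards in a group tie
    return [(r - mean) / (std + eps) for r in rewards]
```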

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(tinker): complete Fleet+Tinker integration

- Add DictConfig wrapper for FleetTaskEnv (required by SkyRL's env)
- Configure Tinker ServiceClient with API URL and key from env vars
- Add advantage metrics (mean, std) to WandB logging
- Add per-environment rollout metrics (turns, tool_calls)
- Remove vLLM from CI (Tinker handles inference)
- Add TINKER_API_URL to CI environment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(tinker): complete Fleet+Tinker integration with SkyRL metrics

Key changes:
- Use OpenEnv FleetTaskEnv directly with async methods (reset_async,
  step_async) to avoid nested asyncio.run() issues
- Add pass@k metrics matching SkyRL's implementation
- Add per-environment metrics (reward/{env}/pass_at_n, avg_score)
- Add evaluation metrics (eval/all/pass_at_1, per-env breakdown)
- Build system prompt with tools inline (like SkyRL's FleetTaskEnv)
- Proper task config normalization from JSON

WandB metrics now match SkyRL:
- reward/avg_pass_at_{n}: Overall pass@k
- reward/avg_raw_reward: Average reward
- reward/{env_key}/pass_at_{n}: Per-environment pass@k
- eval/all/pass_at_1: Evaluation pass@1
- eval/{env_key}/pass_at_1: Per-environment eval

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: format fill_results_from_wandb.py with black

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: clarify TINKER_API_URL is optional

The Tinker SDK uses a default endpoint if TINKER_API_URL is not set.
Only TINKER_API_KEY is required for authentication.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Temporarily send Slack notifications to test channel while validating
the Tinker integration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The Fleet MCP client requires the 'mcp' package which is an optional
dependency in OpenEnv. Install with openenv[fleet] to get both mcp
and fleet-python dependencies.

Fixes: ModuleNotFoundError: No module named 'mcp'

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Qwen2.5-1.5B-Instruct is not supported by Tinker API.
Switch to Qwen3-VL-30B-A3B-Instruct which is available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
apply_chat_template can return BatchEncoding dict instead of plain
list on some tokenizers. Tinker's ModelInput.from_ints() requires
a plain list of integers.

Added tokenize_chat() helper to handle both cases.

Fixes: EncodedTextChunk validation error
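A minimal sketch of the tokenize_chat() helper described above (the exact signature in the PR may differ; the point is normalizing both possible return types to a plain list of ints):

```python
def tokenize_chat(tokenizer, messages):
    """Return a plain list[int] regardless of tokenizer output type.

    apply_chat_template may return a list of token ids or a BatchEncoding
    mapping depending on the tokenizer; downstream APIs such as
    ModelInput.from_ints() require a flat list of integers.
    """
    ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    if not isinstance(ids, list):        # e.g. a BatchEncoding mapping
        ids = ids["input_ids"]
    if ids and isinstance(ids[0], list):  # batched output: take first row
        ids = ids[0]
    return [int(t) for t in ids]
```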

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Pass single ModelInput, not a list
- Add required num_samples=1 argument

API signature per docs:
  sample(prompt, num_samples, sampling_params, ...)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
SampledSequence uses 'tokens' not 'token_ids' per API docs:
- stop_reason: Reason why sampling stopped
- tokens: List of generated token IDs
- logprobs: Log probabilities for each token

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* fix(tinker): add max_input_length check to prevent context overflow

Match SkyRL's approach in skyrl_gym_generator.py:274 - end rollout
when context exceeds max_input_length instead of hitting API error.

Changes:
- Add max_generate_length param (renamed from max_tokens)
- Add max_input_length param (default 30720 = 32768 - 2048)
- Check context length at start of each turn, break with stop_reason="length"
- Track and return stop_reason in rollout output
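The per-turn length check can be sketched as a small loop (function names here are hypothetical; the real rollout loop carries much more state):

```python
def run_turns(context_len_fn, generate_fn, max_turns, max_input_length=30720):
    """End the rollout with stop_reason="length" before the next generation
    call would overflow the context, instead of hitting an API error.

    context_len_fn returns the current token count of the conversation;
    generate_fn runs one turn and returns True when the task is done.
    """
    stop_reason = None
    for _ in range(max_turns):
        if context_len_fn() >= max_input_length:
            stop_reason = "length"  # break instead of erroring in the API
            break
        if generate_fn():
            stop_reason = "done"
            break
    return stop_reason
```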

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: use #fleet-training-runs Slack channel

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix(tinker): truncate overlong sequences with DAPO filtering

Match SkyRL's approach: truncate sequences exceeding max_sequence_length
and zero out their loss mask (DAPO overlong filtering). This prevents
the Tinker API error while keeping sequences in the batch.

Changes:
- Add max_sequence_length param (default 32768)
- Truncate sequences > max_sequence_length to fit model context
- Zero out loss mask for truncated sequences (won't contribute to loss)
- Track truncated_overlong count in metrics
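The truncate-and-zero-mask step can be sketched as follows (a simplified stand-in for the prepare_training_sequence helper added later in this PR; real shapes and field names may differ):

```python
def prepare_training_sequence(prompt_ids, response_ids, loss_mask,
                              max_sequence_length=32768):
    """DAPO-style overlong handling: keep the sequence in the batch but
    truncate it to max_sequence_length and zero its loss mask so it
    contributes nothing to the loss."""
    total = len(prompt_ids) + len(response_ids)
    truncated = total > max_sequence_length
    if truncated:
        keep = max(0, max_sequence_length - len(prompt_ids))
        response_ids = response_ids[:keep]
        loss_mask = [0] * len(response_ids)  # zeroed: no gradient signal
    else:
        loss_mask = loss_mask[:len(response_ids)]
    return response_ids, loss_mask, truncated
```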

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: add Tinker development guidelines to CLAUDE.md

* Add unit tests for sequence truncation and overlong filtering

- Add integrations/fleet/utils.py with refactored pure Python functions:
  - truncate_sequence: truncate prompt+response to max_sequence_length
  - truncate_auxiliary_data: truncate logprobs/loss_mask to match
  - apply_overlong_filtering_simple: DAPO filtering (zero mask if no EOS)
  - prepare_training_sequence: combined truncation logic

- Add integrations/fleet/tests/test_tinker_training.py with 20 tests:
  - TestTruncateSequence: sequence truncation behavior
  - TestTruncateAuxiliaryData: logprobs/mask truncation
  - TestOverlongFiltering: DAPO EOS-based filtering
  - TestPrepareTrainingSequence: combined preparation
  - TestCombinedFlow: full DAPO + truncation flow

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add duration timer to collect_fleet_rollout

- Track rollout duration in collect_fleet_rollout and error cases
- Log per-environment duration metrics: rollout/{env_key}/duration
- Log overall duration stats: rollout/avg_duration, max_duration, min_duration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor main_fleet_tinker.py to use FleetTaskEnv from env.py

- Add init_async() and step_async() methods to FleetTaskEnv in env.py
  - Async methods contain the actual logic (await OpenEnv's async methods)
  - Sync methods (init, step) are thin wrappers using asyncio.run()
  - Enables both sync (SkyRL generator) and async (Tinker) usage

- Refactor main_fleet_tinker.py to use the shared FleetTaskEnv wrapper:
  - Import FleetTaskEnv from integrations.fleet.env
  - Use env.init_async() for initialization (handles system prompt, tools)
  - Use env.step_async() for stepping (handles tool parsing, chat history)
  - Access env.chat_history for tokenization, env.turns/tool_calls for metrics

- Remove duplicated code from main_fleet_tinker.py:
  - load_tasks_from_json (now in env.py)
  - build_system_prompt (handled by env.init_async)
  - parse_tool_call (handled by env.step_async)
  - Manual chat history management

Single source of truth for Fleet environment logic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Extract prepare_training_data() function (matching SkyRL pattern)

Refactor training data preparation into a dedicated function, similar to
SkyRL's generate_batched pattern:

- prepare_training_data() handles:
  1. DAPO overlong filtering (zero loss mask if no EOS)
  2. Sequence truncation for max_sequence_length
  3. Building Tinker Datum objects

- main() is now cleaner, focused on orchestration:
  - Rollout collection
  - Metrics computation
  - prepare_training_data() call
  - Training step

This improves code organization, testability, and matches SkyRL's
separation of concerns between rollout collection and data preparation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Extract compute_rollout_metrics() and update tests

- Add compute_rollout_metrics() function for all rollout metrics:
  - Core reward metrics (pass@n, avg_reward, mean_positive)
  - Advantage stats (mean, std)
  - Rollout counts (valid, total)
  - Per-environment metrics
  - Per-environment rollout stats (turns, tool_calls, duration)
  - Overall duration stats

- Update tests:
  - Add docstring explaining tests validate prepare_training_data pattern
  - Add test_batch_processing for multi-rollout scenarios

main() is now cleaner with clear separation:
1. Rollout collection
2. Advantage computation
3. compute_rollout_metrics()
4. prepare_training_data()
5. Training step

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add logging for invalid rollouts

Log each invalid rollout with:
- task_key
- error message (or "no response_ids")
- stop_reason

Also track rollouts/invalid metric in wandb.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add Fleet integration architecture diagram

Document showing:
- Training loops (Tinker async vs SkyRL sync)
- SkyRL FleetTaskEnv wrapper (sync/async methods)
- OpenEnv FleetTaskEnv (low-level)
- Fleet Platform
- Communication flow and data structures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove unused pytest import (ruff fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Improve architecture diagram: add function details, simplify layout

* Fix diagram: SkyRL uses vLLM on GPU, not local inference

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Previously, rollouts were collected sequentially in a for loop, waiting
for each rollout to complete before starting the next. Since rollouts
are independent and I/O-bound (HTTP calls to Fleet + Tinker), they can
run concurrently.

With batch_size=8 and n_samples_per_prompt=4, this could be up to 32x
faster for rollout collection.
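The sequential-to-concurrent change amounts to replacing the for loop with asyncio.gather over independent rollout coroutines, roughly:

```python
import asyncio

async def collect_rollouts(rollout_fns):
    """Run independent, I/O-bound rollouts concurrently instead of one at a
    time. Each entry in rollout_fns is a zero-arg callable returning a
    rollout coroutine; results come back in submission order."""
    return await asyncio.gather(*(fn() for fn in rollout_fns))
```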

Co-authored-by: Deniz <deniz@Mac.localdomain>
- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)

Co-authored-by: Deniz <deniz@Mac.localdomain>
* fix(ci): add cancelled notification and default to 50 steps

- Add Slack notification when training run is cancelled
- Change default max_steps from 200 to 50 for faster iteration

* Rename Tinker -> SkyRL in Slack notifications

* Revert "Rename Tinker -> SkyRL in Slack notifications"

This reverts commit 000226361ce19b51dae9859c2d69663fcd03d77e.

* Rename Fleet -> SkyRL in SkyPilot workflow notifications

* Suppress MCP client INFO logs (mcp.client.streamable_http)

* Consolidate Slack notifications: generic headers with Backend field

- Change headers from 'Tinker/SkyRL Training' to just 'Training'
- Add 'Backend' field (Tinker or SkyRL) in the message body
- Change channel to #fleet-training-runs-test for testing

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
* feat: add progress logging during rollout collection

Log progress at ~25%, 50%, 75%, 100% completion:
  Progress: 8/32 rollouts completed
  Progress: 16/32 rollouts completed
  Progress: 24/32 rollouts completed
  Progress: 32/32 rollouts completed

Uses asyncio.as_completed to track progress while maintaining parallel execution.
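A sketch of the as_completed-based progress tracking (simplified; the real version logs through the module logger rather than a callback):

```python
import asyncio

async def collect_with_progress(coros, log=print):
    """Run rollout coroutines in parallel, logging completion counts at
    roughly 25%, 50%, 75%, and 100%."""
    total = len(coros)
    tasks = [asyncio.ensure_future(c) for c in coros]
    marks = {max(1, total * q // 4) for q in (1, 2, 3, 4)}
    results, done = [], 0
    for fut in asyncio.as_completed(tasks):
        results.append(await fut)  # yields in completion order, not submission order
        done += 1
        if done in marks:
            log(f"Progress: {done}/{total} rollouts completed")
    return results
```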

* fix(tinker): use dict access for TypedDict step output

BaseTextEnvStepOutput is a TypedDict, not a class with attributes.
Use dict access (["observations"], ["reward"], ["done"]) instead of
attribute access (.observations, .reward, .done).
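The distinction in a nutshell (field names taken from the commit message; a TypedDict is a plain dict at runtime, so attribute access fails):

```python
from typing import List, TypedDict

class BaseTextEnvStepOutput(TypedDict):
    observations: List[dict]
    reward: float
    done: bool

step: BaseTextEnvStepOutput = {"observations": [], "reward": 1.0, "done": True}
reward = step["reward"]  # correct: dict access
# step.reward would raise AttributeError: TypedDict instances are plain dicts
```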

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(tinker): add timestamps to log messages

Logs now show HH:MM:SS timestamps for easier debugging:
  12:34:56 INFO __main__: Step 0: Collecting rollouts...

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix(tinker): limit concurrent Fleet env connections with semaphore

Fleet MCP connections time out when too many are opened simultaneously.
Add a semaphore to limit concurrent rollouts to 4 (configurable via the
max_concurrent parameter).

With batch_size=8 and n_samples_per_prompt=4, we were trying to open
32 MCP connections at once, causing connection timeouts.
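The semaphore pattern looks roughly like this (function names hypothetical; the cap bounds how many Fleet connections are open at once while the rest of the batch waits):

```python
import asyncio

async def bounded_rollout(semaphore, rollout_fn):
    """Hold one semaphore slot for the duration of a rollout."""
    async with semaphore:
        return await rollout_fn()

async def collect_all(rollout_fns, max_concurrent=4):
    """Launch every rollout, but let at most max_concurrent run at a time."""
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(bounded_rollout(sem, fn) for fn in rollout_fns))
```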

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(tinker): use Pydantic RolloutOutput instead of dict

Replace dict returns with typed Pydantic model for better validation
and IDE support. Fields:
- prompt_ids, response_ids, logprobs, loss_mask (sequences)
- reward, task_key, env_key, turns, tool_calls, stop_reason, duration
- error (optional, for failed rollouts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix(tinker): reduce max concurrent rollouts from 4 to 2

Fleet MCP connections still timing out with 4 concurrent rollouts.
Reduce to 2 to further decrease pressure on Fleet infrastructure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(tinker): use larger GitHub runner (4-cores) and restore concurrency

- Use ubuntu-latest-4-cores (4 vCPU, 16GB RAM) instead of ubuntu-latest
- Restore max_concurrent to 4 (larger runner can handle more connections)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(tinker): update WandB run name format to match SkyRL

Change from: tinker_tool_use_0a7278bc_20260128-2127
Change to:   fleet-tool-use-0a7278bc-20260128-2127

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Shows step progress with key metrics:
  Training:  10%|██        | 5/50 [02:30<22:30, pass@4=0.125, reward=0.05, time=30.1s]

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Quick reference for SkyRL and Tinker training runs:
- Backend comparison
- GitHub Actions parameters
- Monitoring (Slack, WandB)
- Testing small runs
- Troubleshooting common issues
- Required secrets

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Revert runner from ubuntu-latest-4-cores to ubuntu-latest
  (larger runners require org-level enablement)
- Reduce max_concurrent from 4 to 2 to prevent MCP timeouts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix: disable pip cache for OpenEnv to get latest code

* fix: use ThreadPoolExecutor for env ops to isolate MCP connections

Match SkyRL's pattern - run env.init/step in threads so each gets
its own event loop and isolated httpx connections. Fixes MCP timeout
issues caused by shared connection pool contention.
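The thread-isolation pattern can be sketched as follows (names are hypothetical; the idea is that blocking env.init/env.step calls run in worker threads, so each gets its own event loop and httpx connection state):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Shared pool for environment operations; 16 matches the later bump.
_env_pool = ThreadPoolExecutor(max_workers=16)

async def run_env_op(sync_fn, *args):
    """Run a blocking environment call in a worker thread without blocking
    the event loop that drives the concurrent rollouts."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_env_pool, sync_fn, *args)
```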

* feat: increase ThreadPoolExecutor max_workers to 16

* feat: increase max_concurrent to 8 (safe with ThreadPoolExecutor)

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
502 Bad Gateway errors were occurring exactly 10 minutes into rollouts
because instances were hitting TTL expiration.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
#85)

Two issues causing training failures:

1. A 30-minute TTL was still not enough - some rollouts with many turns take
   30+ minutes, causing 502 Bad Gateway errors when instances expire. Increased
   the TTL to 7200s (2 hours).

2. Pydantic attribute access bug - RolloutOutput is a Pydantic model but
   code was using dict-style `.get()` access. Fixed to use attribute
   access for filtering and added `rollout_to_dict()` helper for metrics
   functions that expect dict format.
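The access-pattern fix in miniature (pydantic may not be installed here, so this sketch uses a dataclass stand-in for the RolloutOutput model; the real helper presumably uses pydantic's model_dump):

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class RolloutOutput:  # stand-in for the Pydantic model in the PR
    reward: float = 0.0
    error: Optional[str] = None

def rollout_to_dict(r: RolloutOutput) -> dict:
    """Bridge for metrics code that expects dicts: model instances use
    attribute access (r.error), not dict-style r.get("error")."""
    return asdict(r)

# Filtering uses attribute access, not .get():
valid = [r for r in [RolloutOutput(reward=1.0), RolloutOutput(error="boom")]
         if r.error is None]
```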

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Log each turn completion with:
- gen: Tinker generation time
- step: Fleet environment step time (MCP tool call)
- total: total turn time
- toks: tokens generated
- reward: current reward
- status: DONE or ...

Example output:
[task_key] Turn 1: gen=2.3s step=1.5s total=3.8s toks=156 reward=0.00 ...
[task_key] Turn 2: gen=1.8s step=0.9s total=2.7s toks=89 reward=1.00 DONE

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* feat(fleet): add step timing logs to FleetTaskEnv

Add timing instrumentation to compare SkyRL vs Tinker performance:

Init logs:
  [task_key] Init: env=3.2s reset=8.5s total=11.7s tools=100

Step logs:
  [task_key] Turn 1: step=85.2s mcp=85.0s tool=search reward=0.00 ...
  [task_key] Turn 2: step=42.1s mcp=42.0s tool=click reward=1.00 DONE

Also adds step_time and mcp_time to step metadata for downstream analysis.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(tinker): add timing metrics to WandB

Track generation and Fleet step times in WandB:

Timing metrics:
- time/gen_total, time/gen_mean: Tinker generation time
- time/step_total, time/step_mean: Fleet MCP step time
- time/gen_pct, time/step_pct: Percentage breakdown

Throughput metrics:
- throughput/tokens_total: Total tokens generated
- throughput/tokens_per_sec_gen: Generation throughput (Tinker only)
- throughput/tokens_per_sec_effective: End-to-end throughput (including MCP)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix(tinker): use async sampling to avoid event loop blocking

The synchronous `.result()` call on `sampling_client.sample()` was blocking
the event loop, causing concurrent rollouts to serialize instead of running
in parallel. This resulted in 100+ second step times when the actual MCP
calls only took ~1 second.

Changed to use `sample_async()` with double await pattern:
- First await returns a future
- Second await gets the result

This allows the event loop to continue processing other rollouts while
waiting for Tinker generation to complete.
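The double-await pattern in isolation (client and method names follow the commit message; the stub in the test only illustrates the shape, not the real Tinker SDK):

```python
async def sample_one(sampling_client, model_input, sampling_params):
    """Non-blocking sampling: the first await returns a future, the second
    await resolves it, so the event loop keeps serving other rollouts while
    generation is in flight."""
    future = await sampling_client.sample_async(
        prompt=model_input, num_samples=1, sampling_params=sampling_params
    )
    return await future
```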

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: use #fleet-training-runs channel (remove -test)

* chore: remove turn-based console logging (keep timing in metadata for WandB)

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
dzorlu and others added 28 commits March 2, 2026 10:36
#275)

Some v5 environments have large MCP tool schemas that can push the
initial prompt past max_input_length before any generation happens.
This causes response_end_idx to remain None, crashing with:
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

Changes:
- Bump MAX_INPUT_LENGTH from 48000 to 64000 (all model configs)
- Bump YaRN rope_scaling factor from 2.0 to 2.5 for Qwen3-8B
  (32768 * 2.5 = 81920 effective context, enough for 64K + 8K generate)
- Qwen3-32B rope unchanged (40960 * 2.0 = 81920, already sufficient)
- Add guard in agent_loop for when prompt exceeds max_input_length
  before any generation: logs warning and returns zero-reward trajectory

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
With max_input_length=64000, forward_backward OOMs on H100s.
Halve micro batch sizes:
- 8B (4gpu): 2 → 1
- 8B (8gpu): 4 → 2
- 8B step-wise: already at 1, no change

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Modality is already defined in each SkyPilot task YAML's MODALITY env var,
so the workflow input was redundant. WANDB key selection now uses the
task-derived modality instead of hardcoding tool_use.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default TTL is now None, which lets OpenEnv auto-select based on
modality. Can be overridden via env_config.ttl_seconds in task YAML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL's default.yaml hardcoded ttl_seconds=600, which overrode
OpenEnv's 900s default. Logfire confirmed the orchestrator received
600s for every instance in the training run, causing premature
expiry and downstream 502 Bad Gateway errors on tool calls.

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add GCP spot H200 as fallback compute option in all task YAMLs

Adds GCP preemptible H200 entries to all SkyPilot task YAML configs
as the last fallback option. GCP project fleet-compute-489000 has 64
preemptible H200 GPUs per region. Storage remains on S3 (cross-cloud).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add GCP to GHA workflow (cloud credentials, sky check, dropdown)

- Add GCP service account key setup in Configure Cloud Credentials step
- Install google-api-python-client and google-cloud-storage on runner
- Add gcp to sky check verification step
- Add gcp option to cloud dropdown in workflow_dispatch

Requires GCP_SERVICE_ACCOUNT_KEY secret to be added to the repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update test to match ttl_seconds=900 default

The default was changed to 900 in config but the test still asserted 600.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: make Qwen3.5-9B the default tool-use training model

Adds a new task YAML for Qwen3.5-9B in text-only (tool_use) mode,
adapted from the CU version on feat/qwen3.5-9b. Key differences from
the Qwen3-8B config: v51 dataset, gpu_memory_utilization=0.8,
no YaRN (native 262K context), flash_attn=false (torch 2.10 compat),
vLLM nightly + transformers from source for Qwen3.5 support, and
CUDA toolkit install for FlashInfer JIT (GatedDeltaNet kernels).

Updates the GHA workflow to make qwen3_5-9b the default task.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove xlam-70b and glm-4.7-flash from workflow options

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set max_prompt_length to MAX_INPUT_LENGTH (64000)

2048 tokens was filtering out tasks with long tool schema lists.
Since max_prompt_length controls the initial prompt filter in
dataset.py, it should match the input budget so no valid tasks
are dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use feat/qwen3.5-9b branch for workdir (vLLM nightly compat)

The vLLM nightly stack (torch 2.10, newer Ray, transformers source)
requires compatibility fixes that only exist on feat/qwen3.5-9b:
- Ray import fallback for ray.experimental.collective.util removal
- torch 2.10 Parameter.__new__ patch for accelerate compat
- return_dict=False for newer transformers apply_chat_template
- VL model detection in model_wrapper (Qwen3.5 config is VL)
- FSDP2 set-to-list fix

main branch code crashes immediately with ModuleNotFoundError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cherry-pick vLLM nightly compat fixes instead of pointing to VL branch

Instead of using feat/qwen3.5-9b (which includes all VL/multimodal code),
cherry-pick only the minimal compatibility fixes needed for the nightly stack:

- inference_engines/utils.py: Ray import fallback for removed
  ray.experimental.collective.util module
- distributed/fsdp_utils.py: torch 2.10 Parameter.__new__ patch for
  accelerate compat + FSDP2 set-to-list fix
- generators/utils.py + skyrl_gym_generator.py: return_dict=False for
  newer transformers apply_chat_template (returns dict by default now)
- dataset/dataset.py: same return_dict=False fix

workdir.ref now points to this branch (feat/qwen3.5-9b-tool-use).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update mock tokenizer to accept **kwargs (return_dict compat)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: patch ALLOWED_LAYER_TYPES for vLLM nightly compat

vLLM nightly dev267+ imports ALLOWED_LAYER_TYPES from
transformers.configuration_utils, but transformers main
split it into ALLOWED_ATTENTION_LAYER_TYPES + ALLOWED_MLP_LAYER_TYPES.
Add a post-install patch to create the alias.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip missing FSDP layer classes instead of raising

Qwen3.5-9B's _no_split_modules includes Qwen3_5VisionBlock which
doesn't exist when loading as AutoModelForCausalLM (text-only).
Skip missing classes with a warning instead of raising, and only
raise if NO classes are found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pre-download model weights to avoid HuggingFace race condition

When 4 FSDP workers + 4 vLLM engines all try to download model weights
simultaneously without auth, HuggingFace rate-limits the requests causing
OSError: file not found. Pre-download in setup phase before any parallel
processes start.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use python for model pre-download instead of huggingface-cli

huggingface-cli not on PATH in SkyPilot setup environment despite venv
being activated. Use huggingface_hub.snapshot_download() directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: limit vLLM max_model_len and disable experimental Mamba prefix caching

Qwen3.5-9B has native 262K context, but vLLM auto-detects this as
max_model_len causing massive KV cache + Mamba state allocation. Combined
with experimental Mamba cache alignment, this causes NCCL ALLREDUCE
timeouts (600s) as memory pressure stalls collective operations.

- Set max_model_len=73728 (64K input + 8K generate + padding)
- Disable prefix caching (experimental for Mamba/GatedDeltaNet layers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use Hydra + prefix to append engine_init_kwargs

Hydra strict mode requires + prefix when adding keys not in the base
config struct. engine_init_kwargs is an empty dict by default, so
max_model_len and enable_prefix_caching need +key=value syntax.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove enable_prefix_caching from engine_init_kwargs

SkyRL's ray_wrapped_inference_engine already passes
enable_prefix_caching as a separate kwarg. Adding it to
engine_init_kwargs causes duplicate keyword argument error.
max_model_len=73728 alone should fix the NCCL timeout by reducing
memory allocation from 262K to 73K context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: download model to local dir to prevent HF cache race condition

snapshot_download to HF cache still causes shard resolution failures
when 4 FSDP workers load concurrently. Download to $HOME/models/ with
local_dir and verify all shards present before starting training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add NCCL debug logging and reduce max_model_len to 32K

Previous runs show 12 min of GPU activity then silent crash — consistent
with NCCL timeout. Add RAY_DEDUP_LOGS=0, NCCL_DEBUG=WARN for better
error visibility. Reduce max_model_len from 73K to 32K to rule out
memory pressure with GatedDeltaNet/linear attention state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use transformers backend for vLLM (Qwen3_5 arch not natively supported)

vLLM nightly does not have native support for Qwen3_5ForConditionalGeneration.
The auto fallback doesn't work (throws ValidationError instead of falling back).
Explicitly set model_impl=transformers to use HuggingFace Transformers backend
for inference via vLLM's serving infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pin vLLM to Qwen3.5 support commit instead of unpinned nightly

The nightly extra-index-url resolves non-deterministically, sometimes
picking a build without Qwen3_5ForConditionalGeneration support.
Pin to commit 9562912 (PR #34110) which added Qwen3.5 architecture.
Also add verification that the vLLM module exists before proceeding.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pop hf_overrides from engine_init_kwargs to prevent duplicate kwarg

engine_init_kwargs is spread as **kwargs alongside rope_engine_kwargs
which also contains hf_overrides. Using .get() leaves hf_overrides in
both dicts, causing TypeError for duplicate keyword argument. Convert
to regular dict and use .pop() to remove it from engine_init_kwargs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use nightly index-url and patch config.json for TransformersForCausalLM

The pinned vLLM commit wheel wasn't being resolved properly, falling
back to stable which lacks Qwen3_5ForConditionalGeneration support.
Switch to --index-url for nightly, and if the nightly still doesn't
have native Qwen3.5, patch config.json to use TransformersForCausalLM
generic backend. Remove model_impl from engine_init_kwargs (handled
by config.json patch instead).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add language_model_only=true for vLLM Qwen3.5 multimodal arch

Qwen3.5-9B uses Qwen3_5ForConditionalGeneration (multimodal arch).
vLLM nightly doesn't support this arch directly for text-only.
The --language-model-only flag tells vLLM to skip the vision encoder
and use the text-only path, which resolves the architecture error.

Also restore max_model_len to 73728 (was accidentally lowered to 32768).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: always use TransformersForCausalLM backend, remove language_model_only

The NCCL ALLREDUCE timeout (SeqNum=865) during weight sync is caused by
weight name mismatches between FSDP and vLLM. FSDP loads Qwen3_5ForCausalLM
via AutoModelForCausalLM, but vLLM's native handler uses different naming.

Fix: unconditionally patch config.json to TransformersForCausalLM so vLLM
also loads via AutoModelForCausalLM → same Qwen3_5ForCausalLM class → identical
weight names for NCCL sync.

Remove language_model_only=true: this flag is for native multimodal handlers.
TransformersForCausalLM already loads text-only via AutoModelForCausalLM, and
the flag may cause errors when applied to a non-multimodal model wrapper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --retry-until-up for GPU capacity scarcity

SkyPilot will continuously retry all providers until resources are
available, bounded by the 72h workflow timeout on the self-hosted runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: install transformers from git LAST to prevent litellm downgrade

litellm pins transformers<5.0, overriding the git install (5.3.0.dev0)
back to 4.57.6, which lacks transformers.models.qwen3_5. Move the git
install after all other pip installs with --no-deps to prevent this.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove _MODELS import that broke on vLLM 0.13.0 nightly

The setup check `from vllm.model_executor.models import _MODELS` fails
on vLLM 0.13.0+ where that API was removed. Replace with a simple
version assertion since we always patch config.json to
TransformersForCausalLM anyway.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add 8-GPU fallback options (B200:8, H100:8, A100-80GB:8)

When 4-GPU instances are sold out across all providers, fall back to
8-GPU configs. The 9B model fits easily and config uses
$SKYPILOT_NUM_GPUS_PER_NODE dynamically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use --pre --extra-index-url for vLLM nightly install

--index-url breaks dependency resolution since wheels.vllm.ai/nightly
only has vllm wheels, causing uv to keep stable 0.13.0. Use
--extra-index-url to keep PyPI for deps + --pre to accept dev versions.

Also fix version check: require 'dev' in version string (0.13.0 falsely
passed the old >=13 check despite being stable without Qwen3.5 support).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: force-reinstall transformers and assert version >= 4.58

Previous runs failed because transformers 4.57.x was still installed
despite the git install. Use --force-reinstall to guarantee the git
version overwrites, and add an assertion that fails fast if the wrong
version is installed (instead of cryptic errors during training).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: temporarily limit qwen3.5-9b to GCP-only to verify GCP works

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add GCP support to training workflow, update dataset to v52, VL disk/GPU changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: default data_version to v52 in GHA workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: install gcloud CLI for GCP support in training workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clean runner workspace before checkout to prevent stale temp files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove runner cleanup step that was deleting runner state files

The rm -rf $RUNNER_TEMP/* was nuking _runner_file_commands created
during job setup, causing checkout to fail with "Missing file at path".
Use checkout's built-in clean: true instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use correct gcloud CLI download URL

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restart SkyPilot API server before sky check so it picks up gcloud in PATH

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use 8x GPUs for GCP spot (4x not available on GCP)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: upgrade huggingface_hub before transformers dev (is_offline_mode import)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable MTP and remove MTP weights from index for TransformersForCausalLM

Qwen3.5-9B has model.mtp (multi-token prediction) weights that the
TransformersForCausalLM backend can't load. Set num_nextn_predict_layers=0
in config.json and strip MTP entries from safetensors index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: align tool-use task YAML with working VL branch setup

- Remove TransformersForCausalLM config patching and MTP weight stripping
- Remove pre-download to local dir (use Qwen/Qwen3.5-9B directly from HF)
- Remove --pre flag, huggingface_hub upgrade, ALLOWED_LAYER_TYPES patch
- Use python directly instead of uv run --isolated
- Match VL branch GPU config (8x), disk (750GB), and training params
- Use simpler transformers install (uv pip install -U, not --force-reinstall)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set MAX_INPUT_LENGTH to 64000 for tool_use (text-only doesn't need 131K)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove retry logic from training workflow

SkyPilot --retry-until-up handles provisioning retries internally.
The bash-level retry loop with 2-4-8 minute delays was redundant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: detect VL models and use correct model class for FSDP training

Qwen3.5-9B is natively multimodal (has vision_config), so vLLM loads
Qwen3_5ForConditionalGeneration with weights under language_model.model.*.
FSDP was using AutoModelForCausalLM (weights under model.*), causing
NCCL weight sync timeout due to parameter name mismatch.

Backport VL model detection from feat/qwen3.5-9b: load config first,
check for vision_config, and use AutoModelForImageTextToText when detected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Enable thinking mode for Qwen3.5-9B via chat_template_kwargs

Qwen3.5 requires enable_thinking=True in the chat template to activate
reasoning mode. Without this, the model never generates <think> tokens —
confirmed 0/177 eval trajectories had thinking across all 14 environments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Hydra override: use + prefix for chat_template_kwargs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: stop pre-existing Ray before starting training cluster

SkyPilot on GCP starts its own Ray on port 6380 which claims all GPUs.
The training Ray on port 6479 then can't allocate placement groups.
Force-stop all Ray instances before starting ours.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: reuse existing Ray instead of stopping it

SkyPilot on GCP runs jobs via its own Ray (port 6380). The previous
ray stop --force killed the job runner itself. Instead, detect if Ray
is already running and reuse it; only start a new one if none exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable GCP resources, revert Ray to explicit port 6479

GCP disabled until SkyPilot Ray conflict is resolved (internal Ray on
port 6380 conflicts with training Ray on 6479). Reverted ray status
checks to explicit --address 127.0.0.1:6479.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Enable GCP spot fallback with NVIDIA driver 570+ image

SkyPilot's default GCP GPU image (skypilot-gcp-gpu-ubuntu-241030) ships
NVIDIA driver 535.216.01 which does not support H200 GPUs (cuInit returns
CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE). Fix by specifying a Google
Deep Learning VM image with driver 570 and CUDA 12.8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove 6-GPU fallbacks for Qwen3.5-9B

6x GPU configs OOM during reference model forward pass — without
flash-attn cross-entropy, logprobs_from_logits materializes full
(seqlen, vocab) tensors which exceed per-GPU memory with 6-way FSDP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Disable GCP spot fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use chunked gather+logsumexp for bf16 logprobs to prevent OOM

Without flash-attn, F.log_softmax materializes a full (seqlen, vocab)
float32 tensor which OOMs for long sequences with large vocabs
(e.g. 64K * 151K * 4B = 36 GiB, or 325K * 151K * 4B = 184 GiB).
Replace with chunked gather+logsumexp in float32 per 1024-token chunk,
reducing peak memory from O(seqlen * vocab) to O(CHUNK_SIZE * vocab).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
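The chunked computation described in the commit above can be sketched as follows. This is a minimal illustration, not the actual patch; the function name, signature, and default chunk size are assumptions:

```python
import torch

def chunked_logprobs(logits: torch.Tensor, labels: torch.Tensor,
                     chunk_size: int = 1024) -> torch.Tensor:
    """Per-token log-probs without materializing a full float32
    (seqlen, vocab) log-softmax tensor. Peak extra memory drops from
    O(seqlen * vocab) to O(chunk_size * vocab)."""
    out = []
    for i in range(0, logits.shape[0], chunk_size):
        chunk = logits[i:i + chunk_size].float()  # upcast one chunk at a time
        # log p(y) = z_y - logsumexp(z): gather the label logit, subtract the normalizer
        picked = chunk.gather(-1, labels[i:i + chunk_size, None]).squeeze(-1)
        out.append(picked - chunk.logsumexp(dim=-1))
    return torch.cat(out)
```

Numerically this matches a full `log_softmax` followed by a gather, without ever holding the (seqlen, vocab) float32 tensor.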

* fix: use adaptive chunk size in logprobs_from_logits_v2 to avoid OOM during training

During training forward_backward, FSDP + gradients + activations consume ~128 GiB
leaving only ~300 MiB free. The fixed 1024-token chunk upcast to float32 needs
~970 MiB (1024 * 248320 vocab * 4 bytes) which OOMs. Now dynamically computes
chunk size targeting ~64 MiB per chunk based on vocab size (~67 tokens for
Qwen3.5-9B's 248K vocab).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
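The sizing arithmetic from the commit above is simple enough to sketch; the helper name and 64 MiB budget mirror the description, but the real logic lives inside logprobs_from_logits_v2:

```python
def adaptive_chunk_size(vocab_size: int,
                        budget_bytes: int = 64 * 1024 * 1024,
                        dtype_bytes: int = 4,
                        min_chunk: int = 1) -> int:
    """Largest token count per chunk such that one float32
    (chunk, vocab) slice fits in the memory budget."""
    return max(min_chunk, budget_bytes // (vocab_size * dtype_bytes))
```

For Qwen3.5-9B's 248320-entry vocab this yields 67 tokens per chunk, matching the figure in the commit message.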

* fix: ensure CUDA package version consistency after vllm nightly install

vllm nightly can upgrade some nvidia CUDA packages (e.g. cusparse to 12.9)
while leaving others at 12.8 (e.g. nvjitlink), causing ImportError:
"undefined symbol: __nvJitLinkGetErrorLogSize_12_9, version libnvJitLink.so.12"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: re-enable flash-attn with torch 2.10 prebuilt wheel

Install community-built flash-attn 2.8.3 wheel for torch 2.10 + CUDA 12.8
instead of uninstalling flash-attn. This re-enables:
- Flash attention kernels for training (faster than SDPA fallback)
- Efficient triton cross_entropy_loss for logprobs computation

Set trainer.flash_attn=true to use flash attention during training.

Wheel source: Dao-AILab/flash-attention#2299

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: chunked lm_head to avoid OOM on large-vocab models

Avoids materializing the full (B, S, vocab_size) logits tensor during
training by computing lm_head + logprobs in chunks along the sequence
dimension. Each chunk uses gradient checkpointing so only one chunk's
logits are in memory at a time.

For Qwen3.5-9B (248K vocab) with 64K token sequences, this reduces
peak logits memory from ~32 GiB to ~1.9 GiB (chunk_size=4096).

Ported from upstream SkyRL's JAX backend (PR NovaSky-AI#902, loss_chunk_size).

Also disables eval_before_train for faster prototyping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
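The chunked lm_head + logprobs idea can be sketched as below, assuming a standalone function rather than the ported model_wrapper code; names and the chunk size are illustrative. Gradient checkpointing recomputes each chunk's logits during backward, so only one chunk's logits are live at a time:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def chunked_lm_head_logprobs(hidden: torch.Tensor, weight: torch.Tensor,
                             labels: torch.Tensor, chunk_size: int = 4096) -> torch.Tensor:
    """hidden: (B, S, H), weight: (V, H), labels: (B, S) ->
    per-token log-probs (B, S), never materializing (B, S, V) at once."""
    def _one_chunk(h, y):
        logits = F.linear(h, weight).float()          # (B, chunk, V), one chunk only
        return logits.gather(-1, y[..., None]).squeeze(-1) - logits.logsumexp(-1)
    pieces = []
    for i in range(0, hidden.shape[1], chunk_size):
        h = hidden[:, i:i + chunk_size]
        y = labels[:, i:i + chunk_size]
        # checkpoint: discard chunk logits after forward, recompute in backward
        pieces.append(checkpoint(_one_chunk, h, y, use_reentrant=False))
    return torch.cat(pieces, dim=1)
```

The follow-up commit below handles the FSDP2 DTensor case, where the lm_head weight must be gathered to a full tensor before the `F.linear` call.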

* chore: create v53 dataset (v52 minus github) and update default

v53 removes all 428 github tasks from tool_use (3382→2954).
Computer use unchanged (613 tasks, no github in v52).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle FSDP2 DTensor in chunked lm_head

FSDP2 wraps parameters as DTensors. Calling lm_head module inside
gradient_checkpoint caused DTensor/Tensor mismatch. Fix:
- Extract weight/bias and convert DTensor→full_tensor (differentiable all-gather)
- Use F.linear instead of module call
- Skip gradient_checkpoint for ref model (no_grad context)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add flash-linear-attention for Qwen3.5 GatedDeltaNet kernels

Without FLA, 24 of 32 layers fall back to unoptimized torch
implementation making backward pass ~30x slower.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add flash-linear-attention to pyproject.toml dependencies

Remove causal-conv1d from override-dependencies (which blocked it)
and add both causal-conv1d and flash-linear-attention to main deps.
Required for Ray workers to have FLA via uv runtime env hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: upgrade to vLLM 0.17.0 stable for Qwen3.5 support

- vLLM 0.17.0 has native Qwen3.5/GDN support + FlashAttention 4
- No longer need vllm nightly, nvidia package fixups, or flashinfer cleanup
- Pin torch==2.10.0 (from 2.9.0) to match vLLM 0.17.0
- FLA still in pyproject.toml for training model (HF transformers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove flashinfer-jit-cache from vllm extra to fix version mismatch

vLLM 0.17.0 brings flashinfer-python 0.6.4 but flashinfer-jit-cache
resolves to 0.5.3, causing RuntimeError on engine startup. Remove
jit-cache from vllm extra (keep for mcore only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix black formatting in model_wrapper.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Deniz <deniz@Mac.localdomain>
Root overlay (200GB) ran out of space during checkpoint save.
Volume (/workspace, 500GB) has plenty of room. Move:
- checkpoints: /workspace/ckpts/
- eval dumps: /workspace/exports/
- dataset: /workspace/data/fleet/
- Ray tmp: /workspace/skyrl-tmp/

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Adds trace config to FleetTaskEnv and uploads conversation traces
(including screenshots) at episode end during eval. Trace job is
created in trainer.eval() when FLEET_API_KEY is set.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Adds `partial_reward` config option under `fleet_task` and passes it
through to OpenEnv's FleetTaskEnv. Off by default.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Point workdir to main (feat/qwen3.5-9b-tool-use merged)
- Use OpenEnv@deniz/fleet_client (PR #1) instead of @deniz/fleet-logfire
- Enable eval_before_train for step-0 baseline
- Increase eval to 8 tasks/env (MAX_EVAL_PROMPTS 60→96, min_per_env 4→8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When SkyRL ends a trajectory early (context overflow), the verifier
never ran and the model got 0 reward. Now:
- Fleet env exposes close_async() which calls OpenEnv's close_async()
  and reads final_reward from the verifier
- Generator moves get_metrics() after _env_close() and injects
  final_reward into per_step_rewards for context-overflow trajectories

Requires fleet-ai/OpenEnv feat/close-verifier branch.

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: raise ulimit for open files in training run script

Ray + 8 vLLM engines + Fleet MCP connections exhaust the default 1024
file descriptor soft limit, causing "Too many open files" errors that
hang training. Set ulimit -n 65536 at the start of the run block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
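The actual fix is `ulimit -n 65536` in the shell run block; the same adjustment from inside Python, via the stdlib resource module, looks roughly like this (an unprivileged process may only raise the soft limit up to the hard limit):

```python
import resource

TARGET = 65536
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# cap the request at the hard limit unless the hard limit is unlimited
new_soft = TARGET if hard == resource.RLIM_INFINITY else min(TARGET, hard)
if soft < new_soft:
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```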

* fix: also raise ulimit in VL training YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: raise ulimit for open files in training run script

Ray + 8 vLLM engines + Fleet MCP connections exhaust the default 1024
file descriptor soft limit, causing "Too many open files" errors that
hang training. Set ulimit -n 65536 at the start of the run block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: also raise ulimit in VL training YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update Qwen3.5-9B training config to dataset v54

v54 patches 12 cross-contaminated verifiers identified in the v53 eval set.
See fleet-research-scripts/training-data-pipeline/v5/verifier-contamination-diagnosis.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix OpenEnv install to force-reinstall from git

Version number doesn't bump on every commit, so uv caches the old
wheel. Use --force-reinstall --no-cache-dir to always get latest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: pass@n requires full success (>= 1.0), not just positive reward

With partial_reward mode, tasks can return fractional rewards (e.g. 0.3).
The old `> 0` check counted these as passes, inflating pass@n metrics.
Changed to `>= 1.0` so only fully solved tasks count as passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
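The strict check can be illustrated as follows; the function name and the dict-of-rollouts input shape are assumptions, not the metric code itself:

```python
def pass_at_n(rewards_by_task: dict) -> float:
    """Fraction of tasks with at least one fully solved rollout.
    A rollout counts as a pass only at reward >= 1.0, so fractional
    partial rewards (e.g. 0.3) no longer inflate the metric."""
    passed = sum(any(r >= 1.0 for r in rollouts)
                 for rollouts in rewards_by_task.values())
    return passed / len(rewards_by_task)
```

Under the old `> 0` rule a task with rewards [0.3, 0.9] would have counted as a pass; with the strict threshold it does not.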

* fix: update per-dataset metric tests for strict pass@n threshold

Tests were using fractional rewards (0.5, 0.7, 0.9) and expecting
pass@n=1.0, but pass@n now requires >= 1.0 for a pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: shared training scripts, multi-node support, Qwen3.5-35B-A3B config

Extract ~170 lines of inline shell from the 9B SkyPilot config into
reusable scripts (fleet-common-setup.sh, fleet-common-run.sh,
fleet-qwen35-extra-setup.sh) with multi-node Ray support via
head/worker pattern from GSM8K example.

- Add Qwen3.5-35B-A3B MoE config (2-node default, 16 GPUs)
- Enable GCP with correct H200/B200 image (driver 570)
- Add num_nodes workflow input (1/2/4) with --num-nodes passthrough
- Remove old Qwen3 configs from workflow choices

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: default OpenEnv branch to deniz/fleet_client

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update README default branch to deniz/fleet_client

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wire cloud input to sky launch --cloud flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: point workdir ref to branch for testing

Scripts don't exist on main yet. Will revert to main before merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: install build-essential if c++ missing (GCP images)

causal-conv1d needs c++ compiler for CUDA extension build.
GCP deep learning images don't include build-essential by default.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use sudo for apt-get on GCP (runs as gcpuser, not root)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve extra-setup path to absolute before cd skyrl-train

The --extra-setup path is relative to repo root but cd skyrl-train
changes the working directory, breaking the relative path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --no-pytorch-alloc-conf to 35B config

expandable_segments is incompatible with vLLM 0.17.0 memory pool.
The 9B config already passes this flag; 35B was missing it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clean stale runner state before checkout to prevent log file conflicts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --use-python-direct and --set-ulimit to 35B config

uv run --isolated creates a temp env that doesn't have flash_attn_2_cuda,
causing ModuleNotFoundError at runtime. --use-python-direct runs from the
venv directly where the flash-attn wheel was installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add KillMode=control-group to runner setup to prevent zombie listeners

The default KillMode=process only kills runsvc.sh on restart, leaving
RunnerService.js and Runner.Listener children orphaned. With Restart=always
and WatchdogSec=300, this accumulates dozens of zombie listeners that fight
over _work/_temp/_runner_file_commands/, causing checkout failures.

KillMode=control-group kills the entire cgroup on restart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add self-healing process health check to runner-health workflow

SSHs into each healthy runner every 15 min to validate KillMode=control-group
and kill zombie Runner.Listener processes. Prevents checkout failures from
accumulating orphaned listeners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove nightly transformers install that breaks vLLM 0.17.0

The transformers main branch renamed `layer_type_validation` to
`validate_layer_type()`, breaking vLLM's qwen3_5_moe config import.
The locked version (4.57.3) has both Qwen3.5 support and the API
vLLM expects, so the nightly install is unnecessary and harmful.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Now that shared scripts are merged, the SkyPilot task YAMLs should
clone from main instead of the feature branch.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
)

* fix: pin transformers==5.1.0 and disable enforce_eager for Qwen3.5

The FSDP worker failed with `qwen3_5_moe` unrecognized because the
resolved transformers version didn't include Qwen3.5-MoE support.
Pin to 5.1.0 in the extra-setup script.

Also add `generator.enforce_eager=false` to both 9B and 35B task
configs to allow CUDA graph compilation instead of eager mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: point workdir ref to main for both task configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: pin transformers==4.57.3 (5.x renames qwen3_5_moe to qwen3_5_moe_text)

transformers 5.1.0 renamed the model type from `qwen3_5_moe` to
`qwen3_5_moe_text`, so AutoConfig.from_pretrained fails on the HF
checkpoint which still uses `qwen3_5_moe`. 4.57.3 has the correct
mapping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use transformers==5.3.0 (first version with qwen3_5_moe support)

4.57.3 predates Qwen3.5 (Nov 2025 vs Feb 2026). 5.1.0 doesn't
register qwen3_5_moe in AUTO_CONFIG_MAPPING. 5.3.0 is confirmed
to have full support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…at (#301)

On RunPod /workspace is persistent storage, but on GCP it doesn't exist
and mkdir /workspace fails with permission denied. Both setup and run
scripts now auto-detect: use /workspace if it exists and is writable,
otherwise fall back to $HOME. Explicit --data-root still works as override.

Also moves ckpt_path and export_path into the run script's common hydra
overrides so they use the resolved CKPT_ROOT instead of hardcoded paths.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add hint-augmented rollouts to rescue GRPO signal on dead prompts

When all raw rollouts for a prompt score below threshold (default 0.0),
build a hint from verifier feedback (ERROR/SUCCESS_ACCUMULATOR + tool
errors) and run additional hinted rollouts. Hinted samples share the
same instance_id so GRPO groups them with raw samples, creating reward
variance where there was none.

No LLM call — hints are formatted verifier feedback. New env instances
per hinted rollout (no state leakage). Disabled by default
(enable_hints: false).

Config: enable_hints, hint_reward_threshold, n_hint_samples
Metrics: hint/prompts_hinted, hint/hint_success_rate, hint/signal_rescued

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: guard hint metrics against non-dict env_metrics

MagicMock objects from test mocks are truthy for .get("is_hinted"),
causing TypeError when comparing MagicMock > 0. Add isinstance check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip hint augmentation during eval

Hints are for rescuing GRPO training signal on dead prompts. During
eval, we want true model capability without hints. At step 0, nearly
all prompts fail, causing massive hint rollout storms that OOM the
raylet. Gate on sampling_params is None (training mode).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use batch_metadata.training_phase to gate hints instead of sampling_params

sampling_params is never None — both training and eval pass a dict via
get_sampling_params_for_backend(). The previous guard (sampling_params is None)
silently disabled hint augmentation in all cases. Use batch_metadata.training_phase
== "train" which correctly distinguishes training from eval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use strict > for hint threshold so hints fire when all rewards are 0

With >= 0.0, prompts where all 4 samples scored exactly 0.0 were skipped
because 0.0 >= 0.0 is true. Changed to > so that threshold=0.0 means
"generate hints when max_reward is 0" (the intended behavior).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
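The boundary case reads as follows (a hypothetical helper; the real check lives in the generator):

```python
def should_hint(rewards, threshold: float = 0.0) -> bool:
    """Fire hint rollouts only when no raw sample beat the threshold.
    The strict > makes threshold=0.0 mean 'all rewards are exactly 0';
    the old >= comparison treated 0.0 itself as a success and skipped hints."""
    return not any(r > threshold for r in rewards)
```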

* fix: re-derive uids when hint augmentation expands generator output

When hint augmentation adds extra rollouts (e.g. 96→116), the uids list
from prepare_generator_input has fewer entries than the generator output.
Re-derive uids from the input trajectory_ids (which the generator mutates
in-place when appending hinted rollouts) to fix the IndexError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: include trajectory_ids in generator output for hint augmentation

The generator was reassigning trajectory_ids to a new list when appending
hints, so the trainer couldn't access the extended list. Two fixes:

1. Generator: extend trajectory_ids/env_classes in-place and always
   include trajectory_ids in the output (was None for non-step-wise).
2. Trainer: re-derive uids from generator_output["trajectory_ids"]
   when output size differs from input (hint augmentation adds rollouts).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* RLTF-SD: strip hint from training prompt for correct gradient

Replace hinted rollout prompt_ids with the original unhinted prompt_ids
so GRPO trains ∇θ log π(y_hint | x_0) instead of ∇θ log π(y_hint | x_0 + hint).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Log injected hint text for each hinted rollout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: first-turn baseline for hint-augmented GRPO

Compute GRPO group-mean baseline from raw (unhinted) samples only,
preventing hinted samples from contaminating the baseline and causing
training instability (RLTF-SD paper, Section 3.2).

- Generator: emit `is_hinted` boolean array in output
- Trainer: thread `is_hinted` through metadata and into advantage fn
- ppo_utils: use raw-only mean/std when `is_hinted` is provided
- Tests: 4 new tests covering basic, no-hints, mixed-groups, std-norm

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
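A sketch of the raw-only baseline, assuming flat lists rather than the actual ppo_utils tensor layout; the function name and fallback-when-all-hinted behavior are illustrative:

```python
import torch

def grpo_advantages(rewards, group_ids, is_hinted, eps: float = 1e-6) -> torch.Tensor:
    """Per-sample advantages with the GRPO group baseline (mean/std)
    computed from raw, unhinted samples only, so hinted rollouts
    cannot shift the baseline."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    hinted = torch.as_tensor(is_hinted, dtype=torch.bool)
    adv = torch.empty_like(rewards)
    for g in set(group_ids):
        idx = torch.tensor([i for i, gid in enumerate(group_ids) if gid == g])
        raw = rewards[idx][~hinted[idx]]
        base = raw if raw.numel() > 0 else rewards[idx]  # assumed fallback if all hinted
        mean, std = base.mean(), base.std(unbiased=False)
        adv[idx] = (rewards[idx] - mean) / (std + eps)
    return adv
```

With raw rewards [0, 1] and one hinted reward of 1, the baseline (mean 0.5, std 0.5) comes from the raw pair only; the hinted sample is normalized against it but does not contaminate it.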

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Deniz <deniz@Mac.localdomain>
* debug: add diagnostics and disable Ray memory monitor for GCP FSDP crash

FSDP ref workers are SIGKILL'd during model init on GCP (works on RunPod).
Both 9B single-node and 35B multi-node affected.

Changes:
- fleet-common-run.sh: dump system diagnostics (cgroup limits, memory,
  GPU info) before training, disable Ray memory monitor
  (RAY_DISABLE_MEMORY_MONITOR=1), add NCCL_DEBUG=WARN, capture dmesg
  and memory state on training failure
- 9b YAML: add missing ckpt_path/export_path hydra overrides
  (matching 35B YAML pattern with $HOME instead of /workspace)

If the crash disappears with RAY_DISABLE_MEMORY_MONITOR=1, Ray's
memory monitor is the root cause. If it persists, dmesg output will
show whether it's OOM killer, segfault, or something else.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: point workdir ref to debug branch for GCP FSDP crash diagnostics

The previous run used ref: main, so fleet-common-run.sh changes
(RAY_DISABLE_MEMORY_MONITOR, system diagnostics, crash diagnostics)
were never deployed. Point to debug branch temporarily.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: improve FSDP crash diagnostics

- Always dump post-training diagnostics (Hydra exits 0 even on crash)
- Add HYDRA_FULL_ERROR=1 for complete stack traces
- Add fabric manager status check (NVSwitch on H200/B200)
- Add GPU topology dump (nvidia-smi topo -m)
- Add NVIDIA driver/CUDA version info
- Upgrade NCCL_DEBUG to INFO with INIT,NET subsystems
- Increase dmesg capture to 80 lines, add "traps:" pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: replace training with FSDP diagnostic test on GCP

Replaces the training run with a comprehensive diagnostic script:
- System diagnostics (memory, cgroup, fabric manager, GPU topology)
- Step-by-step FSDP tests (NCCL init, broadcast, model load, FSDP2 wrap)
- Multi-GPU FSDP via Ray (reproduces the crash scenario)
- Post-test dmesg capture for OOM/segfault analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable NCCL NVLS to prevent FSDP worker SIGKILL on GCP H200

Root cause: NCCL's NVLink SHARP (NVLS) feature causes all workers to
SIGKILL when Fabric Manager isn't properly reset after VM creation on
GCP. This is a known issue: NVIDIA/nccl#1562

Fix: export NCCL_NVLS_ENABLE=0 before training.

Also reverts ref back to main (ref: debug/fsdp-gcp-crash caused
SkyPilot SIGABRT exit 134 on the GH Actions runner).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable NCCL NVLS to prevent FSDP worker SIGKILL on GCP H200/B200

Root cause: NCCL's NVLink SHARP (NVLS) feature causes all FSDP workers
to be SIGKILL'd when Fabric Manager isn't properly reset after VM
creation. This is a known issue with H200/B200 GPUs on cloud providers.
See: NVIDIA/nccl#1562

Changes:
- Add NCCL_NVLS_ENABLE=0 to fleet-common-run.sh (applies to all tasks)
- Add NCCL_NVLS_ENABLE=0 to 9B and 35B YAML run blocks (belt + suspenders)
- Add RAY_DISABLE_MEMORY_MONITOR=1 to prevent spurious Ray worker kills
- Add system diagnostics and crash diagnostics to fleet-common-run.sh
- Set HYDRA_FULL_ERROR=1 for complete stack traces on failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: inline-patch FSDP code with GPU memory diagnostics on GCP

NCCL_NVLS_ENABLE=0 didn't fix the FSDP worker SIGKILL on GCP. The
root cause is still unknown. This adds comprehensive diagnostics:

- Pre-training: system memory, cgroup limits, GPU info, fabric manager
- Inline Python patches to fsdp_utils.py and fsdp_strategy.py that add
  GPU memory logging at every stage of fsdp2_load_full_state_dict
- Background nvidia-smi dmon for continuous GPU memory monitoring
- Post-training: dmesg, memory state, GPU state
- Env vars: RAY_DISABLE_MEMORY_MONITOR=1, NCCL_DEBUG=INFO

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable NCCL P2P when Fabric Manager is down on GCP

Root cause found: Fabric Manager is FAILED on GCP spot B200/H200 VMs.
Without FM, NVSwitch/NVLink P2P communication crashes with SIGKILL
during the first NCCL dist.broadcast() in fsdp2_load_full_state_dict.

Diagnostic data from run 23234079453:
- GPU: NVIDIA B200, 183 GiB, driver 570.211.01
- Fabric Manager: FAILED
- GPU memory at crash: ~6 GiB used, ~175 GiB free (not OOM)
- dmesg: empty (not kernel OOM killer)
- Crash: during first dist.broadcast() of 760 FSDP params

Fix: Check FM status at startup and set NCCL_P2P_DISABLE=1 if FM is
not active. This forces NCCL to use shared memory transport instead
of NVLink/NVSwitch, which works without FM (slower but functional).

Also attempts to start FM before falling back.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
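
The fallback this commit describes can be sketched in Python (the actual change is shell code in fleet-common-run.sh, and a later commit removes these overrides on GCP; function names here are hypothetical): check Fabric Manager status, and when it is down, push NCCL onto shared-memory transport.

```python
import subprocess


def fabric_manager_active() -> bool:
    """Return True if the nvidia-fabricmanager systemd unit is running.

    `systemctl is-active` prints "active" on stdout when the unit is up.
    """
    try:
        out = subprocess.run(
            ["systemctl", "is-active", "nvidia-fabricmanager"],
            capture_output=True, text=True,
        )
        return out.stdout.strip() == "active"
    except FileNotFoundError:
        # No systemctl on this host (e.g. inside a container).
        return False


def nccl_fallback_overrides(fm_active: bool) -> dict:
    """NCCL env overrides to apply when Fabric Manager is down.

    Disabling P2P/NVLS forces NCCL onto shared-memory transport,
    which is slower but avoids the SIGKILL during dist.broadcast().
    """
    if fm_active:
        return {}
    return {"NCCL_P2P_DISABLE": "1", "NCCL_NVLS_ENABLE": "0"}
```

As the later commits in this log show, this override turned out to be the wrong fix on GCP, where the host manages the NVSwitch fabric.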

* debug: remove NCCL overrides, add FM debugging and NCCL shim inspection

GCP has a custom NCCL shim (/nccl-shim/) that manages all NCCL
configuration. Our manual env var overrides (NCCL_P2P_DISABLE,
NCCL_NVLS_ENABLE, NCCL_CUMEM_ENABLE) conflict with the shim and
may cause the worker SIGKILL during dist.broadcast().

Changes:
- Remove ALL NCCL env var overrides (unset them explicitly)
- Add Fabric Manager deep debugging (journalctl, config, direct invocation)
- Add NCCL shim inspection (contents, config, libraries)
- Add pre-training NCCL communication test
- Add NVLink status check
- Keep FSDP diagnostic patches for GPU memory monitoring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: simplify YAML to fix SkyPilot SIGABRT (remove Python heredocs)

SkyPilot was crashing with exit code 134 (SIGABRT) during launch,
likely due to the large YAML run block with embedded Python heredocs.
Simplified to only essential diagnostics and training command.

Key change: no NCCL env vars set (let GCP NCCL shim manage config).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: temporarily disable GCP (SkyPilot SIGABRT on GCP VM creation)

SkyPilot crashes with exit code 134 (SIGABRT) during GCP VM creation.
Temporarily remove GCP resource blocks to test on RunPod/Lambda/Nebius.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip NCCL_CUMEM_ENABLE on GCP to prevent FSDP worker SIGKILL

Root cause: GCP has a custom NCCL shim (/nccl-shim/) that manages all
NCCL configuration. Setting NCCL_CUMEM_ENABLE=0 (done by
prepare_runtime_environment() for vLLM compat) conflicts with the shim
and causes FSDP ref workers to be SIGKILL'd during the first
dist.broadcast() call in fsdp2_load_full_state_dict().

Changes:
- utils.py: Detect GCP (/nccl-shim or /usr/local/gib) and skip setting
  NCCL_CUMEM_ENABLE=0 when on GCP
- fleet-common-run.sh: Remove all NCCL env var overrides (NVLS, P2P,
  DEBUG), improve Fabric Manager restart (add persistence mode)
- 9B YAML: Restore GCP resource blocks, remove all debug diagnostics
- 35B YAML: Remove NCCL_NVLS_ENABLE=0 from run block

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: reset SkyPilot API state on runner to prevent SIGABRT

The self-hosted runner accumulates stale SkyPilot cluster refs that
cause SIGABRT during sky launch. Clean up the API server state at
the start of each run, and refresh cluster status after cloud checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: start SkyPilot API after gcloud install so GCP is detected

The API server captures PATH at startup. Starting it before gcloud is
installed means sky check can't find GCP tools. Move api start to the
Verify step (after Configure Cloud Credentials installs gcloud).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* revert: restore original workflow Clean/Verify steps

The aggressive SkyPilot cleanup (pkill, rm -rf api_server) was causing
the runner to fail. Revert to the original workflow structure that was
working in run 23232808254.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: use debug branch for workdir to test NCCL fix on GCP

The workdir.ref was 'main' which doesn't have the NCCL_CUMEM_ENABLE
skip in utils.py. Point to debug/fsdp-gcp-crash so the GCP VM uses
our fix. Will revert to 'main' after validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: add GCP detection logging and NCCL/FM diagnostics

Workers still SIGKILL'd on GCP even with NCCL_CUMEM fix. Need to
confirm: (1) GCP detection paths exist, (2) NCCL_CUMEM_ENABLE is
actually skipped, (3) Fabric Manager status, (4) NCCL shim presence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: robust GCP detection (DMI) + NCCL_DEBUG on GCP workers

Previous /nccl-shim detection may fail on B200 a4-highgpu VMs. Add
DMI product_name check ('Google Compute Engine') as fallback.
Also enable NCCL_DEBUG=INFO on GCP workers to capture communication
errors before SIGKILL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
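
The detection logic this commit describes (path markers plus the DMI fallback) could look roughly like the following; this is a hypothetical sketch, not the actual utils.py code, and the `root` parameter exists only to make it testable:

```python
import os


def looks_like_gcp(root: str = "/") -> bool:
    """Heuristic GCP detection.

    First check for GCP-specific install paths (/nccl-shim, /usr/local/gib),
    then fall back to the DMI product name, which reads
    "Google Compute Engine" on GCE VMs.
    """
    for marker in ("nccl-shim", "usr/local/gib"):
        if os.path.exists(os.path.join(root, marker)):
            return True
    dmi = os.path.join(root, "sys/class/dmi/id/product_name")
    try:
        with open(dmi) as f:
            return "Google Compute Engine" in f.read()
    except OSError:
        return False
```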

* fix: aggressive Fabric Manager restart on GCP spot VMs

Root cause confirmed: Fabric Manager fails to start on GCP spot VMs
(status: 'failed'), causing dist.broadcast() SIGKILL during FSDP init.
FM is required for NVLink P2P on NVSwitch GPUs (B200, H200 SXM).

- Add multi-attempt FM restart with full GPU reset cycle
- Add comprehensive FM failure diagnostics (journalctl, driver info)
- Keep NCCL_CUMEM_ENABLE skip on GCP as secondary fix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable NCCL P2P when Fabric Manager fails, propagate to workers

When Fabric Manager can't start on GCP spot VMs, set
NCCL_P2P_DISABLE=1 and NCCL_NVLS_ENABLE=0 as fallback. This forces
NCCL to use shared memory transport instead of NVLink, which is
slower but avoids SIGKILL during dist.broadcast().

Also propagate shell-level NCCL overrides to Ray workers via
runtime_env to ensure all FSDP workers see the settings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: don't override NCCL config on GCP — host manages NVSwitch

GCP's a4-highgpu/a3-ultragpu VMs manage NVSwitch at the host level.
The guest VM has no NVSwitch devices, so Fabric Manager correctly
reports "NV_WARN_NOTHING_TO_DO" and can't start — this is expected.

NVLink P2P works through GCP's host-managed fabric without FM.
GCP provides a custom NCCL shim (gIB) whose Guest Config Checker
expects NCCL_P2P_DISABLE, NCCL_NVLS_ENABLE, and NCCL_CUMEM_ENABLE
to be UNSET. Setting any of these breaks FSDP dist.broadcast().

Previous attempt disabled P2P as an FM-failure fallback, which
actually caused the crash by forcing NCCL away from the functional
NVLink path into gIB (inter-node only).

Changes:
- fleet-common-run.sh: Skip FM restart on GCP, keep it for non-GCP
- utils.py: Skip all NCCL env var overrides on GCP

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: comprehensive diagnostics for FSDP broadcast SIGKILL on GCP

- Add NCCL test broadcast before weight loading loop (fsdp_utils.py)
- Log /dev/shm size, GPU memory, progress during broadcast
- Set NCCL_DEBUG=INFO on GCP workers for transport visibility
- Expand shell diagnostics: /dev/shm, GPU topology, cgroups, ulimits
- Remount /dev/shm to 16G on GCP if too small (preventive)
- Capture full dmesg + cgroup events + Ray worker logs on crash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: source gIB NCCL env vars on GCP to prevent Config Checker SIGKILL

Root cause: GCP's gIB plugin includes a Config Checker that validates
NCCL env vars at the first collective operation. If vars don't match
expected values, it SIGKILLs the process — explaining why FSDP workers
die during dist.broadcast() (the first real multi-GPU collective).

Fix:
- Source /usr/local/gib/scripts/set_nccl_env.sh before Ray start
- Forward all NCCL_* env vars to Ray workers via runtime_env
- Add /usr/local/gib/lib64 to LD_LIBRARY_PATH
- Set NCCL_CUMEM_HOST_ENABLE=0 (driver 570 cuMem bug workaround)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
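
Forwarding NCCL_* vars to Ray workers matters because shell-level exports on the head node do not automatically reach worker processes. A minimal sketch of the filtering step (the real code passes the result as `runtime_env={"env_vars": ...}` when starting Ray; the helper name is hypothetical):

```python
def nccl_runtime_env(env: dict) -> dict:
    """Collect NCCL_* variables from a process environment so they can be
    forwarded to Ray workers via runtime_env env_vars."""
    return {k: v for k, v in env.items() if k.startswith("NCCL_")}
```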

* fix: disable gIB on GCP single-node to prevent NCCL broadcast crash

Root cause: GCP deep learning images install /etc/profile.d/nccl_env.sh
which auto-sets NCCL_NET=gIB and adds /usr/local/gib/lib64 to
LD_LIBRARY_PATH. The gIB plugin requires RDMA/InfiniBand (/dev/infiniband)
for inter-node communication. On instances without RDMA devices, gIB
fails to initialize → "Failed to initialize any NET plugin" → SIGKILL
during the first dist.broadcast() in FSDP weight loading.

For single-node training, gIB is unnecessary — intra-node communication
uses NVLink P2P directly. Fix: strip gIB from LD_LIBRARY_PATH and unset
NCCL_NET before starting Ray, so NCCL falls back to NVLink P2P + Socket.

Verified on GCP a3-ultragpu-8g (H200):
- WITH gIB forced (NCCL_NET=gIB): dist.broadcast crashes
- WITHOUT gIB: all 8-GPU broadcasts pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
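
The fix amounts to two environment edits before Ray starts: drop the gIB library directory from LD_LIBRARY_PATH and unset NCCL_NET. A hypothetical pure-function sketch (the actual change is shell code in the run script):

```python
GIB_LIB = "/usr/local/gib/lib64"


def strip_gib(env: dict) -> dict:
    """Return a copy of env with the gIB plugin disabled.

    Removing GIB_LIB from LD_LIBRARY_PATH and unsetting NCCL_NET lets
    NCCL fall back to NVLink P2P + Socket instead of failing to load
    the gIB NET plugin on hosts without RDMA devices.
    """
    env = dict(env)  # do not mutate the caller's mapping
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":")
             if p and p != GIB_LIB]
    if parts:
        env["LD_LIBRARY_PATH"] = ":".join(parts)
    else:
        env.pop("LD_LIBRARY_PATH", None)
    env.pop("NCCL_NET", None)
    return env
```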

* cleanup: remove FSDP broadcast diagnostic code

Root cause identified and fixed — no longer need the test broadcast,
/dev/shm check, or progress logging in fsdp2_load_full_state_dict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: strip gIB when RDMA hardware absent (not just single-node)

SkyPilot provisions GCP VMs with a single management NIC — no RDMA
networking. gIB requires ConnectX NICs + GPUDirect VPC networks.
Check /sys/class/infiniband instead of node count so multi-node
training also works (falls back to NVLink P2P + Socket/TCP).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
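
The gating condition described above can be sketched as a sysfs check (the `sysfs_root` parameter is an assumption added here purely to make the check testable; the real check is a shell test on /sys/class/infiniband):

```python
import os


def rdma_nics_present(sysfs_root: str = "/sys") -> bool:
    """True when RDMA-capable NICs are visible to the VM.

    gIB requires ConnectX NICs; a populated /sys/class/infiniband
    directory is the signal that such hardware exists.
    """
    ib = os.path.join(sysfs_root, "class/infiniband")
    return os.path.isdir(ib) and bool(os.listdir(ib))
```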

* add GCP on-demand fallback for B200/H200

Spot zones are frequently stocked out. Add on-demand options
so SkyPilot can fall back when spot is unavailable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* revert workdir ref to main for merge

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add GKE spot node pools (H200 + B200) with GPUDirect-RDMA

Adds GKE kubernetes entries with use_spot to both 9B and 35B task
YAMLs targeting fleet-rdma cluster. 35B includes network_tier: best
for RDMA inter-node networking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update gIB comments — multi-node uses GKE RDMA, not TCP fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove num_nodes workflow input — use YAML-defined value

num_nodes is a property of the task YAML, not a launch parameter.
The 9B config uses 1 node, 35B uses 2 nodes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* 35B: enable partial_reward, increase context to 96K, update to v55

- partial_reward=true: denser gradient signal, stabilized 9B iter#4
- max_input_length 64K→96K: carlisle/wallst/budget length-limited at 64K
- v55 dataset: remaining verifier contamination fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* 35B: TP=2 for vLLM engines to handle 96K context

8 engines with 2 GPUs each instead of 16 single-GPU engines.
More memory per engine for KV cache, faster prefill on long sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…12.8

causal-conv1d 1.5.x has a NameError (bare_metal_version not defined) in
setup.py when building from source with newer CUDA versions. 1.6.0+ fixes
the CUDA version detection logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a massive restructuring of the project, moving towards a modular skyrl full-stack library and adding the skyrl-agent component. It includes significant updates to documentation, Docker infrastructure, and adds numerous examples and data processing scripts. The changes are extensive and set a new foundation for the project. My review focuses on improving maintainability and portability by addressing issues like code duplication in Dockerfiles and hardcoded paths in scripts and documentation. I've also identified a potential bug in the GHA runner health check script.

Comment on lines +1 to +20
FROM anyscale/ray:2.51.1-slim-py312-cu128

RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential libnuma-dev

RUN cd /opt/nvidia && git clone --single-branch --branch core_r0.11.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
# the cuda compiler here is needed for deepspeed
RUN wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run \
&& sudo sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit && rm -rf cuda_12.8.0_570.86.10_linux.run

# only config pip index with https://pypi.tuna.tsinghua.edu.cn/simple if needed
# unset for now
RUN cd /opt/nvidia/Megatron-LM && pip3 install --no-deps -e .
RUN curl -LsSf https://astral.sh/uv/0.9.4/install.sh | sh
RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc


RUN sudo apt-get update \
&& sudo apt-get install -y openssh-server iputils-ping net-tools iproute2 traceroute netcat \
libopenexr-dev libxi-dev libglfw3-dev libglew-dev libomp-dev libxinerama-dev libxcursor-dev tzdata \
&& sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*

RUN sudo apt update && sudo apt install --fix-broken && sudo apt install -y default-jre-headless openjdk-8-jdk \
&& sudo apt-get clean \
&& sudo rm -rf /var/lib/apt/lists/*

Severity: high

There is significant code duplication between this Dockerfile and docker/Dockerfile. The first 20 lines are nearly identical. This makes maintenance difficult, as changes need to be applied in multiple places. Consider creating a common base Dockerfile (e.g., Dockerfile.base) and have Dockerfile, Dockerfile.megatron, and Dockerfile.ray244 build FROM it. This would centralize the common setup steps like installing system packages, CUDA toolkit, and uv.

Comment on lines +1 to +16
FROM anyscale/ray:2.44.0-slim-py312-cu128

RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential libnuma-dev

# the cuda compiler here is needed for deepspeed
RUN wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run \
&& sudo sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit && rm -rf cuda_12.8.0_570.86.10_linux.run
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc
RUN sudo apt-get update \
&& sudo apt-get install -y openssh-server iputils-ping net-tools iproute2 traceroute netcat \
libopenexr-dev libxi-dev libglfw3-dev libglew-dev libomp-dev libxinerama-dev libxcursor-dev tzdata \
&& sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
RUN sudo apt update && sudo apt install --fix-broken && sudo apt install -y default-jre-headless openjdk-8-jdk \
&& sudo apt-get clean \
&& sudo rm -rf /var/lib/apt/lists/*

Severity: high

This Dockerfile duplicates a large portion of setup code from docker/Dockerfile and docker/Dockerfile.megatron. To improve maintainability and reduce redundancy, it's highly recommended to use a multi-stage build or a common base image that contains all the shared setup instructions.

Comment on lines +17 to +19
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | sudo tee -a "$LOG_FILE" >/dev/null
}

Severity: high

The log function attempts to write to /var/log/runner-health-check.log using sudo tee. However, the setup-gha-runner.sh script installs the cron job for the current user, not root. When the cron job runs, the sudo command will fail because it cannot ask for a password, leading to the health check script failing to log anything. The log file should be written to a user-writable location, or the cron job should be installed in the system-wide crontab to run as root.

Suggested change
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | sudo tee -a "$LOG_FILE" >/dev/null
}
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
}

Comment on lines +34 to +35
- Fleet SDK repo: `/Users/deniz/repos/fleet-sdk`
- OpenEnv repo: `/Users/deniz/repos/OpenEnv`

Severity: medium

The file paths for fleet-sdk and OpenEnv are hardcoded to a specific user's local directory (/Users/deniz/...). This makes the instructions not generally usable for other developers. These should be replaced with placeholder paths or instructions on how to set them up.

Suggested change
- Fleet SDK repo: `/Users/deniz/repos/fleet-sdk`
- OpenEnv repo: `/Users/deniz/repos/OpenEnv`
- Fleet SDK repo: `<path/to/your/local/fleet-sdk>`
- OpenEnv repo: `<path/to/your/local/OpenEnv>`


```bash
# 1. Launch EC2 (copy config from existing runner)
aws ec2 run-instances --image-id ami-0c7217cdde317cfec --instance-type t3.xlarge --key-name gha-runner-key --security-group-ids sg-00fefd8181d51909d --subnet-id subnet-03879810067f57f85 --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50,"VolumeType":"gp3"}}]' --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=fleet-runner-N}]'

Severity: medium

The documentation contains hardcoded AWS resource IDs, such as AMI ID, security group ID, subnet ID, and instance IDs (also on line 76). This makes the commands not directly usable and prone to copy-paste errors. These should be replaced with placeholders (e.g., <YOUR_AMI_ID>) to make it clear that users need to substitute their own values.

Suggested change
aws ec2 run-instances --image-id ami-0c7217cdde317cfec --instance-type t3.xlarge --key-name gha-runner-key --security-group-ids sg-00fefd8181d51909d --subnet-id subnet-03879810067f57f85 --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50,"VolumeType":"gp3"}}]' --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=fleet-runner-N}]'
aws ec2 run-instances --image-id <YOUR_AMI_ID> --instance-type t3.xlarge --key-name <YOUR_KEY_NAME> --security-group-ids <YOUR_SECURITY_GROUP_ID> --subnet-id <YOUR_SUBNET_ID> --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50,"VolumeType":"gp3"}}]' --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=fleet-runner-N}]'

os.environ["OPENAI_API_KEY"] = "sc"
model = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(model)
dataset_file = "/data/sycao/r2e-all/train.parquet"

Severity: medium

The path to the dataset is hardcoded. This makes the script difficult to run in different environments. This path should be configurable, for example, via a command-line argument or an environment variable.

Suggested change
dataset_file = "/data/sycao/r2e-all/train.parquet"
dataset_file = os.environ.get("DATASET_FILE", "/data/sycao/r2e-all/train.parquet")

Comment on lines +3 to +8
DATA_DIR="/mnt/shared_storage/datasets/r2e-all"
TRAIN_DATA="${DATA_DIR}/train.parquet"
VAL_DATA="${DATA_DIR}/validation.parquet"

CKPT_DIR=$HOME/ckpts
EXPORT_DIR=$HOME/exports

Severity: medium

The script contains hardcoded paths for datasets and checkpoint directories (e.g., /mnt/shared_storage/datasets/r2e-all, $HOME/ckpts). This makes the script not portable and difficult for other developers to run without modification. It's better to use environment variables for these paths with sensible defaults, allowing users to easily override them.

Suggested change
DATA_DIR="/mnt/shared_storage/datasets/r2e-all"
TRAIN_DATA="${DATA_DIR}/train.parquet"
VAL_DATA="${DATA_DIR}/validation.parquet"
CKPT_DIR=$HOME/ckpts
EXPORT_DIR=$HOME/exports
DATA_DIR="${DATA_DIR:-/mnt/shared_storage/datasets/r2e-all}"
TRAIN_DATA="${TRAIN_DATA:-${DATA_DIR}/train.parquet}"
VAL_DATA="${VAL_DATA:-${DATA_DIR}/validation.parquet}"
CKPT_DIR="${CKPT_DIR:-$HOME/ckpts}"
EXPORT_DIR="${EXPORT_DIR:-$HOME/exports}"


@devin-ai-integration (bot) left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.



3 participants