Conversation
* feat(fleet): add Tinker backend for Fleet task training

Add support for training on Fleet environments using Tinker (hosted) as the training and inference backend. This provides an alternative to the existing PyTorch/Ray/vLLM setup.

New files:
- main_fleet_tinker.py: Training entrypoint using Tinker API
  - Uses existing FleetTaskEnv for environment interaction
  - GRPO advantage estimation
  - Checkpoint management
  - WandB logging
- openenv-fleet-train-tinker.yaml: CI workflow
  - Much simpler than the SkyPilot version (no GPU provisioning)
  - Tinker handles compute allocation
  - Same inputs (modality, env_key, max_tasks, etc.)

Required secrets:
- TINKER_API_KEY: Tinker hosted service authentication

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(tinker): complete Fleet+Tinker integration

- Add DictConfig wrapper for FleetTaskEnv (required by SkyRL's env)
- Configure Tinker ServiceClient with API URL and key from env vars
- Add advantage metrics (mean, std) to WandB logging
- Add per-environment rollout metrics (turns, tool_calls)
- Remove vLLM from CI (Tinker handles inference)
- Add TINKER_API_URL to CI environment

* feat(tinker): complete Fleet+Tinker integration with SkyRL metrics

Key changes:
- Use OpenEnv FleetTaskEnv directly with async methods (reset_async, step_async) to avoid nested asyncio.run() issues
- Add pass@k metrics matching SkyRL's implementation
- Add per-environment metrics (reward/{env}/pass_at_n, avg_score)
- Add evaluation metrics (eval/all/pass_at_1, per-env breakdown)
- Build system prompt with tools inline (like SkyRL's FleetTaskEnv)
- Proper task config normalization from JSON

WandB metrics now match SkyRL:
- reward/avg_pass_at_{n}: Overall pass@k
- reward/avg_raw_reward: Average reward
- reward/{env_key}/pass_at_{n}: Per-environment pass@k
- eval/all/pass_at_1: Evaluation pass@1
- eval/{env_key}/pass_at_1: Per-environment eval

* style: format fill_results_from_wandb.py with black

* docs: clarify TINKER_API_URL is optional

The Tinker SDK uses a default endpoint if TINKER_API_URL is not set. Only TINKER_API_KEY is required for authentication.

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
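The GRPO advantage estimation mentioned above can be sketched as follows. This is a minimal illustration of the common group-mean-baseline formulation, not the entrypoint's actual code; the function name and the flat-list layout (consecutive samples from the same prompt) are assumptions:

```python
def grpo_advantages(rewards, group_size, eps=1e-6):
    """GRPO-style advantages: normalize each reward by the mean and std of
    its group of samples drawn from the same prompt."""
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / len(group)
        var = sum((r - mean) ** 2 for r in group) / len(group)
        std = var ** 0.5
        # eps keeps the division stable when all samples got the same reward
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages
```

With binary rewards and n_samples_per_prompt=4, a single success in a group gets a positive advantage and the failures get matching negative ones, so each group's advantages sum to (approximately) zero.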
Temporarily send Slack notifications to the test channel while validating the Tinker integration.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The Fleet MCP client requires the 'mcp' package, which is an optional dependency in OpenEnv. Install with openenv[fleet] to get both the mcp and fleet-python dependencies.

Fixes: ModuleNotFoundError: No module named 'mcp'

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Qwen2.5-1.5B-Instruct is not supported by the Tinker API. Switch to Qwen3-VL-30B-A3B-Instruct, which is available.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
apply_chat_template can return a BatchEncoding dict instead of a plain list on some tokenizers. Tinker's ModelInput.from_ints() requires a plain list of integers. Added a tokenize_chat() helper to handle both cases.

Fixes: EncodedTextChunk validation error

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
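A tokenize_chat() helper of the kind described above might look like this. It is a sketch, not the actual implementation; treating BatchEncoding as a dict with an "input_ids" key, and unwrapping a possible batch dimension, are assumptions about tokenizer behavior:

```python
def tokenize_chat(tokenizer, messages):
    """Apply the chat template and always return a plain list of ints.

    Some tokenizers return a list of token IDs directly; others return a
    BatchEncoding (dict-like) whose "input_ids" holds the IDs.
    """
    out = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    if isinstance(out, dict) or hasattr(out, "keys"):  # BatchEncoding is dict-like
        out = out["input_ids"]
    if out and isinstance(out[0], list):  # unwrap a nested batch dim: [[ids]] -> [ids]
        out = out[0]
    return list(out)
```

The plain list can then be fed straight into ModelInput.from_ints() regardless of which shape the tokenizer produced.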
- Pass a single ModelInput, not a list
- Add the required num_samples=1 argument

API signature per docs: sample(prompt, num_samples, sampling_params, ...)

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
SampledSequence uses 'tokens', not 'token_ids'. Per API docs:
- stop_reason: reason why sampling stopped
- tokens: list of generated token IDs
- logprobs: log probabilities for each token

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix(tinker): add max_input_length check to prevent context overflow

Match SkyRL's approach in skyrl_gym_generator.py:274 - end the rollout when the context exceeds max_input_length instead of hitting an API error.

Changes:
- Add max_generate_length param (renamed from max_tokens)
- Add max_input_length param (default 30720 = 32768 - 2048)
- Check context length at the start of each turn; break with stop_reason="length"
- Track and return stop_reason in the rollout output

* chore: use #fleet-training-runs Slack channel

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
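The per-turn length check can be sketched as a stripped-down rollout loop. This is an illustration of the control flow only: generate and step_env are hypothetical stand-ins for the Tinker sampling call and the Fleet environment step, and the token lists stand in for the tokenized chat history:

```python
def run_rollout(generate, step_env, prompt_ids, max_turns=10,
                max_input_length=30720):
    """Multi-turn rollout that ends early when the context would overflow.

    generate(ids) -> (new_token_ids, done) and step_env(ids) -> ids are
    stand-ins for model sampling and the environment step.
    """
    stop_reason = "done"
    for _ in range(max_turns):
        # Check the budget *before* sampling, so an overlong prompt is never
        # sent to the API; 30720 = 32768 context minus 2048 generate budget.
        if len(prompt_ids) >= max_input_length:
            stop_reason = "length"
            break
        response, done = generate(prompt_ids)
        prompt_ids = prompt_ids + response
        if done:
            break
        prompt_ids = prompt_ids + step_env(response)
    else:
        stop_reason = "max_turns"
    return prompt_ids, stop_reason
```

Returning the stop_reason alongside the tokens lets the trainer distinguish clean completions from length-terminated rollouts when building metrics.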
* fix(tinker): truncate overlong sequences with DAPO filtering

Match SkyRL's approach: truncate sequences exceeding max_sequence_length and zero out their loss mask (DAPO overlong filtering). This prevents the Tinker API error while keeping the sequences in the batch.

Changes:
- Add max_sequence_length param (default 32768)
- Truncate sequences > max_sequence_length to fit the model context
- Zero out the loss mask for truncated sequences (they won't contribute to loss)
- Track truncated_overlong count in metrics

* docs: add Tinker development guidelines to CLAUDE.md

* Add unit tests for sequence truncation and overlong filtering

- Add integrations/fleet/utils.py with refactored pure Python functions:
  - truncate_sequence: truncate prompt+response to max_sequence_length
  - truncate_auxiliary_data: truncate logprobs/loss_mask to match
  - apply_overlong_filtering_simple: DAPO filtering (zero mask if no EOS)
  - prepare_training_sequence: combined truncation logic
- Add integrations/fleet/tests/test_tinker_training.py with 20 tests:
  - TestTruncateSequence: sequence truncation behavior
  - TestTruncateAuxiliaryData: logprobs/mask truncation
  - TestOverlongFiltering: DAPO EOS-based filtering
  - TestPrepareTrainingSequence: combined preparation
  - TestCombinedFlow: full DAPO + truncation flow

* Add duration timer to collect_fleet_rollout

- Track rollout duration in collect_fleet_rollout and error cases
- Log per-environment duration metrics: rollout/{env_key}/duration
- Log overall duration stats: rollout/avg_duration, max_duration, min_duration

* Refactor main_fleet_tinker.py to use FleetTaskEnv from env.py

- Add init_async() and step_async() methods to FleetTaskEnv in env.py
  - Async methods contain the actual logic (await OpenEnv's async methods)
  - Sync methods (init, step) are thin wrappers using asyncio.run()
  - Enables both sync (SkyRL generator) and async (Tinker) usage
- Refactor main_fleet_tinker.py to use the shared FleetTaskEnv wrapper:
  - Import FleetTaskEnv from integrations.fleet.env
  - Use env.init_async() for initialization (handles system prompt, tools)
  - Use env.step_async() for stepping (handles tool parsing, chat history)
  - Access env.chat_history for tokenization, env.turns/tool_calls for metrics
- Remove duplicated code from main_fleet_tinker.py:
  - load_tasks_from_json (now in env.py)
  - build_system_prompt (handled by env.init_async)
  - parse_tool_call (handled by env.step_async)
  - Manual chat history management

Single source of truth for Fleet environment logic.

* Extract prepare_training_data() function (matching SkyRL pattern)

Refactor training data preparation into a dedicated function, similar to SkyRL's generate_batched pattern:
- prepare_training_data() handles:
  1. DAPO overlong filtering (zero loss mask if no EOS)
  2. Sequence truncation for max_sequence_length
  3. Building Tinker Datum objects
- main() is now cleaner, focused on orchestration: rollout collection, metrics computation, the prepare_training_data() call, and the training step

This improves code organization and testability, and matches SkyRL's separation of concerns between rollout collection and data preparation.
* Extract compute_rollout_metrics() and update tests

- Add compute_rollout_metrics() function for all rollout metrics:
  - Core reward metrics (pass@n, avg_reward, mean_positive)
  - Advantage stats (mean, std)
  - Rollout counts (valid, total)
  - Per-environment metrics
  - Per-environment rollout stats (turns, tool_calls, duration)
  - Overall duration stats
- Update tests:
  - Add docstring explaining tests validate the prepare_training_data pattern
  - Add test_batch_processing for multi-rollout scenarios

main() is now cleaner with clear separation:
1. Rollout collection
2. Advantage computation
3. compute_rollout_metrics()
4. prepare_training_data()
5. Training step

* Add logging for invalid rollouts

Log each invalid rollout with:
- task_key
- error message (or "no response_ids")
- stop_reason

Also track the rollouts/invalid metric in WandB.

* Add Fleet integration architecture diagram

Document showing:
- Training loops (Tinker async vs SkyRL sync)
- SkyRL FleetTaskEnv wrapper (sync/async methods)
- OpenEnv FleetTaskEnv (low-level)
- Fleet Platform
- Communication flow and data structures

* Remove unused pytest import (ruff fix)

* Improve architecture diagram: add function details, simplify layout

* Fix diagram: SkyRL uses vLLM on GPU, not local inference

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
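The combined DAPO-filter-plus-truncation step described in these commits can be sketched as a pure function. This mirrors the described behavior (zero the mask when no EOS; trim the response and zero the mask when the sequence exceeds the context), but the exact signature is an assumption, not the code in integrations/fleet/utils.py:

```python
def prepare_training_sequence(prompt_ids, response_ids, loss_mask,
                              eos_token_id, max_sequence_length=32768):
    # DAPO overlong filtering: a response with no EOS was cut off by the
    # sampler, so zero its loss mask instead of dropping it from the batch.
    if eos_token_id not in response_ids:
        loss_mask = [0] * len(loss_mask)
    # Truncation: trim the response (and its mask) from the right so that
    # prompt+response fits the model context; a truncated sequence also
    # contributes no loss.
    if len(prompt_ids) + len(response_ids) > max_sequence_length:
        keep = max(0, max_sequence_length - len(prompt_ids))
        response_ids = response_ids[:keep]
        loss_mask = [0] * keep
    return prompt_ids, response_ids, loss_mask
```

Keeping filtered sequences in the batch (with a zeroed mask) rather than discarding them preserves the batch shape and the per-group sample counts used by the advantage computation.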
Previously, rollouts were collected sequentially in a for loop, waiting for each rollout to complete before starting the next. Since rollouts are independent and I/O-bound (HTTP calls to Fleet + Tinker), they can run concurrently. With batch_size=8 and n_samples_per_prompt=4, this could be up to 32x faster for rollout collection. Co-authored-by: Deniz <deniz@Mac.localdomain>
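The switch from a sequential loop to concurrent collection is, at its core, an asyncio.gather over independent rollout coroutines. A minimal sketch, with collect_one standing in for the per-task rollout coroutine (the real function makes HTTP calls to Fleet and Tinker):

```python
import asyncio

async def collect_rollouts(tasks, collect_one):
    """Run independent, I/O-bound rollouts concurrently instead of awaiting
    each one in a for loop; results come back in input order."""
    return await asyncio.gather(*(collect_one(t) for t in tasks))
```

Because the work is I/O-bound, wall-clock time approaches the duration of the slowest rollout rather than the sum of all of them.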
- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)

Co-authored-by: Deniz <deniz@Mac.localdomain>
* fix(ci): add cancelled notification and default to 50 steps

- Add Slack notification when a training run is cancelled
- Change default max_steps from 200 to 50 for faster iteration

* Rename Tinker -> SkyRL in Slack notifications

* Revert "Rename Tinker -> SkyRL in Slack notifications"

This reverts commit 000226361ce19b51dae9859c2d69663fcd03d77e.

* Rename Fleet -> SkyRL in SkyPilot workflow notifications

* Suppress MCP client INFO logs (mcp.client.streamable_http)

* Consolidate Slack notifications: generic headers with Backend field

- Change headers from 'Tinker/SkyRL Training' to just 'Training'
- Add a 'Backend' field (Tinker or SkyRL) in the message body
- Change channel to #fleet-training-runs-test for testing

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
* feat: add progress logging during rollout collection

Log progress at ~25%, 50%, 75%, 100% completion:

  Progress: 8/32 rollouts completed
  Progress: 16/32 rollouts completed
  Progress: 24/32 rollouts completed
  Progress: 32/32 rollouts completed

Uses asyncio.as_completed to track progress while maintaining parallel execution.

* fix(tinker): use dict access for TypedDict step output

BaseTextEnvStepOutput is a TypedDict, not a class with attributes. Use dict access (["observations"], ["reward"], ["done"]) instead of attribute access (.observations, .reward, .done).

* feat(tinker): add timestamps to log messages

Logs now show HH:MM:SS timestamps for easier debugging:

  12:34:56 INFO __main__: Step 0: Collecting rollouts...

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
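The asyncio.as_completed progress pattern above can be sketched as follows; the function name is hypothetical, and logging at each quarter of completion is the assumed milestone scheme:

```python
import asyncio

async def gather_with_progress(coros, log=print):
    """Collect results as rollouts finish, logging at ~25/50/75/100%
    completion while the coroutines run in parallel. Note that
    as_completed yields in completion order, so results are unordered."""
    total = len(coros)
    milestones = {max(1, total * q // 4) for q in (1, 2, 3, 4)}
    results = []
    for i, fut in enumerate(asyncio.as_completed(coros), start=1):
        results.append(await fut)
        if i in milestones:
            log(f"Progress: {i}/{total} rollouts completed")
    return results
```

Unlike asyncio.gather, this reports progress as each rollout finishes; if input ordering matters downstream, each result needs to carry its task identity.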
* fix(tinker): limit concurrent Fleet env connections with semaphore

Fleet MCP connections time out when too many are opened simultaneously. Add a semaphore to limit concurrent rollouts to 4 (configurable via the max_concurrent parameter).

With batch_size=8 and n_samples_per_prompt=4, we were trying to open 32 MCP connections at once, causing connection timeouts.

* refactor(tinker): use Pydantic RolloutOutput instead of dict

Replace dict returns with a typed Pydantic model for better validation and IDE support. Fields:
- prompt_ids, response_ids, logprobs, loss_mask (sequences)
- reward, task_key, env_key, turns, tool_calls, stop_reason, duration
- error (optional, for failed rollouts)

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
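The semaphore-bounded version of concurrent collection looks roughly like this (a sketch with hypothetical names; the key point is that the semaphore is held for the whole rollout, so at most max_concurrent MCP connections exist at once):

```python
import asyncio

async def collect_with_limit(tasks, collect_one, max_concurrent=4):
    """Run rollouts concurrently but cap how many are in flight at once,
    so the Fleet side never sees more than max_concurrent connections."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(task):
        async with sem:  # acquired for the whole rollout, incl. connection
            return await collect_one(task)

    return await asyncio.gather(*(bounded(t) for t in tasks))
```

This keeps the full task list submitted up front while throttling actual execution, which is simpler than batching the list manually.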
* fix(tinker): reduce max concurrent rollouts from 4 to 2

Fleet MCP connections are still timing out with 4 concurrent rollouts. Reduce to 2 to further decrease pressure on Fleet infrastructure.

* fix(tinker): use larger GitHub runner (4-cores) and restore concurrency

- Use ubuntu-latest-4-cores (4 vCPU, 16GB RAM) instead of ubuntu-latest
- Restore max_concurrent to 4 (the larger runner can handle more connections)

* fix(tinker): update WandB run name format to match SkyRL

Change from: tinker_tool_use_0a7278bc_20260128-2127
Change to:   fleet-tool-use-0a7278bc-20260128-2127

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Shows step progress with key metrics:

  Training: 10%|██ | 5/50 [02:30<22:30, pass@4=0.125, reward=0.05, time=30.1s]

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Quick reference for SkyRL and Tinker training runs:
- Backend comparison
- GitHub Actions parameters
- Monitoring (Slack, WandB)
- Testing small runs
- Troubleshooting common issues
- Required secrets

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Revert runner from ubuntu-latest-4-cores to ubuntu-latest (larger runners require org-level enablement)
- Reduce max_concurrent from 4 to 2 to prevent MCP timeouts

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix: disable pip cache for OpenEnv to get latest code

* fix: use ThreadPoolExecutor for env ops to isolate MCP connections

Match SkyRL's pattern - run env.init/step in threads so each gets its own event loop and isolated httpx connections. Fixes MCP timeout issues caused by shared connection pool contention.

* feat: increase ThreadPoolExecutor max_workers to 16

* feat: increase max_concurrent to 8 (safe with ThreadPoolExecutor)

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
502 Bad Gateway errors were occurring exactly 10 minutes into rollouts because instances were hitting TTL expiration. Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Deniz <deniz@Mac.localdomain>
(#85)

Two issues causing training failures:

1. TTL of 30 min still not enough - some rollouts with many turns take 30+ minutes, causing 502 Bad Gateway when instances expire. Increased to 7200s (2 hours).

2. Pydantic attribute access bug - RolloutOutput is a Pydantic model but the code was using dict-style `.get()` access. Fixed to use attribute access for filtering, and added a `rollout_to_dict()` helper for metrics functions that expect dict format.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Log each turn completion with:
- gen: Tinker generation time
- step: Fleet environment step time (MCP tool call)
- total: total turn time
- toks: tokens generated
- reward: current reward
- status: DONE or ...

Example output:

  [task_key] Turn 1: gen=2.3s step=1.5s total=3.8s toks=156 reward=0.00 ...
  [task_key] Turn 2: gen=1.8s step=0.9s total=2.7s toks=89 reward=1.00 DONE

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* feat(fleet): add step timing logs to FleetTaskEnv

Add timing instrumentation to compare SkyRL vs Tinker performance.

Init logs:

  [task_key] Init: env=3.2s reset=8.5s total=11.7s tools=100

Step logs:

  [task_key] Turn 1: step=85.2s mcp=85.0s tool=search reward=0.00 ...
  [task_key] Turn 2: step=42.1s mcp=42.0s tool=click reward=1.00 DONE

Also adds step_time and mcp_time to step metadata for downstream analysis.

* feat(tinker): add timing metrics to WandB

Track generation and Fleet step times in WandB.

Timing metrics:
- time/gen_total, time/gen_mean: Tinker generation time
- time/step_total, time/step_mean: Fleet MCP step time
- time/gen_pct, time/step_pct: percentage breakdown

Throughput metrics:
- throughput/tokens_total: total tokens generated
- throughput/tokens_per_sec_gen: generation throughput (Tinker only)
- throughput/tokens_per_sec_effective: end-to-end throughput (including MCP)

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
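The timing and throughput metrics listed above can be computed along these lines. The key names mirror the commit message; the exact aggregation (summing gen and step time to approximate the wall clock) is an assumption:

```python
def compute_timing_metrics(gen_times, step_times, tokens_per_rollout):
    """Aggregate per-rollout generation/step times into the WandB-style
    timing and throughput metrics described above."""
    gen_total, step_total = sum(gen_times), sum(step_times)
    wall = gen_total + step_total  # assumed wall-clock approximation
    tokens_total = sum(tokens_per_rollout)
    return {
        "time/gen_total": gen_total,
        "time/gen_mean": gen_total / len(gen_times),
        "time/step_total": step_total,
        "time/step_mean": step_total / len(step_times),
        "time/gen_pct": 100.0 * gen_total / wall,
        "time/step_pct": 100.0 * step_total / wall,
        "throughput/tokens_total": tokens_total,
        "throughput/tokens_per_sec_gen": tokens_total / gen_total,
        "throughput/tokens_per_sec_effective": tokens_total / wall,
    }
```

The gap between tokens_per_sec_gen and tokens_per_sec_effective is exactly the cost of the MCP steps, which is what makes the SkyRL-vs-Tinker comparison legible.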
* fix(tinker): use async sampling to avoid event loop blocking

The synchronous `.result()` call on `sampling_client.sample()` was blocking the event loop, causing concurrent rollouts to serialize instead of running in parallel. This resulted in 100+ second step times when the actual MCP calls only took ~1 second.

Changed to use `sample_async()` with the double-await pattern:
- the first await returns a future
- the second await gets the result

This allows the event loop to continue processing other rollouts while waiting for Tinker generation to complete.

* chore: use #fleet-training-runs channel (remove -test)

* chore: remove turn-based console logging (keep timing in metadata for WandB)

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
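The double-await pattern can be illustrated with a stand-in client. FakeSamplingClient is entirely hypothetical; it merely simulates the assumed Tinker behavior where sample_async() is awaited once to submit the request (yielding a future) and the future is awaited again for the result:

```python
import asyncio

class FakeSamplingClient:
    """Stand-in for a client whose sample_async() returns a future:
    the first await submits the request, the second yields the result."""
    async def sample_async(self, prompt):
        async def pending():
            await asyncio.sleep(0.01)  # simulated remote generation
            return f"completion for {prompt!r}"
        return asyncio.ensure_future(pending())

async def generate(client, prompt):
    future = await client.sample_async(prompt)  # first await: submit
    return await future                         # second await: result
```

While the future is pending, the event loop is free to advance other rollouts, which is what restores real concurrency compared with a blocking `.result()` call.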
Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
(#275)

Some v5 environments have large MCP tool schemas that can push the initial prompt past max_input_length before any generation happens. This causes response_end_idx to remain None, crashing with:

  TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

Changes:
- Bump MAX_INPUT_LENGTH from 48000 to 64000 (all model configs)
- Bump YaRN rope_scaling factor from 2.0 to 2.5 for Qwen3-8B (32768 * 2.5 = 81920 effective context, enough for 64K input + 8K generate)
- Qwen3-32B rope unchanged (40960 * 2.0 = 81920, already sufficient)
- Add a guard in agent_loop for when the prompt exceeds max_input_length before any generation: log a warning and return a zero-reward trajectory

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
With max_input_length=64000, forward_backward OOMs on H100s. Halve micro batch sizes:
- 8B (4gpu): 2 → 1
- 8B (8gpu): 4 → 2
- 8B step-wise: already at 1, no change

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Modality is already defined in each SkyPilot task YAML's MODALITY env var, so the workflow input was redundant. WANDB key selection now uses the task-derived modality instead of hardcoding tool_use. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default TTL is now None, which lets OpenEnv auto-select based on modality. Can be overridden via env_config.ttl_seconds in task YAML. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL's default.yaml hardcoded ttl_seconds=600, which overrode OpenEnv's 900s default. Logfire confirmed the orchestrator received 600s for every instance in the training run, causing premature expiry and downstream 502 Bad Gateway errors on tool calls. Co-authored-by: Deniz <deniz@Mac.localdomain> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add GCP spot H200 as fallback compute option in all task YAMLs

Adds GCP preemptible H200 entries to all SkyPilot task YAML configs as the last fallback option. GCP project fleet-compute-489000 has 64 preemptible H200 GPUs per region. Storage remains on S3 (cross-cloud).

* feat: add GCP to GHA workflow (cloud credentials, sky check, dropdown)

- Add GCP service account key setup in the Configure Cloud Credentials step
- Install google-api-python-client and google-cloud-storage on the runner
- Add gcp to the sky check verification step
- Add a gcp option to the cloud dropdown in workflow_dispatch

Requires the GCP_SERVICE_ACCOUNT_KEY secret to be added to the repo.

* fix: update test to match ttl_seconds=900 default

The default was changed to 900 in config but the test still asserted 600.

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: make Qwen3.5-9B the default tool-use training model

Adds a new task YAML for Qwen3.5-9B in text-only (tool_use) mode, adapted from the CU version on feat/qwen3.5-9b. Key differences from the Qwen3-8B config: v51 dataset, gpu_memory_utilization=0.8, no YaRN (native 262K context), flash_attn=false (torch 2.10 compat), vLLM nightly + transformers from source for Qwen3.5 support, and CUDA toolkit install for FlashInfer JIT (GatedDeltaNet kernels).

Updates the GHA workflow to make qwen3_5-9b the default task.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove xlam-70b and glm-4.7-flash from workflow options

* fix: set max_prompt_length to MAX_INPUT_LENGTH (64000)

2048 tokens was filtering out tasks with long tool schema lists. Since max_prompt_length controls the initial prompt filter in dataset.py, it should match the input budget so no valid tasks are dropped.

* fix: use feat/qwen3.5-9b branch for workdir (vLLM nightly compat)

The vLLM nightly stack (torch 2.10, newer Ray, transformers source) requires compatibility fixes that only exist on feat/qwen3.5-9b:
- Ray import fallback for the ray.experimental.collective.util removal
- torch 2.10 Parameter.__new__ patch for accelerate compat
- return_dict=False for newer transformers apply_chat_template
- VL model detection in model_wrapper (the Qwen3.5 config is VL)
- FSDP2 set-to-list fix

main branch code crashes immediately with ModuleNotFoundError.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cherry-pick vLLM nightly compat fixes instead of pointing to VL branch

Instead of using feat/qwen3.5-9b (which includes all VL/multimodal code), cherry-pick only the minimal compatibility fixes needed for the nightly stack:
- inference_engines/utils.py: Ray import fallback for the removed ray.experimental.collective.util module
- distributed/fsdp_utils.py: torch 2.10 Parameter.__new__ patch for accelerate compat + FSDP2 set-to-list fix
- generators/utils.py + skyrl_gym_generator.py: return_dict=False for newer transformers apply_chat_template (returns a dict by default now)
- dataset/dataset.py: same return_dict=False fix

workdir.ref now points to this branch (feat/qwen3.5-9b-tool-use).

* fix: update mock tokenizer to accept **kwargs (return_dict compat)

* fix: patch ALLOWED_LAYER_TYPES for vLLM nightly compat

vLLM nightly dev267+ imports ALLOWED_LAYER_TYPES from transformers.configuration_utils, but transformers main split it into ALLOWED_ATTENTION_LAYER_TYPES + ALLOWED_MLP_LAYER_TYPES. Add a post-install patch to create the alias.

* fix: skip missing FSDP layer classes instead of raising

Qwen3.5-9B's _no_split_modules includes Qwen3_5VisionBlock, which doesn't exist when loading as AutoModelForCausalLM (text-only). Skip missing classes with a warning instead of raising, and only raise if NO classes are found.

* fix: pre-download model weights to avoid HuggingFace race condition

When 4 FSDP workers + 4 vLLM engines all try to download model weights simultaneously without auth, HuggingFace rate-limits the requests, causing OSError: file not found. Pre-download in the setup phase before any parallel processes start.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use python for model pre-download instead of huggingface-cli

huggingface-cli is not on PATH in the SkyPilot setup environment despite the venv being activated. Use huggingface_hub.snapshot_download() directly.

* fix: limit vLLM max_model_len and disable experimental Mamba prefix caching

Qwen3.5-9B has native 262K context, but vLLM auto-detects this as max_model_len, causing massive KV cache + Mamba state allocation. Combined with experimental Mamba cache alignment, this causes NCCL ALLREDUCE timeouts (600s) as memory pressure stalls collective operations.

- Set max_model_len=73728 (64K input + 8K generate + padding)
- Disable prefix caching (experimental for Mamba/GatedDeltaNet layers)

* fix: use Hydra + prefix to append engine_init_kwargs

Hydra strict mode requires the + prefix when adding keys not in the base config struct. engine_init_kwargs is an empty dict by default, so max_model_len and enable_prefix_caching need +key=value syntax.

* fix: remove enable_prefix_caching from engine_init_kwargs

SkyRL's ray_wrapped_inference_engine already passes enable_prefix_caching as a separate kwarg; adding it to engine_init_kwargs causes a duplicate keyword argument error. max_model_len=73728 alone should fix the NCCL timeout by reducing memory allocation from 262K to 73K context.

* fix: download model to local dir to prevent HF cache race condition

snapshot_download to the HF cache still causes shard resolution failures when 4 FSDP workers load concurrently. Download to $HOME/models/ with local_dir and verify all shards are present before starting training.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add NCCL debug logging and reduce max_model_len to 32K

Previous runs show 12 min of GPU activity then a silent crash - consistent with an NCCL timeout. Add RAY_DEDUP_LOGS=0, NCCL_DEBUG=WARN for better error visibility. Reduce max_model_len from 73K to 32K to rule out memory pressure with GatedDeltaNet/linear attention state.

* fix: use transformers backend for vLLM (Qwen3_5 arch not natively supported)

vLLM nightly does not have native support for Qwen3_5ForConditionalGeneration, and the auto fallback doesn't work (it throws a ValidationError instead of falling back). Explicitly set model_impl=transformers to use the HuggingFace Transformers backend for inference via vLLM's serving infrastructure.

* fix: pin vLLM to Qwen3.5 support commit instead of unpinned nightly

The nightly extra-index-url resolves non-deterministically, sometimes picking a build without Qwen3_5ForConditionalGeneration support. Pin to commit 9562912 (PR #34110), which added the Qwen3.5 architecture. Also add verification that the vLLM module exists before proceeding.

* fix: pop hf_overrides from engine_init_kwargs to prevent duplicate kwarg

engine_init_kwargs is spread as **kwargs alongside rope_engine_kwargs, which also contains hf_overrides. Using .get() leaves hf_overrides in both dicts, causing a TypeError for a duplicate keyword argument. Convert to a regular dict and use .pop() to remove it from engine_init_kwargs.

* fix: use nightly index-url and patch config.json for TransformersForCausalLM

The pinned vLLM commit wheel wasn't being resolved properly, falling back to stable, which lacks Qwen3_5ForConditionalGeneration support.
Switch to --index-url for the nightly, and if the nightly still doesn't have native Qwen3.5, patch config.json to use the TransformersForCausalLM generic backend. Remove model_impl from engine_init_kwargs (handled by the config.json patch instead).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add language_model_only=true for vLLM Qwen3.5 multimodal arch

Qwen3.5-9B uses Qwen3_5ForConditionalGeneration (a multimodal arch), and vLLM nightly doesn't support this arch directly for text-only. The --language-model-only flag tells vLLM to skip the vision encoder and use the text-only path, which resolves the architecture error.

Also restore max_model_len to 73728 (it was accidentally lowered to 32768).

* fix: always use TransformersForCausalLM backend, remove language_model_only

The NCCL ALLREDUCE timeout (SeqNum=865) during weight sync is caused by weight name mismatches between FSDP and vLLM. FSDP loads Qwen3_5ForCausalLM via AutoModelForCausalLM, but vLLM's native handler uses different naming.

Fix: unconditionally patch config.json to TransformersForCausalLM so vLLM also loads via AutoModelForCausalLM → the same Qwen3_5ForCausalLM class → identical weight names for NCCL sync.

Remove language_model_only=true: this flag is for native multimodal handlers. TransformersForCausalLM already loads text-only via AutoModelForCausalLM, and the flag may cause errors when applied to a non-multimodal model wrapper.

* fix: add --retry-until-up for GPU capacity scarcity

SkyPilot will continuously retry all providers until resources are available, bounded by the 72h workflow timeout on the self-hosted runner.

* fix: install transformers from git LAST to prevent litellm downgrade

litellm pins transformers<5.0, overriding the git install (5.3.0.dev0) back to 4.57.6, which lacks transformers.models.qwen3_5.
Move git install after all other pip installs with --no-deps to prevent this. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: remove _MODELS import that broke on vLLM 0.13.0 nightly The setup check `from vllm.model_executor.models import _MODELS` fails on vLLM 0.13.0+ where that API was removed. Replace with a simple version assertion since we always patch config.json to TransformersForCausalLM anyway. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add 8-GPU fallback options (B200:8, H100:8, A100-80GB:8) When 4-GPU instances are sold out across all providers, fall back to 8-GPU configs. The 9B model fits easily and config uses $SKYPILOT_NUM_GPUS_PER_NODE dynamically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use --pre --extra-index-url for vLLM nightly install --index-url breaks dependency resolution since wheels.vllm.ai/nightly only has vllm wheels, causing uv to keep stable 0.13.0. Use --extra-index-url to keep PyPI for deps + --pre to accept dev versions. Also fix version check: require 'dev' in version string (0.13.0 falsely passed the old >=13 check despite being stable without Qwen3.5 support). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: force-reinstall transformers and assert version >= 4.58 Previous runs failed because transformers 4.57.x was still installed despite the git install. Use --force-reinstall to guarantee the git version overwrites, and add an assertion that fails fast if the wrong version is installed (instead of cryptic errors during training). 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: temporarily limit qwen3.5-9b to GCP-only to verify GCP works

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add GCP support to training workflow, update dataset to v52, VL disk/GPU changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: default data_version to v52 in GHA workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: install gcloud CLI for GCP support in training workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clean runner workspace before checkout to prevent stale temp files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove runner cleanup step that was deleting runner state files

The rm -rf $RUNNER_TEMP/* was nuking the _runner_file_commands directory
created during job setup, causing checkout to fail with "Missing file at
path". Use checkout's built-in clean: true instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use correct gcloud CLI download URL

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restart SkyPilot API server before sky check so it picks up gcloud in PATH

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use 8x GPUs for GCP spot (4x not available on GCP)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: upgrade huggingface_hub before transformers dev (is_offline_mode import)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable MTP and remove MTP weights from index for TransformersForCausalLM

Qwen3.5-9B has model.mtp (multi-token prediction) weights that the
TransformersForCausalLM backend can't load. Set num_nextn_predict_layers=0
in config.json and strip the MTP entries from the safetensors index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: align tool-use task YAML with working VL branch setup

- Remove TransformersForCausalLM config patching and MTP weight stripping
- Remove pre-download to local dir (use Qwen/Qwen3.5-9B directly from HF)
- Remove --pre flag, huggingface_hub upgrade, ALLOWED_LAYER_TYPES patch
- Use python directly instead of uv run --isolated
- Match VL branch GPU config (8x), disk (750GB), and training params
- Use simpler transformers install (uv pip install -U, not --force-reinstall)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set MAX_INPUT_LENGTH to 64000 for tool_use (text-only doesn't need 131K)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove retry logic from training workflow

SkyPilot --retry-until-up handles provisioning retries internally. The
bash-level retry loop with 2-4-8 minute delays was redundant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: detect VL models and use correct model class for FSDP training

Qwen3.5-9B is natively multimodal (has vision_config), so vLLM loads
Qwen3_5ForConditionalGeneration with weights under language_model.model.*.
FSDP was using AutoModelForCausalLM (weights under model.*), causing an
NCCL weight-sync timeout due to the parameter name mismatch. Backport VL
model detection from feat/qwen3.5-9b: load the config first, check for
vision_config, and use AutoModelForImageTextToText when detected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Enable thinking mode for Qwen3.5-9B via chat_template_kwargs

Qwen3.5 requires enable_thinking=True in the chat template to activate
reasoning mode. Without this, the model never generates <think> tokens —
confirmed 0/177 eval trajectories had thinking across all 14 environments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Hydra override: use + prefix for chat_template_kwargs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: stop pre-existing Ray before starting training cluster

SkyPilot on GCP starts its own Ray on port 6380, which claims all GPUs.
The training Ray on port 6479 then can't allocate placement groups.
Force-stop all Ray instances before starting ours.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: reuse existing Ray instead of stopping it

SkyPilot on GCP runs jobs via its own Ray (port 6380). The previous
ray stop --force killed the job runner itself. Instead, detect if Ray is
already running and reuse it; only start a new one if none exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable GCP resources, revert Ray to explicit port 6479

GCP is disabled until the SkyPilot Ray conflict is resolved (internal Ray
on port 6380 conflicts with the training Ray on 6479). Reverted ray status
checks to the explicit --address 127.0.0.1:6479.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Enable GCP spot fallback with NVIDIA driver 570+ image

SkyPilot's default GCP GPU image (skypilot-gcp-gpu-ubuntu-241030) ships
NVIDIA driver 535.216.01, which does not support H200 GPUs (cuInit returns
CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE). Fix by specifying a Google
Deep Learning VM image with driver 570 and CUDA 12.8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove 6-GPU fallbacks for Qwen3.5-9B

6x GPU configs OOM during the reference model forward pass — without
flash-attn cross-entropy, logprobs_from_logits materializes full
(seqlen, vocab) tensors, which exceed per-GPU memory with 6-way FSDP.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Disable GCP spot fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use chunked gather+logsumexp for bf16 logprobs to prevent OOM

Without flash-attn, F.log_softmax materializes a full (seqlen, vocab)
float32 tensor, which OOMs for long sequences with large vocabs
(e.g. 64K * 151K * 4B = 36 GiB, or 325K * 151K * 4B = 184 GiB). Replace
it with chunked gather+logsumexp in float32 per 1024-token chunk, reducing
peak memory from O(seqlen * vocab) to O(CHUNK_SIZE * vocab).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use adaptive chunk size in logprobs_from_logits_v2 to avoid OOM during training

During training forward_backward, FSDP + gradients + activations consume
~128 GiB, leaving only ~300 MiB free. The fixed 1024-token chunk upcast to
float32 needs ~970 MiB (1024 * 248320 vocab * 4 bytes), which OOMs. Now
the chunk size is computed dynamically, targeting ~64 MiB per chunk based
on vocab size (~67 tokens for Qwen3.5-9B's 248K vocab).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: ensure CUDA package version consistency after vllm nightly install

vllm nightly can upgrade some nvidia CUDA packages (e.g. cusparse to 12.9)
while leaving others at 12.8 (e.g. nvjitlink), causing ImportError:
"undefined symbol: __nvJitLinkGetErrorLogSize_12_9, version libnvJitLink.so.12"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: re-enable flash-attn with torch 2.10 prebuilt wheel

Install the community-built flash-attn 2.8.3 wheel for torch 2.10 + CUDA
12.8 instead of uninstalling flash-attn. This re-enables:

- Flash attention kernels for training (faster than the SDPA fallback)
- Efficient triton cross_entropy_loss for logprobs computation

Set trainer.flash_attn=true to use flash attention during training.

Wheel source: Dao-AILab/flash-attention#2299

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: chunked lm_head to avoid OOM on large-vocab models

Avoids materializing the full (B, S, vocab_size) logits tensor during
training by computing lm_head + logprobs in chunks along the sequence
dimension. Each chunk uses gradient checkpointing, so only one chunk's
logits are in memory at a time. For Qwen3.5-9B (248K vocab) with 64K-token
sequences, this reduces peak logits memory from ~32 GiB to ~1.9 GiB
(chunk_size=4096). Ported from upstream SkyRL's JAX backend
(PR NovaSky-AI#902, loss_chunk_size). Also disables eval_before_train for
faster prototyping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: create v53 dataset (v52 minus github) and update default

v53 removes all 428 github tasks from tool_use (3382→2954). Computer use
is unchanged (613 tasks, no github in v52).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle FSDP2 DTensor in chunked lm_head

FSDP2 wraps parameters as DTensors. Calling the lm_head module inside
gradient_checkpoint caused a DTensor/Tensor mismatch. Fix:

- Extract weight/bias and convert DTensor→full_tensor (differentiable all-gather)
- Use F.linear instead of the module call
- Skip gradient_checkpoint for the ref model (no_grad context)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add flash-linear-attention for Qwen3.5 GatedDeltaNet kernels

Without FLA, 24 of 32 layers fall back to the unoptimized torch
implementation, making the backward pass ~30x slower.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add flash-linear-attention to pyproject.toml dependencies

Remove causal-conv1d from override-dependencies (which blocked it) and add
both causal-conv1d and flash-linear-attention to the main deps. Required
for Ray workers to have FLA via the uv runtime env hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: upgrade to vLLM 0.17.0 stable for Qwen3.5 support

- vLLM 0.17.0 has native Qwen3.5/GDN support + FlashAttention 4
- No longer need vllm nightly, nvidia package fixups, or flashinfer cleanup
- Pin torch==2.10.0 (from 2.9.0) to match vLLM 0.17.0
- FLA still in pyproject.toml for the training model (HF transformers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove flashinfer-jit-cache from vllm extra to fix version mismatch

vLLM 0.17.0 brings flashinfer-python 0.6.4, but flashinfer-jit-cache
resolves to 0.5.3, causing a RuntimeError on engine startup. Remove
jit-cache from the vllm extra (keep it for mcore only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix black formatting in model_wrapper.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Deniz <deniz@Mac.localdomain>
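The chunked gather+logsumexp computation with an adaptive chunk size, as described in the two logprobs commits above, can be sketched roughly as follows. This is a minimal illustration, not the repo's actual `logprobs_from_logits_v2`; the function name and the 2-D `(seqlen, vocab)` input shape are assumptions for the sketch.

```python
import torch

def logprobs_from_logits_chunked(logits: torch.Tensor,
                                 labels: torch.Tensor,
                                 target_chunk_bytes: int = 64 * 1024 * 1024) -> torch.Tensor:
    """Per-token log-probs without materializing a full (seqlen, vocab)
    float32 tensor: each chunk is upcast to float32 and reduced with
    gather + logsumexp, so peak extra memory is O(chunk_size * vocab)
    instead of O(seqlen * vocab)."""
    seqlen, vocab = logits.shape
    # Adaptive chunk size: target ~64 MiB of float32 logits per chunk
    # (for a ~248K vocab this works out to roughly 67 tokens per chunk).
    chunk_size = max(1, target_chunk_bytes // (vocab * 4))
    out = torch.empty(seqlen, dtype=torch.float32, device=logits.device)
    for start in range(0, seqlen, chunk_size):
        end = min(start + chunk_size, seqlen)
        chunk = logits[start:end].float()  # upcast only this chunk
        # log p(label) = logit[label] - logsumexp(logits)
        picked = chunk.gather(-1, labels[start:end].unsqueeze(-1)).squeeze(-1)
        out[start:end] = picked - torch.logsumexp(chunk, dim=-1)
    return out
```

The result matches `log_softmax(...).gather(...)` to numerical tolerance while never holding more than one chunk's float32 logits in memory.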
Root overlay (200GB) ran out of space during checkpoint save. Volume
(/workspace, 500GB) has plenty of room. Move:

- checkpoints: /workspace/ckpts/
- eval dumps: /workspace/exports/
- dataset: /workspace/data/fleet/
- Ray tmp: /workspace/skyrl-tmp/

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Adds trace config to FleetTaskEnv and uploads conversation traces
(including screenshots) at episode end during eval. The trace job is
created in trainer.eval() when FLEET_API_KEY is set.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Adds a `partial_reward` config option under `fleet_task` and passes it
through to OpenEnv's FleetTaskEnv. Off by default.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Point workdir to main (feat/qwen3.5-9b-tool-use merged)
- Use OpenEnv@deniz/fleet_client (PR #1) instead of @deniz/fleet-logfire
- Enable eval_before_train for step-0 baseline
- Increase eval to 8 tasks/env (MAX_EVAL_PROMPTS 60→96, min_per_env 4→8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When SkyRL ends a trajectory early (context overflow), the verifier never
ran and the model got 0 reward. Now:

- Fleet env exposes close_async(), which calls OpenEnv's close_async()
  and reads final_reward from the verifier
- Generator moves get_metrics() after _env_close() and injects
  final_reward into per_step_rewards for context-overflow trajectories

Requires the fleet-ai/OpenEnv feat/close-verifier branch.

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
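The reward backfill described above can be sketched as a small async helper. This is a hypothetical rendering, not the generator's real code: `close_async()` returning a dict with a `final_reward` key is an assumption for this sketch.

```python
import asyncio

async def finalize_trajectory(env, per_step_rewards, ended_early):
    """Close the env, then backfill the verifier's score for trajectories
    that were truncated (context overflow) before the verifier could run.
    Hypothetical sketch of the fix described above."""
    result = await env.close_async()  # assumed to return {"final_reward": float | None}
    final_reward = (result or {}).get("final_reward")
    if ended_early and final_reward is not None and per_step_rewards:
        # The env steps never saw a verifier score; credit it on the last step.
        per_step_rewards[-1] = final_reward
    return per_step_rewards

class _FakeEnv:
    """Stand-in env for demonstration only."""
    async def close_async(self):
        return {"final_reward": 1.0}

rewards = asyncio.run(finalize_trajectory(_FakeEnv(), [0.0, 0.0, 0.0], ended_early=True))
```

Without the backfill, a truncated-but-actually-successful trajectory would train against a 0 reward.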
* fix: raise ulimit for open files in training run script

Ray + 8 vLLM engines + Fleet MCP connections exhaust the default 1024
file descriptor soft limit, causing "Too many open files" errors that
hang training. Set ulimit -n 65536 at the start of the run block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: also raise ulimit in VL training YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: raise ulimit for open files in training run script

Ray + 8 vLLM engines + Fleet MCP connections exhaust the default 1024
file descriptor soft limit, causing "Too many open files" errors that
hang training. Set ulimit -n 65536 at the start of the run block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: also raise ulimit in VL training YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update Qwen3.5-9B training config to dataset v54

v54 patches 12 cross-contaminated verifiers identified in the v53 eval
set. See fleet-research-scripts/training-data-pipeline/v5/verifier-contamination-diagnosis.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix OpenEnv install to force-reinstall from git

The version number doesn't bump on every commit, so uv caches the old
wheel. Use --force-reinstall --no-cache-dir to always get the latest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: pass@n requires full success (>= 1.0), not just positive reward

With partial_reward mode, tasks can return fractional rewards (e.g. 0.3).
The old `> 0` check counted these as passes, inflating pass@n metrics.
Changed to `>= 1.0` so only fully solved tasks count as passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update per-dataset metric tests for strict pass@n threshold

Tests were using fractional rewards (0.5, 0.7, 0.9) and expecting
pass@n=1.0, but pass@n now requires >= 1.0 for a pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
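The strict pass@n semantics above can be sketched as follows. A minimal illustration, not the repo's metric code: `pass_at_n` and the uid-keyed grouping are assumptions of the sketch.

```python
from collections import defaultdict

def pass_at_n(rewards, uids, threshold=1.0):
    """pass@n over rollout groups: a prompt counts as passed only if at
    least one of its n rollouts is fully solved (reward >= 1.0), so
    fractional partial_reward scores no longer inflate the metric."""
    groups = defaultdict(list)
    for uid, r in zip(uids, rewards):
        groups[uid].append(r)
    passed = [any(r >= threshold for r in rs) for rs in groups.values()]
    return sum(passed) / len(passed)
```

With rewards `[0.3, 0.9, 1.0]` for prompt a and `[0.5, 0.7]` for prompt b, only a passes under the strict threshold, while the old `> 0`-style check would count both.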
* feat: shared training scripts, multi-node support, Qwen3.5-35B-A3B config

Extract ~170 lines of inline shell from the 9B SkyPilot config into
reusable scripts (fleet-common-setup.sh, fleet-common-run.sh,
fleet-qwen35-extra-setup.sh) with multi-node Ray support via the
head/worker pattern from the GSM8K example.

- Add Qwen3.5-35B-A3B MoE config (2-node default, 16 GPUs)
- Enable GCP with correct H200/B200 image (driver 570)
- Add num_nodes workflow input (1/2/4) with --num-nodes passthrough
- Remove old Qwen3 configs from workflow choices

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: default OpenEnv branch to deniz/fleet_client

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update README default branch to deniz/fleet_client

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wire cloud input to sky launch --cloud flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: point workdir ref to branch for testing

Scripts don't exist on main yet. Will revert to main before merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: install build-essential if c++ missing (GCP images)

causal-conv1d needs a c++ compiler for its CUDA extension build. GCP deep
learning images don't include build-essential by default.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use sudo for apt-get on GCP (runs as gcpuser, not root)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve extra-setup path to absolute before cd skyrl-train

The --extra-setup path is relative to the repo root, but cd skyrl-train
changes the working directory, breaking the relative path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --no-pytorch-alloc-conf to 35B config

expandable_segments is incompatible with vLLM 0.17.0's memory pool. The
9B config already passes this flag; the 35B config was missing it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clean stale runner state before checkout to prevent log file conflicts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --use-python-direct and --set-ulimit to 35B config

uv run --isolated creates a temp env that doesn't have flash_attn_2_cuda,
causing a ModuleNotFoundError at runtime. --use-python-direct runs from
the venv directly, where the flash-attn wheel was installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add KillMode=control-group to runner setup to prevent zombie listeners

The default KillMode=process only kills runsvc.sh on restart, leaving
RunnerService.js and Runner.Listener children orphaned. With
Restart=always and WatchdogSec=300, this accumulates dozens of zombie
listeners that fight over _work/_temp/_runner_file_commands/, causing
checkout failures. KillMode=control-group kills the entire cgroup on
restart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add self-healing process health check to runner-health workflow

SSHs into each healthy runner every 15 min to validate
KillMode=control-group and kill zombie Runner.Listener processes.
Prevents checkout failures from accumulating orphaned listeners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove nightly transformers install that breaks vLLM 0.17.0

The transformers main branch renamed `layer_type_validation` to
`validate_layer_type()`, breaking vLLM's qwen3_5_moe config import. The
locked version (4.57.3) has both Qwen3.5 support and the API vLLM
expects, so the nightly install is unnecessary and harmful.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Now that the shared scripts are merged, the SkyPilot task YAMLs should
clone from main instead of the feature branch.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: pin transformers==5.1.0 and disable enforce_eager for Qwen3.5

The FSDP worker failed with `qwen3_5_moe` unrecognized because the
resolved transformers version didn't include Qwen3.5-MoE support. Pin to
5.1.0 in the extra-setup script. Also add `generator.enforce_eager=false`
to both 9B and 35B task configs to allow CUDA graph compilation instead
of eager mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: point workdir ref to main for both task configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: pin transformers==4.57.3 (5.x renames qwen3_5_moe to qwen3_5_moe_text)

transformers 5.1.0 renamed the model type from `qwen3_5_moe` to
`qwen3_5_moe_text`, so AutoConfig.from_pretrained fails on the HF
checkpoint, which still uses `qwen3_5_moe`. 4.57.3 has the correct
mapping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use transformers==5.3.0 (first version with qwen3_5_moe support)

4.57.3 predates Qwen3.5 (Nov 2025 vs Feb 2026). 5.1.0 doesn't register
qwen3_5_moe in AUTO_CONFIG_MAPPING. 5.3.0 is confirmed to have full
support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…at (#301)

On RunPod /workspace is persistent storage, but on GCP it doesn't exist
and mkdir /workspace fails with permission denied. Both setup and run
scripts now auto-detect: use /workspace if it exists and is writable,
otherwise fall back to $HOME. Explicit --data-root still works as an
override.

Also moves ckpt_path and export_path into the run script's common hydra
overrides so they use the resolved CKPT_ROOT instead of hardcoded paths.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
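The auto-detect logic above lives in the shell scripts; the same decision can be rendered in Python for clarity. A sketch under stated assumptions: `resolve_data_root` is an illustrative name, not a function in the repo.

```python
import os

def resolve_data_root(explicit=None, preferred="/workspace"):
    """Pick a writable data root: an explicit --data-root override wins,
    then /workspace (persistent on RunPod), else fall back to $HOME
    (e.g. on GCP, where /workspace doesn't exist)."""
    if explicit:
        return explicit
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return os.path.expanduser("~")
```

Checkpoint and export paths then hang off the resolved root instead of a hardcoded `/workspace`.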
* Add hint-augmented rollouts to rescue GRPO signal on dead prompts
When all raw rollouts for a prompt score below threshold (default 0.0),
build a hint from verifier feedback (ERROR/SUCCESS_ACCUMULATOR + tool
errors) and run additional hinted rollouts. Hinted samples share the
same instance_id so GRPO groups them with raw samples, creating reward
variance where there was none.
No LLM call — hints are formatted verifier feedback. New env instances
per hinted rollout (no state leakage). Disabled by default
(enable_hints: false).
Config: enable_hints, hint_reward_threshold, n_hint_samples
Metrics: hint/prompts_hinted, hint/hint_success_rate, hint/signal_rescued
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
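The hint-construction step above can be sketched as follows. This is an illustrative sketch, not the generator's actual code: the function name, dict-shaped requests, and field names are assumptions; only the mechanism (formatted verifier feedback, no LLM call, shared instance_id) comes from the commit.

```python
def build_hinted_requests(prompt, instance_id, verifier_feedback, tool_errors, n_hint_samples):
    """Format verifier feedback into a hint (no LLM call) and emit extra
    rollout requests that reuse the raw samples' instance_id, so GRPO
    groups hinted and raw rollouts together and gets reward variance."""
    lines = []
    if verifier_feedback:
        lines.append(f"Verifier feedback: {verifier_feedback}")
    lines += [f"Tool error: {err}" for err in tool_errors]
    hinted_prompt = prompt + "\n\nHint from a previous failed attempt:\n" + "\n".join(lines)
    # Same instance_id -> same GRPO group; each request gets a fresh env
    # instance downstream, so no state leaks between rollouts.
    return [
        {"prompt": hinted_prompt, "instance_id": instance_id, "is_hinted": True}
        for _ in range(n_hint_samples)
    ]
```

These requests run only when no raw rollout for the prompt beat the reward threshold.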
* fix: guard hint metrics against non-dict env_metrics
MagicMock objects from test mocks are truthy for .get("is_hinted"),
causing TypeError when comparing MagicMock > 0. Add isinstance check.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: skip hint augmentation during eval
Hints are for rescuing GRPO training signal on dead prompts. During
eval, we want true model capability without hints. At step 0, nearly
all prompts fail, causing massive hint rollout storms that OOM the
raylet. Gate on sampling_params is None (training mode).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use batch_metadata.training_phase to gate hints instead of sampling_params
sampling_params is never None — both training and eval pass a dict via
get_sampling_params_for_backend(). The previous guard (sampling_params is None)
silently disabled hint augmentation in all cases. Use batch_metadata.training_phase
== "train" which correctly distinguishes training from eval.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use strict > for hint threshold so hints fire when all rewards are 0
With >= 0.0, prompts where all 4 samples scored exactly 0.0 were skipped
because 0.0 >= 0.0 is true. Changed to > so that threshold=0.0 means
"generate hints when max_reward is 0" (the intended behavior).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: re-derive uids when hint augmentation expands generator output
When hint augmentation adds extra rollouts (e.g. 96→116), the uids list
from prepare_generator_input has fewer entries than the generator output.
Re-derive uids from the input trajectory_ids (which the generator mutates
in-place when appending hinted rollouts) to fix the IndexError.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: include trajectory_ids in generator output for hint augmentation
The generator was reassigning trajectory_ids to a new list when appending
hints, so the trainer couldn't access the extended list. Two fixes:
1. Generator: extend trajectory_ids/env_classes in-place and always
include trajectory_ids in the output (was None for non-step-wise).
2. Trainer: re-derive uids from generator_output["trajectory_ids"]
when output size differs from input (hint augmentation adds rollouts).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* RLTF-SD: strip hint from training prompt for correct gradient
Replace hinted rollout prompt_ids with the original unhinted prompt_ids
so GRPO trains ∇θ log π(y_hint | x_0) instead of ∇θ log π(y_hint | x_0 + hint).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Log injected hint text for each hinted rollout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: first-turn baseline for hint-augmented GRPO
Compute GRPO group-mean baseline from raw (unhinted) samples only,
preventing hinted samples from contaminating the baseline and causing
training instability (RLTF-SD paper, Section 3.2).
- Generator: emit `is_hinted` boolean array in output
- Trainer: thread `is_hinted` through metadata and into advantage fn
- ppo_utils: use raw-only mean/std when `is_hinted` is provided
- Tests: 4 new tests covering basic, no-hints, mixed-groups, std-norm
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
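The raw-only baseline above can be sketched as a small advantage function. An illustrative sketch, not the repo's ppo_utils implementation: the function name and the all-hinted-group fallback are assumptions.

```python
from statistics import fmean, pstdev

def grpo_advantages_raw_baseline(rewards, uids, is_hinted, eps=1e-6):
    """GRPO advantages with the per-group mean/std baseline computed from
    raw (unhinted) samples only, so hinted rollouts cannot shift the
    baseline for the group."""
    groups = {}
    for i, uid in enumerate(uids):
        groups.setdefault(uid, []).append(i)
    adv = [0.0] * len(rewards)
    for idx in groups.values():
        raw = [rewards[i] for i in idx if not is_hinted[i]]
        base = raw or [rewards[i] for i in idx]  # fallback if a group is all-hinted
        m, s = fmean(base), pstdev(base)
        for i in idx:
            # Hinted samples are still normalized, just not part of the baseline.
            adv[i] = (rewards[i] - m) / (s + eps)
    return adv
```

If hinted rewards entered the mean, a successful hinted rollout would drag the baseline up and punish the raw samples it was meant to rescue.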
---------
Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Deniz <deniz@Mac.localdomain>
* debug: add diagnostics and disable Ray memory monitor for GCP FSDP crash FSDP ref workers are SIGKILL'd during model init on GCP (works on RunPod). Both 9B single-node and 35B multi-node affected. Changes: - fleet-common-run.sh: dump system diagnostics (cgroup limits, memory, GPU info) before training, disable Ray memory monitor (RAY_DISABLE_MEMORY_MONITOR=1), add NCCL_DEBUG=WARN, capture dmesg and memory state on training failure - 9b YAML: add missing ckpt_path/export_path hydra overrides (matching 35B YAML pattern with $HOME instead of /workspace) If the crash disappears with RAY_DISABLE_MEMORY_MONITOR=1, Ray's memory monitor is the root cause. If it persists, dmesg output will show whether it's OOM killer, segfault, or something else. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: point workdir ref to debug branch for GCP FSDP crash diagnostics The previous run used ref: main, so fleet-common-run.sh changes (RAY_DISABLE_MEMORY_MONITOR, system diagnostics, crash diagnostics) were never deployed. Point to debug branch temporarily. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: improve FSDP crash diagnostics - Always dump post-training diagnostics (Hydra exits 0 even on crash) - Add HYDRA_FULL_ERROR=1 for complete stack traces - Add fabric manager status check (NVSwitch on H200/B200) - Add GPU topology dump (nvidia-smi topo -m) - Add NVIDIA driver/CUDA version info - Upgrade NCCL_DEBUG to INFO with INIT,NET subsystems - Increase dmesg capture to 80 lines, add "traps:" pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: replace training with FSDP diagnostic test on GCP Replaces the training run with a comprehensive diagnostic script: - System diagnostics (memory, cgroup, fabric manager, GPU topology) - Step-by-step FSDP tests (NCCL init, broadcast, model load, FSDP2 wrap) - Multi-GPU FSDP via Ray (reproduces the crash scenario) - Post-test dmesg capture for OOM/segfault analysis Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: disable NCCL NVLS to prevent FSDP worker SIGKILL on GCP H200 Root cause: NCCL's NVLink SHARP (NVLS) feature causes all workers to SIGKILL when Fabric Manager isn't properly reset after VM creation on GCP. This is a known issue: NVIDIA/nccl#1562 Fix: export NCCL_NVLS_ENABLE=0 before training. Also reverts ref back to main (ref: debug/fsdp-gcp-crash caused SkyPilot SIGABRT exit 134 on the GH Actions runner). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: disable NCCL NVLS to prevent FSDP worker SIGKILL on GCP H200/B200 Root cause: NCCL's NVLink SHARP (NVLS) feature causes all FSDP workers to be SIGKILL'd when Fabric Manager isn't properly reset after VM creation. This is a known issue with H200/B200 GPUs on cloud providers. 
See: NVIDIA/nccl#1562 Changes: - Add NCCL_NVLS_ENABLE=0 to fleet-common-run.sh (applies to all tasks) - Add NCCL_NVLS_ENABLE=0 to 9B and 35B YAML run blocks (belt + suspenders) - Add RAY_DISABLE_MEMORY_MONITOR=1 to prevent spurious Ray worker kills - Add system diagnostics and crash diagnostics to fleet-common-run.sh - Set HYDRA_FULL_ERROR=1 for complete stack traces on failure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: inline-patch FSDP code with GPU memory diagnostics on GCP NCCL_NVLS_ENABLE=0 didn't fix the FSDP worker SIGKILL on GCP. The root cause is still unknown. This adds comprehensive diagnostics: - Pre-training: system memory, cgroup limits, GPU info, fabric manager - Inline Python patches to fsdp_utils.py and fsdp_strategy.py that add GPU memory logging at every stage of fsdp2_load_full_state_dict - Background nvidia-smi dmon for continuous GPU memory monitoring - Post-training: dmesg, memory state, GPU state - Env vars: RAY_DISABLE_MEMORY_MONITOR=1, NCCL_DEBUG=INFO Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: disable NCCL P2P when Fabric Manager is down on GCP Root cause found: Fabric Manager is FAILED on GCP spot B200/H200 VMs. Without FM, NVSwitch/NVLink P2P communication crashes with SIGKILL during the first NCCL dist.broadcast() in fsdp2_load_full_state_dict. Diagnostic data from run 23234079453: - GPU: NVIDIA B200, 183 GiB, driver 570.211.01 - Fabric Manager: FAILED - GPU memory at crash: ~6 GiB used, ~175 GiB free (not OOM) - dmesg: empty (not kernel OOM killer) - Crash: during first dist.broadcast() of 760 FSDP params Fix: Check FM status at startup and set NCCL_P2P_DISABLE=1 if FM is not active. This forces NCCL to use shared memory transport instead of NVLink/NVSwitch, which works without FM (slower but functional). Also attempts to start FM before falling back. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: remove NCCL overrides, add FM debugging and NCCL shim inspection GCP has a custom NCCL shim (/nccl-shim/) that manages all NCCL configuration. Our manual env var overrides (NCCL_P2P_DISABLE, NCCL_NVLS_ENABLE, NCCL_CUMEM_ENABLE) conflict with the shim and may cause the worker SIGKILL during dist.broadcast(). Changes: - Remove ALL NCCL env var overrides (unset them explicitly) - Add Fabric Manager deep debugging (journalctl, config, direct invocation) - Add NCCL shim inspection (contents, config, libraries) - Add pre-training NCCL communication test - Add NVLink status check - Keep FSDP diagnostic patches for GPU memory monitoring Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: simplify YAML to fix SkyPilot SIGABRT (remove Python heredocs) SkyPilot was crashing with exit code 134 (SIGABRT) during launch, likely due to the large YAML run block with embedded Python heredocs. Simplified to only essential diagnostics and training command. Key change: no NCCL env vars set (let GCP NCCL shim manage config). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: temporarily disable GCP (SkyPilot SIGABRT on GCP VM creation) SkyPilot crashes with exit code 134 (SIGABRT) during GCP VM creation. Temporarily remove GCP resource blocks to test on RunPod/Lambda/Nebius. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: skip NCCL_CUMEM_ENABLE on GCP to prevent FSDP worker SIGKILL Root cause: GCP has a custom NCCL shim (/nccl-shim/) that manages all NCCL configuration. Setting NCCL_CUMEM_ENABLE=0 (done by prepare_runtime_environment() for vLLM compat) conflicts with the shim and causes FSDP ref workers to be SIGKILL'd during the first dist.broadcast() call in fsdp2_load_full_state_dict(). 
Changes: - utils.py: Detect GCP (/nccl-shim or /usr/local/gib) and skip setting NCCL_CUMEM_ENABLE=0 when on GCP - fleet-common-run.sh: Remove all NCCL env var overrides (NVLS, P2P, DEBUG), improve Fabric Manager restart (add persistence mode) - 9B YAML: Restore GCP resource blocks, remove all debug diagnostics - 35B YAML: Remove NCCL_NVLS_ENABLE=0 from run block Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: reset SkyPilot API state on runner to prevent SIGABRT The self-hosted runner accumulates stale SkyPilot cluster refs that cause SIGABRT during sky launch. Clean up the API server state at the start of each run, and refresh cluster status after cloud checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: start SkyPilot API after gcloud install so GCP is detected The API server captures PATH at startup. Starting it before gcloud is installed means sky check can't find GCP tools. Move api start to the Verify step (after Configure Cloud Credentials installs gcloud). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * revert: restore original workflow Clean/Verify steps The aggressive SkyPilot cleanup (pkill, rm -rf api_server) was causing the runner to fail. Revert to the original workflow structure that was working in run 23232808254. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: use debug branch for workdir to test NCCL fix on GCP The workdir.ref was 'main' which doesn't have the NCCL_CUMEM_ENABLE skip in utils.py. Point to debug/fsdp-gcp-crash so the GCP VM uses our fix. Will revert to 'main' after validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * debug: add GCP detection logging and NCCL/FM diagnostics Workers still SIGKILL'd on GCP even with NCCL_CUMEM fix. Need to confirm: (1) GCP detection paths exist, (2) NCCL_CUMEM_ENABLE is actually skipped, (3) Fabric Manager status, (4) NCCL shim presence. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: robust GCP detection (DMI) + NCCL_DEBUG on GCP workers

Previous /nccl-shim detection may fail on B200 a4-highgpu VMs. Add a DMI product_name check ('Google Compute Engine') as fallback. Also enable NCCL_DEBUG=INFO on GCP workers to capture communication errors before SIGKILL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: aggressive Fabric Manager restart on GCP spot VMs

Root cause confirmed: Fabric Manager fails to start on GCP spot VMs (status: 'failed'), causing dist.broadcast() SIGKILL during FSDP init. FM is required for NVLink P2P on NVSwitch GPUs (B200, H200 SXM).

- Add multi-attempt FM restart with full GPU reset cycle
- Add comprehensive FM failure diagnostics (journalctl, driver info)
- Keep NCCL_CUMEM_ENABLE skip on GCP as secondary fix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable NCCL P2P when Fabric Manager fails, propagate to workers

When Fabric Manager can't start on GCP spot VMs, set NCCL_P2P_DISABLE=1 and NCCL_NVLS_ENABLE=0 as fallback. This forces NCCL to use shared memory transport instead of NVLink, which is slower but avoids SIGKILL during dist.broadcast(). Also propagate shell-level NCCL overrides to Ray workers via runtime_env to ensure all FSDP workers see the settings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: don't override NCCL config on GCP — host manages NVSwitch

GCP's a4-highgpu/a3-ultragpu VMs manage NVSwitch at the host level. The guest VM has no NVSwitch devices, so Fabric Manager correctly reports "NV_WARN_NOTHING_TO_DO" and can't start — this is expected. NVLink P2P works through GCP's host-managed fabric without FM.

GCP provides a custom NCCL shim (gIB) whose Guest Config Checker expects NCCL_P2P_DISABLE, NCCL_NVLS_ENABLE, and NCCL_CUMEM_ENABLE to be UNSET. Setting any of these breaks FSDP dist.broadcast().
The previous attempt disabled P2P as an FM-failure fallback, which actually caused the crash by forcing NCCL away from the functional NVLink path into gIB (inter-node only).

Changes:
- fleet-common-run.sh: Skip FM restart on GCP, keep it for non-GCP
- utils.py: Skip all NCCL env var overrides on GCP

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* debug: comprehensive diagnostics for FSDP broadcast SIGKILL on GCP

- Add NCCL test broadcast before weight loading loop (fsdp_utils.py)
- Log /dev/shm size, GPU memory, progress during broadcast
- Set NCCL_DEBUG=INFO on GCP workers for transport visibility
- Expand shell diagnostics: /dev/shm, GPU topology, cgroups, ulimits
- Remount /dev/shm to 16G on GCP if too small (preventive)
- Capture full dmesg + cgroup events + Ray worker logs on crash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: source gIB NCCL env vars on GCP to prevent Config Checker SIGKILL

Root cause: GCP's gIB plugin includes a Config Checker that validates NCCL env vars at the first collective operation. If vars don't match expected values, it SIGKILLs the process — explaining why FSDP workers die during dist.broadcast() (the first real multi-GPU collective).

Fix:
- Source /usr/local/gib/scripts/set_nccl_env.sh before Ray start
- Forward all NCCL_* env vars to Ray workers via runtime_env
- Add /usr/local/gib/lib64 to LD_LIBRARY_PATH
- Set NCCL_CUMEM_HOST_ENABLE=0 (driver 570 cuMem bug workaround)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: disable gIB on GCP single-node to prevent NCCL broadcast crash

Root cause: GCP deep learning images install /etc/profile.d/nccl_env.sh, which auto-sets NCCL_NET=gIB and adds /usr/local/gib/lib64 to LD_LIBRARY_PATH. The gIB plugin requires RDMA/InfiniBand (/dev/infiniband) for inter-node communication. On instances without RDMA devices, gIB fails to initialize → "Failed to initialize any NET plugin" → SIGKILL during the first dist.broadcast() in FSDP weight loading.
For single-node training, gIB is unnecessary — intra-node communication uses NVLink P2P directly.

Fix: strip gIB from LD_LIBRARY_PATH and unset NCCL_NET before starting Ray, so NCCL falls back to NVLink P2P + Socket.

Verified on GCP a3-ultragpu-8g (H200):
- WITH gIB forced (NCCL_NET=gIB): dist.broadcast crashes
- WITHOUT gIB: all 8-GPU broadcasts pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* cleanup: remove FSDP broadcast diagnostic code

Root cause identified and fixed — no longer need the test broadcast, /dev/shm check, or progress logging in fsdp2_load_full_state_dict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: strip gIB when RDMA hardware absent (not just single-node)

SkyPilot provisions GCP VMs with a single management NIC — no RDMA networking. gIB requires ConnectX NICs + GPUDirect VPC networks. Check /sys/class/infiniband instead of node count so multi-node training also works (falls back to NVLink P2P + Socket/TCP).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* add GCP on-demand fallback for B200/H200

Spot zones are frequently stocked out. Add on-demand options so SkyPilot can fall back when spot is unavailable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* revert workdir ref to main for merge

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add GKE spot node pools (H200 + B200) with GPUDirect-RDMA

Adds GKE kubernetes entries with use_spot to both the 9B and 35B task YAMLs, targeting the fleet-rdma cluster. The 35B config includes network_tier: best for RDMA inter-node networking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update gIB comments — multi-node uses GKE RDMA, not TCP fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove num_nodes workflow input — use YAML-defined value

num_nodes is a property of the task YAML, not a launch parameter. The 9B config uses 1 node, 35B uses 2 nodes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
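The detection and fallback logic the commits above converge on can be sketched in shell. This is a minimal illustration, not the repo's actual fleet-common-run.sh/utils.py code: the paths come from the commit messages, and the helper names are invented here. The optional root prefix exists only so the detection is testable off-GCP.

```shell
#!/bin/sh
# Detect a GCP VM: check for the NCCL shim / gIB install dirs,
# with a DMI product_name check as fallback (per the commits above).
is_gcp() {
    root="${1:-}"   # optional prefix, for testing outside GCP
    [ -d "$root/nccl-shim" ] && return 0
    [ -d "$root/usr/local/gib" ] && return 0
    grep -qs "Google Compute Engine" "$root/sys/class/dmi/id/product_name"
}

# Drop gIB entries from a colon-separated library path list.
strip_gib_from_path() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -v '/usr/local/gib' | paste -sd: -
}

# On GCP without RDMA devices, strip gIB and let NCCL fall back
# to NVLink P2P + Socket (the fix the commits settle on).
if is_gcp && [ ! -d /sys/class/infiniband ]; then
    unset NCCL_NET
    LD_LIBRARY_PATH="$(strip_gib_from_path "${LD_LIBRARY_PATH:-}")"
    export LD_LIBRARY_PATH
fi
```

The RDMA check mirrors the final commit's approach: gate on /sys/class/infiniband rather than node count, so multi-node VMs without ConnectX NICs take the same fallback.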
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* 35B: enable partial_reward, increase context to 96K, update to v55

- partial_reward=true: denser gradient signal, stabilized 9B iter#4
- max_input_length 64K→96K: carlisle/wallst/budget length-limited at 64K
- v55 dataset: remaining verifier contamination fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* 35B: TP=2 for vLLM engines to handle 96K context

8 engines with 2 GPUs each instead of 16 single-GPU engines. More memory per engine for KV cache, faster prefill on long sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…12.8

causal-conv1d 1.5.x has a NameError (bare_metal_version not defined) in setup.py when building from source with newer CUDA versions. 1.6.0+ fixes the CUDA version detection logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces a massive restructuring of the project, moving towards a modular skyrl full-stack library and adding the skyrl-agent component. It includes significant updates to documentation, Docker infrastructure, and adds numerous examples and data processing scripts. The changes are extensive and set a new foundation for the project. My review focuses on improving maintainability and portability by addressing issues like code duplication in Dockerfiles and hardcoded paths in scripts and documentation. I've also identified a potential bug in the GHA runner health check script.
```dockerfile
FROM anyscale/ray:2.51.1-slim-py312-cu128

RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential libnuma-dev

RUN cd /opt/nvidia && git clone --single-branch --branch core_r0.11.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
# the cuda compiler here is needed for deepspeed
RUN wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run \
    && sudo sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit && rm -rf cuda_12.8.0_570.86.10_linux.run

# only config pip index with https://pypi.tuna.tsinghua.edu.cn/simple if needed
# unset for now
RUN cd /opt/nvidia/Megatron-LM && pip3 install --no-deps -e .
RUN curl -LsSf https://astral.sh/uv/0.9.4/install.sh | sh
RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc

RUN sudo apt-get update \
    && sudo apt-get install -y openssh-server iputils-ping net-tools iproute2 traceroute netcat \
    libopenexr-dev libxi-dev libglfw3-dev libglew-dev libomp-dev libxinerama-dev libxcursor-dev tzdata \
    && sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*

RUN sudo apt update && sudo apt install --fix-broken && sudo apt install -y default-jre-headless openjdk-8-jdk \
    && sudo apt-get clean \
    && sudo rm -rf /var/lib/apt/lists/*
```
There is significant code duplication between this Dockerfile and docker/Dockerfile. The first 20 lines are nearly identical. This makes maintenance difficult, as changes need to be applied in multiple places. Consider creating a common base Dockerfile (e.g., Dockerfile.base) and have Dockerfile, Dockerfile.megatron, and Dockerfile.ray244 build FROM it. This would centralize the common setup steps like installing system packages, CUDA toolkit, and uv.
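As an illustration of the suggested split, the shared prefix could live in a hypothetical `Dockerfile.base`, with the variant files building on top. Contents are lifted from the diffs in this PR; the `Dockerfile.base` / `skyrl-base` names are suggestions, and the two variants currently pin different Ray base tags (2.51.1 vs 2.44.0), which would need reconciling first:

```dockerfile
# Dockerfile.base — shared setup (hypothetical file)
FROM anyscale/ray:2.51.1-slim-py312-cu128
RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential libnuma-dev
# the cuda compiler here is needed for deepspeed
RUN wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run \
    && sudo sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit && rm -rf cuda_12.8.0_570.86.10_linux.run
RUN curl -LsSf https://astral.sh/uv/0.9.4/install.sh | sh
RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc

# Dockerfile.megatron — only the variant-specific layers remain (hypothetical tag):
#   FROM skyrl-base:latest
#   RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
#   RUN cd /opt/nvidia && git clone --single-branch --branch core_r0.11.0 \
#       https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
```

Building the base once (`docker build -f Dockerfile.base -t skyrl-base .`) then pointing each variant's `FROM` at it keeps the system-package and CUDA setup in one place.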
```dockerfile
FROM anyscale/ray:2.44.0-slim-py312-cu128

RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential libnuma-dev

# the cuda compiler here is needed for deepspeed
RUN wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run \
    && sudo sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit && rm -rf cuda_12.8.0_570.86.10_linux.run
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc
RUN sudo apt-get update \
    && sudo apt-get install -y openssh-server iputils-ping net-tools iproute2 traceroute netcat \
    libopenexr-dev libxi-dev libglfw3-dev libglew-dev libomp-dev libxinerama-dev libxcursor-dev tzdata \
    && sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
RUN sudo apt update && sudo apt install --fix-broken && sudo apt install -y default-jre-headless openjdk-8-jdk \
    && sudo apt-get clean \
    && sudo rm -rf /var/lib/apt/lists/*
```
```bash
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | sudo tee -a "$LOG_FILE" >/dev/null
}
```
The log function attempts to write to /var/log/runner-health-check.log using sudo tee. However, the setup-gha-runner.sh script installs the cron job for the current user, not root. When the cron job runs, the sudo command will fail because it cannot ask for a password, leading to the health check script failing to log anything. The log file should be written to a user-writable location, or the cron job should be installed in the system-wide crontab to run as root.
Suggested change:

```bash
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
}
```
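For the reviewer's other option — running the job as root via the system-wide crontab — a sketch of a `/etc/cron.d` entry follows. The schedule, script path, and file name are assumptions, not taken from the repo:

```shell
#!/bin/sh
# Hypothetical /etc/cron.d entry: unlike per-user crontabs, cron.d files
# carry a user field, so the job runs as root and can append to
# /var/log/runner-health-check.log without sudo.
CRON_ENTRY='*/5 * * * * root /usr/local/bin/runner-health-check.sh'

# It would be installed with something like:
#   printf '%s\n' "$CRON_ENTRY" | sudo tee /etc/cron.d/runner-health-check
printf '%s\n' "$CRON_ENTRY"
```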
- Fleet SDK repo: `/Users/deniz/repos/fleet-sdk`
- OpenEnv repo: `/Users/deniz/repos/OpenEnv`
The file paths for fleet-sdk and OpenEnv are hardcoded to a specific user's local directory (/Users/deniz/...). This makes the instructions not generally usable for other developers. These should be replaced with placeholder paths or instructions on how to set them up.
Suggested change:

- Fleet SDK repo: `<path/to/your/local/fleet-sdk>`
- OpenEnv repo: `<path/to/your/local/OpenEnv>`
```bash
# 1. Launch EC2 (copy config from existing runner)
aws ec2 run-instances --image-id ami-0c7217cdde317cfec --instance-type t3.xlarge --key-name gha-runner-key --security-group-ids sg-00fefd8181d51909d --subnet-id subnet-03879810067f57f85 --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50,"VolumeType":"gp3"}}]' --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=fleet-runner-N}]'
```
The documentation contains hardcoded AWS resource IDs, such as AMI ID, security group ID, subnet ID, and instance IDs (also on line 76). This makes the commands not directly usable and prone to copy-paste errors. These should be replaced with placeholders (e.g., <YOUR_AMI_ID>) to make it clear that users need to substitute their own values.
Suggested change:

```bash
aws ec2 run-instances --image-id <YOUR_AMI_ID> --instance-type t3.xlarge --key-name <YOUR_KEY_NAME> --security-group-ids <YOUR_SECURITY_GROUP_ID> --subnet-id <YOUR_SUBNET_ID> --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50,"VolumeType":"gp3"}}]' --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=fleet-runner-N}]'
```
```python
os.environ["OPENAI_API_KEY"] = "sc"
model = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(model)
dataset_file = "/data/sycao/r2e-all/train.parquet"
```
The path to the dataset is hardcoded. This makes the script difficult to run in different environments. This path should be configurable, for example, via a command-line argument or an environment variable.
Suggested change:

```python
dataset_file = os.environ.get("DATASET_FILE", "/data/sycao/r2e-all/train.parquet")
```
```bash
DATA_DIR="/mnt/shared_storage/datasets/r2e-all"
TRAIN_DATA="${DATA_DIR}/train.parquet"
VAL_DATA="${DATA_DIR}/validation.parquet"

CKPT_DIR=$HOME/ckpts
EXPORT_DIR=$HOME/exports
```
The script contains hardcoded paths for datasets and checkpoint directories (e.g., /mnt/shared_storage/datasets/r2e-all, $HOME/ckpts). This makes the script not portable and difficult for other developers to run without modification. It's better to use environment variables for these paths with sensible defaults, allowing users to easily override them.
Suggested change:

```bash
DATA_DIR="${DATA_DIR:-/mnt/shared_storage/datasets/r2e-all}"
TRAIN_DATA="${TRAIN_DATA:-${DATA_DIR}/train.parquet}"
VAL_DATA="${VAL_DATA:-${DATA_DIR}/validation.parquet}"
CKPT_DIR="${CKPT_DIR:-$HOME/ckpts}"
EXPORT_DIR="${EXPORT_DIR:-$HOME/exports}"
```