@JRMeyer JRMeyer commented Nov 29, 2025

Summary

This PR enables proper GRPO training with importance sampling when using offline trajectory data (e.g., from vLLM traces). It includes four complementary changes:

1. Extract logprobs from dict messages

Problem: ART's tokenizer only extracted logprobs from OpenAI Choice objects, but offline trajectory data often stores logprobs in plain Python dicts. This caused all dict-message logprobs to be set to NaN, forcing the importance ratio to 1.0 for every token (effectively REINFORCE instead of GRPO).

Solution: Modified tokenize.py to also extract logprobs from dict messages that have the format {"logprobs": {"content": [{"logprob": -0.5}, ...]}}.
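
A minimal sketch of the extraction, assuming the dict shape shown above (the function name `extract_dict_logprobs` is illustrative and not ART's actual helper in tokenize.py):

```python
import math

def extract_dict_logprobs(message: dict) -> list[float]:
    """Pull per-token logprobs out of a plain dict message shaped like
    {"logprobs": {"content": [{"logprob": -0.5}, ...]}}.
    Tokens with no recorded logprob fall back to NaN."""
    content = (message.get("logprobs") or {}).get("content") or []
    return [entry.get("logprob", math.nan) for entry in content]


# Example: an offline trajectory message captured from a vLLM trace
message = {
    "role": "assistant",
    "content": "Hello!",
    "logprobs": {"content": [{"logprob": -0.5}, {"logprob": -1.2}]},
}
print(extract_dict_logprobs(message))  # [-0.5, -1.2]
```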

2. Strip logprobs before RULER scoring

Problem: When trajectories contain verbose logprobs data, sending them to the RULER judge causes context length errors.

Solution: Strip logprobs from trajectories before sending to RULER using strip_logprobs().
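
A rough sketch of what such a strip_logprobs() helper could do, assuming trajectories are lists of dict messages (the actual helper in the PR may operate on ART trajectory objects instead):

```python
def strip_logprobs(trajectory_messages: list[dict]) -> list[dict]:
    """Return copies of the messages with any 'logprobs' field removed,
    so the RULER judge never sees the verbose per-token data."""
    return [
        {key: value for key, value in message.items() if key != "logprobs"}
        for message in trajectory_messages
    ]
```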

3. Preserve _internal_config.engine_args

Problem: When using TrainableModel._internal_config.engine_args to configure vLLM engine settings (like max_logprobs), the configuration was silently lost when using the SkyPilot backend.

Solution: Add a model_validator(mode="wrap") to preserve _internal_config during Pydantic deserialization.
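
A minimal Pydantic v2 sketch of the wrap validator, using a simplified stand-in for TrainableModel (only the `_internal_config` / `engine_args` handling reflects the PR; the other field names are illustrative):

```python
from typing import Any
from pydantic import BaseModel, PrivateAttr, model_validator


class InternalConfig(BaseModel):
    engine_args: dict[str, Any] = {}


class TrainableModel(BaseModel):
    name: str
    _internal_config: InternalConfig | None = PrivateAttr(default=None)

    @model_validator(mode="wrap")
    @classmethod
    def _preserve_internal_config(cls, data: Any, handler):
        # Pydantic drops underscore-prefixed keys during model_validate(),
        # so pull _internal_config out of the raw input first...
        internal = None
        if isinstance(data, dict) and "_internal_config" in data:
            data = dict(data)
            internal = data.pop("_internal_config")
        model = handler(data)
        # ...and reattach it to the constructed model afterwards.
        if internal is not None:
            model._internal_config = InternalConfig.model_validate(internal)
        return model


restored = TrainableModel.model_validate(
    {"name": "agent", "_internal_config": {"engine_args": {"max_logprobs": 20}}}
)
print(restored._internal_config.engine_args)  # {'max_logprobs': 20}
```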

4. Add importance sampling observability metrics

Problem: ART computes importance sampling ratios internally but doesn't expose them, making it impossible to verify whether importance sampling is actually working.

Solution: Add three new metrics logged during training (see the sketch after this list):

  • frac_old_logprobs_valid: Fraction of old logprobs that are not NaN (0 = no importance sampling)
  • mean_importance_ratio: Mean π_new/π_old across assistant tokens (should vary around 1.0)
  • clip_fraction: Fraction of tokens where PPO clipping was triggered (>0 means off-policy correction active)
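
A rough sketch of how these diagnostics can be computed from per-token log-probabilities (illustrative only; the tensor names and the ε = 0.2 clip range matching [0.8, 1.2] below are assumptions, not the exact code added in the PR):

```python
import torch

def importance_sampling_metrics(
    new_logprobs: torch.Tensor,  # logprobs under the current policy
    old_logprobs: torch.Tensor,  # logprobs stored with the trajectories (NaN if missing)
    clip_epsilon: float = 0.2,
) -> dict[str, float]:
    valid = ~torch.isnan(old_logprobs)
    frac_valid = valid.float().mean().item()

    # Ratio π_new/π_old where old logprobs exist; missing entries fall back to 1.0,
    # which is exactly the degenerate REINFORCE case.
    ratio = torch.where(
        valid, (new_logprobs - old_logprobs).exp(), torch.ones_like(new_logprobs)
    )
    mean_ratio = ratio.mean().item()

    # Fraction of tokens where PPO clipping to [1 - ε, 1 + ε] would bind.
    clipped = (ratio < 1 - clip_epsilon) | (ratio > 1 + clip_epsilon)
    clip_fraction = clipped.float().mean().item()

    return {
        "frac_old_logprobs_valid": frac_valid,
        "mean_importance_ratio": mean_ratio,
        "clip_fraction": clip_fraction,
    }
```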

Impact

| Aspect | Before | After |
| --- | --- | --- |
| Importance ratio | 1.0 always (for dict messages) | π_new / π_old |
| PPO clipping | Never activates | Activates when ratio outside [0.8, 1.2] |
| Algorithm | REINFORCE | GRPO with importance sampling |
| Observability | None | frac_old_logprobs_valid, mean_importance_ratio, clip_fraction |

New Metrics Interpretation

| Metric | Working (GRPO) | Not Working (REINFORCE) |
| --- | --- | --- |
| frac_old_logprobs_valid | > 0 (close to 1.0) | = 0 (all NaN) |
| mean_importance_ratio | varies around 1.0 | exactly 1.0 |
| clip_fraction | > 0 | = 0 |

Test plan

  • Verified max_logprobs setting works with SkyPilot backend
  • Ran ./scripts/run_checks.sh - all checks pass
  • Test with training that uses offline trajectory data with logprobs
  • Verify new metrics appear in training logs/wandb

@JRMeyer JRMeyer force-pushed the fix/warn-engine-args-in-openai-server-config branch from dd383f5 to 5b26fd9 on December 1, 2025 17:31
@JRMeyer JRMeyer changed the title from "fix: preserve _internal_config.engine_args when using SkyPilot backend" to "feat: enable GRPO training with logprobs from offline trajectory data" on Dec 1, 2025
JRMeyer and others added 7 commits December 2, 2025 20:47
Add a runtime warning when users pass engine-initialization-only
arguments (max_logprobs, gpu_memory_utilization, tensor_parallel_size,
max_model_len) via OpenAIServerConfig.engine_args.

These arguments are silently ignored because the vLLM engine is
initialized by Unsloth before OpenAIServerConfig is applied.
The warning guides users to use TrainableModel._internal_config
instead.
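
A sketch of what such a warning could look like (the argument names come from this commit message; the function name and warning text are illustrative, not the exact code in the commit):

```python
import warnings

# Arguments that only take effect at vLLM engine initialization time.
ENGINE_INIT_ONLY_ARGS = {
    "max_logprobs",
    "gpu_memory_utilization",
    "tensor_parallel_size",
    "max_model_len",
}

def warn_on_engine_init_only_args(engine_args: dict) -> None:
    ignored = ENGINE_INIT_ONLY_ARGS & engine_args.keys()
    if ignored:
        warnings.warn(
            f"engine_args {sorted(ignored)} are ignored here because the vLLM "
            "engine is already initialized by Unsloth; set them via "
            "TrainableModel._internal_config.engine_args instead.",
            stacklevel=2,
        )
```
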
The _internal_config field was being lost when TrainableModel was
deserialized from JSON (e.g., when sent from client to SkyPilot backend).
This is because Pydantic ignores fields starting with an underscore during
model_validate().

Added a model_validator(mode="wrap") that extracts _internal_config from
the input data before validation and sets it after the model is created.

This fixes the "Cannot request more than 0 logprobs" error when using
_internal_config.engine_args with remote backends.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Adds three new metrics logged during training to help users verify
that importance sampling is working correctly:

- frac_old_logprobs_valid: Fraction of old logprobs that are not NaN
- mean_importance_ratio: Mean π_new/π_old across assistant tokens
- clip_fraction: Fraction of tokens where PPO clipping was triggered

These metrics help diagnose whether GRPO/PPO importance sampling is
active or if training has fallen back to vanilla REINFORCE (when all
logprobs are NaN).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@JRMeyer JRMeyer force-pushed the fix/warn-engine-args-in-openai-server-config branch from 63d68c0 to e859dd9 on December 3, 2025 01:51