Commit ba8b4b4

[Algorithm] Async GRPO
1 parent 16b70be commit ba8b4b4

31 files changed, +2037 -355 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ htmlcov/
 .coverage
 .coverage.*
 .cache
+.neptune
 nosetests.xml
 coverage.xml
 *.cover

docs/source/reference/llms.rst

Lines changed: 23 additions & 1 deletion
@@ -10,9 +10,29 @@ TorchRL offers a set of tools for LLM post-training, as well as some examples fo
 Collectors
 ----------
 
-TorchRL offers a specialized collector class (:class:`~torchrl.collectors.llm.LLMCollector`) that is tailored for LLM
+TorchRL offers specialized collector classes (:class:`~torchrl.collectors.llm.LLMCollector` and :class:`~torchrl.collectors.llm.RayLLMCollector`) that are tailored for LLM
 use cases. We also provide dedicated updaters for some inference engines.
 
+LLM collectors allow tracking the version of the policy, which is useful for some use cases.
+This is done by adding a :class:`~torchrl.envs.llm.transforms.PolicyVersion` transform to the environment, which is
+then incremented by the collector after each weight update. To do this, one either provides the stateful version of the
+transform, or a boolean to the collector constructor.
+
+>>> from torchrl.envs.llm.transforms import PolicyVersion
+>>> from torchrl.collectors.llm import LLMCollector
+>>> from torchrl.collectors.llm.weight_update import vLLMUpdater
+>>> env = make_env()  # place your code here
+>>> policy = make_policy()  # place your code here
+>>> collector = LLMCollector(env, policy=policy, weight_updater=vLLMUpdater(), track_policy_version=True)
+>>> # init the updater
+>>> collector.weight_updater.init(...)
+>>> # the version is incremented after each weight update
+>>> collector.update_policy_weights_(state_dict=...)
+>>> print(collector.policy_version_tracker.version)
+>>> # the policy version is written in the data
+>>> for data in collector:
+...     print(data["policy_version"])
+
 .. currentmodule:: torchrl.collectors.llm
 
 .. autosummary::
@@ -21,6 +41,7 @@ use cases. We also provide dedicated updaters for some inference engines.
 
     vLLMUpdater
     LLMCollector
+    RayLLMCollector
 
 
 Data structures
@@ -182,6 +203,7 @@ transforms).
     MCPToolTransform
     BrowserTransform
     PythonInterpreter
+    PolicyVersion
     TemplateTransform
     Tokenizer
     as_nested_tensor
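A note on the mechanism documented above: version tracking boils down to a counter held by a transform, bumped by the collector after every weight update, and stamped into each collected batch. A minimal, framework-free sketch of that idea follows; the `ToyPolicyVersion` and `ToyCollector` names are illustrative stand-ins, not TorchRL's implementation.

```python
# Toy sketch of policy-version tracking; TorchRL's PolicyVersion/LLMCollector differ in detail.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyPolicyVersion:
    """Stateful transform-like object holding the current policy version."""
    version: int = 0

    def increment(self) -> None:
        self.version += 1

    def stamp(self, batch: dict) -> dict:
        # Write the current version into the collected data.
        batch["policy_version"] = self.version
        return batch


@dataclass
class ToyCollector:
    policy_version_tracker: ToyPolicyVersion = field(default_factory=ToyPolicyVersion)

    def update_policy_weights_(self, state_dict: Any) -> None:
        # ... push the new weights to the inference engine here ...
        self.policy_version_tracker.increment()

    def collect(self, raw_batch: dict) -> dict:
        return self.policy_version_tracker.stamp(raw_batch)


collector = ToyCollector()
collector.update_policy_weights_(state_dict={})
print(collector.collect({"obs": 1})["policy_version"])  # -> 1
```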

sota-implementations/grpo/README.md

Lines changed: 47 additions & 8 deletions
@@ -11,6 +11,7 @@ GRPO is a method for training language models using reinforcement learning, with
 - Automatic checkpointing
 - Comprehensive logging with Weights & Biases
 - Hydra configuration system
+- Asynchronous training support with Ray
 
 ## Installation
 
@@ -34,7 +35,27 @@ export VLLM_USE_V1=0 # Required for vLLM compatibility
 - vLLM inference device
 - Reference model device
 
-Devices can be controlled via the `training_model.devices`, `inference_model.devices` and `ref_model.devices` arguments.
+### Device Management
+
+There are two ways to specify device allocation:
+
+1. Using `num_devices` (Recommended):
+```bash
+train_model.num_devices=2 ref_model.num_devices=2 inference_model.num_devices=2
+```
+This approach automatically manages device allocation based on the training mode (sync/async) and prevents device conflicts.
+
+2. Using `devices` (Manual):
+```bash
+train_model.devices=[0,1] ref_model.devices=[2,3] inference_model.devices=[4,5]
+```
+This approach requires manual device management and is more error-prone.
+
+The `num_devices` approach is recommended as it:
+- Automatically handles device allocation
+- Works correctly in both sync and async modes
+- Prevents device conflicts between model components
+- Is more portable across different machine configurations
 
 ## Configuration
 
@@ -46,10 +67,24 @@ The training configuration is managed through Hydra. There are two main configur
 
 ### Basic Training
 
+There are two training modes available:
+
+#### Synchronous Mode (Default)
 ```bash
-python grpo.py
+VLLM_USE_V1=0 python sota-implementations/grpo/grpo.py train_model.num_devices=2 ref_model.num_devices=2 inference_model.num_devices=2
 ```
 
+#### Asynchronous Mode (Recommended)
+```bash
+VLLM_USE_V1=0 python sota-implementations/grpo/grpo-async.py train_model.num_devices=2 ref_model.num_devices=2 inference_model.num_devices=2
+```
+
+The async mode offers better performance through:
+- Concurrent data collection and optimization
+- More efficient GPU utilization
+- Reduced memory overhead
+- Better throughput
+
 ### Run with IFEval Config
 
 ```bash
@@ -63,7 +98,7 @@ python grpo.py --config-name grpo_ifeval
 python grpo.py env.dataset=ifeval
 
 # Modify training parameters
-python grpo.py train.epochs=2 train.optimizer.lr=2e-5
+python grpo.py optimizer.lr=2e-5 optimizer.weight_decay=0.01
 
 # Change model
 python grpo.py model.name=meta-llama/Llama-2-7b-hf
@@ -73,14 +108,16 @@ python grpo.py model.name=meta-llama/Llama-2-7b-hf
 
 ```bash
 # Learning rate sweep
-python grpo.py --multirun train.optimizer.lr=1e-4,1e-5,1e-6
+python grpo.py --multirun optimizer.lr=1e-4,1e-5,1e-6
 
 # Multiple parameters
 python grpo.py --multirun \
-  train.optimizer.lr=1e-4,1e-5 \
+  optimizer.lr=1e-4,1e-5 \
   policy.kl_coef=0.01,0.1
 ```
 
+Don't forget to set `train.total_dialog_turns` to a reasonable value!
+
 ## Monitoring
 
 Training progress is logged to Weights & Biases with the following metrics:
@@ -91,10 +128,11 @@ Training progress is logged to Weights & Biases with the following metrics:
 - ESS (Effective Sample Size)
 - Loss metrics (objective, clip fraction, etc.)
 - Gradient norm
+- Throughput metrics (in async mode)
 
 ## Checkpointing
 
-Checkpoints are saved every `logging.checkpoint_frequency` batches and contain:
+Checkpoints are saved every `train.checkpoint_frequency` steps and contain:
 - Model state
 - Optimizer state
 - Gradient scaler state (for mixed precision)
@@ -114,8 +152,9 @@ Checkpoints are saved every `logging.checkpoint_frequency` batches and contain:
 sota-implementations/grpo/
 ├── config/
 │   └── grpo_gsm8k.yaml   # Main configuration file
-│   └── grpo_ifeval.yaml  # config file for IFEval task
-├── grpo.py               # Training script
+│   └── grpo_ifeval.yaml  # config file for IFEval task
+├── grpo.py               # Synchronous training script
+├── grpo-async.py         # Asynchronous training script
 ├── grpo_utils.py         # Utility functions
 └── README.md             # This file
 ```
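The async mode added by this commit amounts to a producer/consumer loop: an inference worker keeps appending rollouts to a replay buffer while the trainer samples from it and periodically pushes refreshed weights back to the inference engine. The sketch below illustrates that pattern with plain Python threads and a deque; it is a conceptual stand-in, not the actual `grpo-async.py`, which uses Ray actors and TorchRL's collector and replay buffer.

```python
# Conceptual sketch of async GRPO training: data collection and optimization run concurrently.
# This is not the actual grpo-async.py; names and the update rule are placeholders.
import random
import threading
import time
from collections import deque

buffer = deque(maxlen=128)      # stand-in for the replay buffer (train.buffer_size)
buffer_lock = threading.Lock()
stop = threading.Event()
weight_update_frequency = 50    # mirrors train.weight_update_frequency in the config


def collect_rollouts() -> None:
    """Producer: generate rollouts with the (possibly slightly stale) inference policy."""
    while not stop.is_set():
        rollout = {"reward": random.random()}   # placeholder for one dialog turn
        with buffer_lock:
            buffer.append(rollout)
        time.sleep(0.001)                       # simulate generation latency


def train(num_steps: int = 200) -> None:
    """Consumer: optimize on sampled data and periodically refresh the inference weights."""
    for step in range(1, num_steps + 1):
        time.sleep(0.001)                       # give the producer time to fill the buffer
        with buffer_lock:
            if not buffer:
                continue
            batch = random.sample(list(buffer), k=min(4, len(buffer)))
        _ = sum(item["reward"] for item in batch)   # placeholder for the GRPO update
        if step % weight_update_frequency == 0:
            pass  # here the trainer would push new weights to the inference engine
    stop.set()


producer = threading.Thread(target=collect_rollouts, daemon=True)
producer.start()
train()
producer.join(timeout=1)
print(f"collected {len(buffer)} rollouts while training")
```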

sota-implementations/grpo/config/grpo_gsm8k.yaml

Lines changed: 64 additions & 36 deletions
@@ -1,4 +1,6 @@
+# @package _global_
 defaults:
+  - mode: async # Default to async mode, will be overridden by grpo.py
   - _self_
   - override hydra/hydra_logging: disabled
   - override hydra/job_logging: disabled
@@ -17,10 +19,30 @@ model:
   name: Qwen/Qwen2.5-3B
   compile: false
 
+# Base training configuration - will be merged with mode-specific settings
+train:
+  # Fields defined in mode configs (async.yaml and sync.yaml)
+  # mixed_precision: true # Whether to use mixed precision training
+  # epochs: 1 # Number of training epochs
+  # steps_per_batch: 32 # Number of steps per batch
+  # total_dialog_turns: 1_000_000 # Total number of dialog turns to collect
+  # optim_batch_size: 2 # Batch size for optimization
+  # gradient_accumulation_steps: 1 # Number of gradient accumulation steps
+  # kl_coef_in_loss: true # Whether to include KL coefficient in loss
+  # sync: false # Default to async, will be overridden by mode configs
+  # buffer_size: 128 # Size of replay buffer
+
+  # Fields used by both scripts but with different semantics
+  checkpoint_frequency: 100 # Save checkpoint every N steps/batches
+
+  # Fields used only by grpo-async.py
+  weight_update_frequency: 50 # Update policy weights every N steps
+  logging_frequency: 10 # Log metrics every N steps
 # Training model configuration
 train_model:
   gradient_checkpointing: true # Enabled for memory efficiency
-  devices: [0] # List of GPU devices to use for training
+  num_devices: 1 # Number of devices to use
+  devices: null # Will be computed by compute_device_allocation
   lora:
     enabled: true # Using LoRA for memory efficiency
     r: 8 # LoRA rank - controls capacity of adaptations
@@ -31,57 +53,63 @@ train_model:
   attn_implementation: sdpa # Using flash attention for memory efficiency
   torch_dtype: bfloat16
 
-# Inference model configuration (vLLM)
+# Inference model configuration
 inference_model:
-  devices: [1] # List of GPU devices to use for inference
-  gpu_memory_utilization: 0.5
+  num_devices: 1 # Number of devices to use
+  devices: null # Will be computed by compute_device_allocation
+  quantization:
+    enabled: false # Enable 4-bit quantization for base model
+  attn_implementation: sdpa # Using flash attention for memory efficiency
+  torch_dtype: bfloat16
+  gpu_memory_utilization: 0.5 # Limit GPU memory usage
   temperature: 0.8
   max_tokens: 1024
   include_stop_str_in_output: true
 
 # Reference model configuration
 ref_model:
-  devices: [2] # List of GPU devices to use for reference model
+  gradient_checkpointing: false # Always false, no backprop
+  num_devices: 1 # Number of devices to use
+  devices: null # Will be computed by compute_device_allocation
+  lora:
+    enabled: true # Using LoRA for memory efficiency
+    r: 8 # LoRA rank - controls capacity of adaptations
+    alpha: 16 # LoRA alpha - scales the adaptations
+    dropout: 0.1 # Dropout probability for LoRA layers
   quantization:
-    enabled: false # Enable quantization for memory efficiency
-  gradient_checkpointing: false # Not needed for reference model
-  attn_implementation:
+    enabled: false # Enable 4-bit quantization for base model
+  attn_implementation: sdpa # Using flash attention for memory efficiency
   torch_dtype: bfloat16
 
 # Policy configuration
 policy:
   kl_coef: 1e-2
 
-# Training configuration
-train:
-  epochs: 1
-  # Number of dialog turns per batch. This is passed to the collector and buffer.
-  # More steps do not consume more GPU memory, but it does affect the inference speed in
-  # that in sync contexts the training node will need to wait for a batch to be completed
-  # before starting the next one.
-  steps_per_batch: 64
-  # Total number of dialog turns to collect during training
-  total_dialog_turns: 1_000_000
-  # Number of batches to run in parallel. This determines the batch size passed to the optimizer.
-  # More batches consume more GPU memory.
-  optim_batch_size: 1
-  # Number of gradient accumulation steps. This determines the number of steps to run before
-  # updating the parameters.
-  gradient_accumulation_steps: 4 # Increased for gradient accumulation
-  # Whether to include the KL coefficient in the loss or in the environment reward.
-  kl_coef_in_loss: true
-  # Whether to use mixed precision.
-  mixed_precision: true # Disable mixed precision since we're not using it
-  optimizer:
-    name: AdamW
-    lr: 1e-5
-    clip_grad_norm: 0.5
-
+# Optimizer configuration
+optimizer:
+  name: AdamW
+  lr: 1e-5
+  clip_grad_norm: 100.0
+  weight_decay: 0.0
+# Ray configuration
+ray:
+  init_config:
+    num_cpus: 96 # Total available CPUs
+    num_gpus: 8 # Explicitly set number of GPUs
+    runtime_env:
+      working_dir: "."
+  collector_config:
+    num_cpus: 48 # CPUs for inference and ref model (co-located)
+  train_handler_config:
+    num_cpus: 24 # Dedicated CPUs for training
+  replay_buffer_config:
+    num_cpus: 24 # CPUs for replay buffer
+    num_gpus: 0.0 # No GPU needed for replay buffer
 # Logging configuration
 logging:
-  checkpoint_dir: checkpoints
-  experiment_name: null # auto-generated if null
-  checkpoint_frequency: 10 # save every N batches
+  experiment_name: null # Will be auto-generated if not provided
+  checkpoint_dir: "checkpoints"
+  checkpoint_frequency: 10 # Save checkpoint every N batches
 
 hydra:
   run:
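The `devices: null # Will be computed by compute_device_allocation` entries above refer to deriving explicit GPU lists from the `num_devices` fields so that the three model components do not collide. Below is a hypothetical sketch of such a derivation; the real `compute_device_allocation` helper referenced in the comments (presumably in `grpo_utils.py`) may differ, for instance by also taking the sync/async mode into account.

```python
# Hypothetical sketch of turning num_devices fields into explicit GPU lists;
# the actual compute_device_allocation used by the GRPO scripts may differ.
def allocate_devices(num_devices: dict[str, int], total_gpus: int = 8) -> dict[str, list[int]]:
    """Hand out disjoint, contiguous GPU index ranges to each model component."""
    allocation: dict[str, list[int]] = {}
    next_gpu = 0
    for component, count in num_devices.items():
        allocation[component] = list(range(next_gpu, next_gpu + count))
        next_gpu += count
    if next_gpu > total_gpus:
        raise ValueError(f"Requested {next_gpu} GPUs but only {total_gpus} are available")
    return allocation


# Example matching the async command in the README (2 GPUs per component):
print(allocate_devices({"train_model": 2, "ref_model": 2, "inference_model": 2}))
# {'train_model': [0, 1], 'ref_model': [2, 3], 'inference_model': [4, 5]}
```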

0 commit comments