This is the official code for the paper *SkillFactory: Self-Distillation for Learning Cognitive Behaviors*. In this repo, we provide the code for extracting reflection and verification data and reformatting it into SkillFactory traces.
Additionally, links to all of our models, evaluations, training datasets, and more can be found in our Hugging Face org, linked above!
```bash
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

```bash
# This runs data creation with reflection traces using default parameters
bash examples/example.sh --input-dataset SkillFactory/RAW_DATA-countdown3args-Qwen2.5-1.5B-Instruct --output-dataset SFT-TrainingData-Example_Output_Repo

# You can also specify custom input and output datasets:
bash examples/example.sh --input-dataset "your-input-dataset" --output-dataset "your-output-dataset"

# Or use short flags:
bash examples/example.sh -i "your-input-dataset" -o "your-output-dataset"
```

Our data creation code is provided in `skill_factory/sft_data_creation.py`, and we have example scripts in `examples/`.
For an example of what the input dataset should look like, see `SkillFactory/RAW_DATA-countdown3args-Qwen2.5-1.5B-Instruct`. The dataset should contain the following columns:
| Column | Type | Description |
|---|---|---|
| `model_responses__example_annotation` | string | A single model generation that attempts to answer the question. Must contain `<answer>` tags with the final answer. |
| `model_responses__example_annotation__eval_is_correct` | bool[] | Single-element array indicating whether the response was correct (True) or incorrect (False). |
| `model_responses__mutated_prompts_reflection` | string[] | Array of N reflection strings, each containing `<verdict>Correct</verdict>` or `<verdict>Incorrect</verdict>` tags. These are paired with the eval column to filter for reflections that correctly identify response validity. |
| `prompt_type` | string (optional) | Identifier for the prompt variant used. Useful when generating multiple responses per question with different prompts (up to 8 variants) to avoid oversampling. |
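As a quick sanity check, the sketch below loads the raw dataset with the `datasets` library and keeps only rows where at least one reflection's `<verdict>` agrees with the eval column. The split name and the filtering logic are assumptions made for illustration; the actual filtering lives in `skill_factory/sft_data_creation.py`.

```python
# Minimal sketch: inspect the raw data and check verdict/eval agreement.
# The split name ("train") and this filtering logic are illustrative assumptions.
import re
from datasets import load_dataset

ds = load_dataset("SkillFactory/RAW_DATA-countdown3args-Qwen2.5-1.5B-Instruct", split="train")

def has_matching_reflection(row):
    # The eval column is a single-element bool array (see table above).
    is_correct = row["model_responses__example_annotation__eval_is_correct"][0]
    expected = "Correct" if is_correct else "Incorrect"
    for reflection in row["model_responses__mutated_prompts_reflection"]:
        m = re.search(r"<verdict>(.*?)</verdict>", reflection, re.DOTALL)
        if m and m.group(1).strip() == expected:
            return True
    return False

filtered = ds.filter(has_matching_reflection)
print(f"{len(filtered)} / {len(ds)} rows have at least one verdict-consistent reflection")
```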
Our SFT was performed using LLaMA-Factory and our RL was performed using verl. We do not include these frameworks in this repo, but we recommend forking them and running from your forks.
For SFT training, you can create the SFT dataset via `sft_data_creation.py`, which will upload it to a Hugging Face repo. Take that repo and plug it into your `data/dataset_info.json` within LLaMA-Factory:
```json
{
...
"example_dataset": {
"hf_hub_url": "hf_url",
"formatting": "sharegpt",
"columns": {
"messages": "conversations"
},
"tags": {
"user_tag": "user",
"assistant_tag": "assistant",
"role_tag": "role",
"content_tag": "content"
},
"subset": "sft_train"
}
}
```

Once done, you can use the template YAML file in `examples/llamafactory_sft_example.yaml` for training with LLaMA-Factory.
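As a quick check of the uploaded data, the sketch below loads the SFT dataset directly with `datasets` and prints one conversation. The repo id and split name are placeholders; the `sft_train` subset and the `conversations` / `role` / `content` fields come from the entry above.

```python
# Minimal sketch: peek at the generated SFT data outside of LLaMA-Factory.
# "your-username/your-sft-dataset" and the "train" split are placeholders.
from datasets import load_dataset

ds = load_dataset("your-username/your-sft-dataset", "sft_train", split="train")

# Each row holds a ShareGPT-style conversation: a list of {"role", "content"} turns.
for turn in ds[0]["conversations"]:
    print(f"{turn['role']}: {turn['content'][:200]}")
```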
Here is an example command we used to train our models with verl:

```bash
python -m verl.trainer.main_ppo \
\
# === Trainer / run config ===
trainer.total_epochs=50 \
trainer.save_freq=25 \
trainer.test_freq=25 \
trainer.val_before_train=True \
trainer.logger=[console,wandb] \
trainer.project_name=SkillFactory \
trainer.experiment_name=experiment_name \
trainer.nnodes=1 \
trainer.n_gpus_per_node=4 \
trainer.default_local_dir=/path/to/verl/checkpoints \
\
# === Algorithm (PPO / GRPO etc.) ===
algorithm.adv_estimator=grpo \
algorithm.kl_ctrl.kl_coef=0.001 \
\
# === Data ===
data.train_files=/path/to/data/train.parquet \
data.val_files=/path/to/data/val.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=4096 \
\
# === Rollout config ===
actor_rollout_ref.rollout.n=16 \
actor_rollout_ref.rollout.max_num_batched_tokens=16384 \
actor_rollout_ref.rollout.max_num_seqs=2048 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.dtype=bfloat16 \
\
# === Actor / ref model optimization ===
actor_rollout_ref.actor.optim.lr=1e-06 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.strategy=fsdp2 \
actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
\
# === Policy / critic models ===
actor_rollout_ref.model.path=/path/to/prefetched_model \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.model.enable_activation_offload=True \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.model.trust_remote_code=True \
\
critic.model.path=/path/to/prefetched_model \
critic.optim.lr=1e-05 \
critic.ppo_micro_batch_size_per_gpu=1 \
critic.model.trust_remote_code=True \
\
# === Reward model / custom reward ===
reward_model.reward_manager=batch \
reward_model.launch_reward_fn_async=True \
reward_model.model.fsdp_config.forward_prefetch=True \
custom_reward_function.name=compute_score_batch \
custom_reward_function.path=/path/to/verl/rewards_fn \
\
# === Hydra output dirs ===
hydra.run.dir=/path/to/workflow_out/hydra \
hydra.output_subdir=null \
hydra.job.chdir=False
```

Our reward function extracts the text inside `<answer>` tags and checks it against the gold label for equivalence. This is very similar to the examples given in the verl repo.
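For reference, here is a minimal sketch of what such a reward function could look like. The batch signature (lists of solutions and gold labels in, a list of scores out) is our assumption based on `reward_model.reward_manager=batch`; check the verl examples for the exact interface it expects.

```python
# Minimal sketch of an <answer>-extraction reward, in the spirit of the verl examples.
# The batch signature below is an assumption; consult the verl repo for the exact one.
import re

def _extract_answer(text):
    # Take the last <answer>...</answer> span in the response, if any.
    matches = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return matches[-1].strip() if matches else None

def compute_score_batch(data_sources, solution_strs, ground_truths, extra_infos=None, **kwargs):
    scores = []
    for solution, gold in zip(solution_strs, ground_truths):
        pred = _extract_answer(solution)
        # 1.0 if the extracted answer matches the gold label, else 0.0.
        scores.append(1.0 if pred is not None and pred == str(gold).strip() else 0.0)
    return scores
```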
```bibtex
@article{sprague2025skillfactory,
  title={SkillFactory: Self-Distillation for Learning Cognitive Behaviors},
  author={Sprague, Zayne and Lu, Jack and Wadhwa, Manya and Keh, Sedrick and Ren, Mengye and Durrett, Greg},
  journal={arXiv preprint arXiv:2512.04072},
  year={2025}
}
```
