
Commit e39d852

authored
[fix] wordle model_id updates (#4453)
1 parent 0d57110 commit e39d852

2 files changed: +4 −4 lines changed


docs/source/openenv.md

Lines changed: 2 additions & 2 deletions
@@ -358,7 +358,7 @@ The example requires two GPUs:
 
 ```bash
 # Terminal 1: Start vLLM inference server
-CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct --host 0.0.0.0 --port 8000
+CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen3-1.7B --host 0.0.0.0 --port 8000
 
 # Terminal 2: Run GRPO training with OpenEnv
 CUDA_VISIBLE_DEVICES=1 python examples/scripts/openenv/wordle.py
@@ -368,6 +368,6 @@ CUDA_VISIBLE_DEVICES=1 python examples/scripts/openenv/wordle.py
 
 The resulting model improves its performance on the game, both by reducing the number of repetitions and by increasing the number of correct guesses. However, the Qwen3-1.7B model we trained is not able to consistently win the game. The following reward curve shows the coverage of the model's guesses and the coverage of correct Y and G letters.
 
-<iframe src="https://burtenshaw-wordle-grpo.hf.space/?project=group-Qwen-Qwen3-17B&metrics=train/rewards/reward_coverage/mean&runs=run-2025-10-26_09-39-49&sidebar=hidden&navbar=hidden" style="width:600px; height:500px; border:0;"></iframe>
+<iframe src="https://burtenshaw-wordle-grpo.hf.space?project=group-Qwen-Qwen3-17B&metrics=reward&runs=run-2025-10-26_09-39-49,run-2025-10-26_08-04-49&sidebar=hidden&navbar=hidden" style="width:1600px; height:500px; border:0;"></iframe>
 
 We experimented with larger models like `gpt-oss-20b` and found that the model was able to consistently win the game. However, training it requires a lot of compute. Why not try this out yourself?
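The coverage-style reward mentioned in the prose above can be illustrated with a toy sketch. This is a hypothetical reconstruction, not code from `wordle.py`: it assumes `G` marks a letter correct and in position, `Y` marks a letter present elsewhere, and it ignores Wordle's duplicate-letter rules for brevity.

```python
# Hypothetical sketch of a Wordle coverage reward; names and logic are
# illustrative assumptions, not taken from examples/scripts/openenv/wordle.py.

def feedback(guess: str, target: str) -> str:
    """Return G/Y/- marks per position (simplified: ignores duplicate-letter rules)."""
    marks = []
    for g, t in zip(guess, target):
        if g == t:
            marks.append("G")       # correct letter, correct position
        elif g in target:
            marks.append("Y")       # letter present elsewhere in the target
        else:
            marks.append("-")       # letter absent
    return "".join(marks)

def reward_coverage(guesses: list[str], target: str) -> float:
    """Fraction of distinct target letters that any guess has hit with G or Y."""
    covered = set()
    for guess in guesses:
        for g, mark in zip(guess, feedback(guess, target)):
            if mark in "GY":
                covered.add(g)
    return len(covered) / len(set(target))
```

For example, a single guess of `"crane"` against the target `"crate"` covers four of the five distinct target letters, giving a coverage of 0.8.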

examples/scripts/openenv/wordle.py

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@
 python -m src.envs.textarena_env.server.app
 
 # Start the vLLM server with your model
-CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct --host 0.0.0.0 --port 8000
+CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen3-1.7B --host 0.0.0.0 --port 8000
 
 # Then run this training script:
 CUDA_VISIBLE_DEVICES=1 python examples/scripts/openenv/wordle.py
@@ -66,7 +66,7 @@ def parse_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--model-id",
-        default="willcb/Qwen3-1.7B-Wordle",
+        default="Qwen/Qwen3-1.7B",
         help="Model identifier passed to GRPOTrainer for fine-tuning.",
     )
     parser.add_argument(

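The default-flag change in the second hunk can be exercised in isolation. Below is a minimal sketch of the relevant `argparse` wiring; the surrounding options from `wordle.py` are omitted, and the overall structure is an assumption, not a copy of the script.

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    # Sketch of the --model-id option only; other wordle.py flags are omitted.
    parser = argparse.ArgumentParser(description="GRPO Wordle training (sketch)")
    parser.add_argument(
        "--model-id",
        default="Qwen/Qwen3-1.7B",  # new default introduced by this commit
        help="Model identifier passed to GRPOTrainer for fine-tuning.",
    )
    return parser.parse_args(argv)
```

Note that `argparse` exposes the flag as `args.model_id` (hyphens become underscores), so callers can still override the default on the command line, e.g. `--model-id willcb/Qwen3-1.7B-Wordle` to use the previous checkpoint.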
0 commit comments
