Full parameter fine-tuning sampler sub-system by droot · Pull Request #129 · gke-labs/open-rl

droot · 2026-06-17T21:14:07Z

feat(k8s,vllm): introduce decoupled distributed reinforcement learning and vLLM time-slicing infrastructure

Overview

This PR introduces a horizontally scalable, decoupled microservices architecture for running distributed Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) workloads on Kubernetes. By combining Kubernetes Dynamic Resource Allocation (DRA) exact-allocation claims with cooperative VRAM yielding, the design lets concurrent training and sampling experiments share the same GPUs at high density across a shared hardware fleet.

Key Architectural Highlights

Decoupled Workload Orchestration: Separates policy gradient calculation (Trainers) from rollout generation (Samplers). The central Gateway service dynamically generates isolated Pod manifests per runtime session without requiring static worker deployments.
Dynamic Resource Allocation (DRA) Pinning: Replaces standard exclusive GPU locks with exact-allocation device claims. Concurrent workloads co-schedule onto shared silicon up to configured concurrency limits.
Strict Role Segregation: Introduces dedicated node affinity groups (trainers vs. samplers) so that PyTorch AdamW optimizer state and vLLM KV caches never share the same physical hardware, which avoids CUDA out-of-memory failures caused by memory contention.
Cooperative Time-Slicing & NFS Weight Sync: Shares active GPU VRAM across multi-tenant inference engines. Sampler engines cooperatively release VRAM (sleep) between batches and reload safetensors weights in place in ~1.1 seconds when training updates are synchronized over managed shared NFS storage.
Host Memory Oversubscription Tuning: Lowers worker host-memory (CPU RAM) requests to 16GiB so that multiple concurrent distributed training jobs can be packed onto standard 48GiB host machines without scheduling deadlocks.
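
The per-session Pod generation described above can be pictured with a minimal sketch. It is not the Gateway's actual code: the function name `build_worker_pod`, the label key `example.com/worker-role`, the image, and the claim template name are all hypothetical placeholders, and a plain `nodeSelector` stands in for whatever node affinity rules the real manifests use. The `resourceClaims` and `resources.claims` fields are the standard Pod fields for DRA.

```python
from typing import Any, Dict

# Hypothetical label key; the PR text does not name the real one.
ROLE_LABEL = "example.com/worker-role"


def build_worker_pod(session_id: str, role: str, image: str,
                     claim_template: str,
                     memory_request: str = "16Gi") -> Dict[str, Any]:
    """Return a Pod manifest (as a dict) for one trainer or sampler session."""
    if role not in ("trainer", "sampler"):
        raise ValueError(f"unknown role: {role!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{role}-{session_id}",
            "labels": {"session": session_id, ROLE_LABEL: role},
        },
        "spec": {
            "restartPolicy": "Never",
            # Keep trainers and samplers on separate node groups.
            "nodeSelector": {ROLE_LABEL: role},
            # Request the GPU through a DRA claim template rather than an
            # exclusive device limit (template name is an assumption).
            "resourceClaims": [
                {"name": "gpu", "resourceClaimTemplateName": claim_template}
            ],
            "containers": [{
                "name": role,
                "image": image,
                "resources": {
                    # 16Gi mirrors the host-memory request quoted above.
                    "requests": {"memory": memory_request},
                    "claims": [{"name": "gpu"}],
                },
            }],
        },
    }
```

A call such as `build_worker_pod("abc123", "sampler", "example/sampler:dev", "shared-gpu")` yields a manifest named `sampler-abc123` that could be submitted through any Kubernetes client. Building the manifest per session is what removes the need for a static worker Deployment.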

End-to-End Verifications Achieved

10-Step Distributed RL Verification: Ran a 10-step distributed reinforcement learning loop against live KRM clusters; the loss decreased steadily and the policy converged to a mean reward of 1.00 (100% accuracy).
Dual Concurrent Workload Multiplexing: Ran two RL experiments in parallel on the same physical GPUs, confirming stable CRIU process swapping, cooperative inference yield, and error-free cleanup on session teardown.
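
The cooperative yield and weight reload exercised by these runs can be illustrated with a small simulation. This is a sketch under loud assumptions, not the PR's implementation: `FakeEngine` is a hypothetical stand-in for an inference engine, a `threading.Lock` stands in for whatever cross-process coordination the real system uses, and the `step-N.safetensors` naming in the shared directory is invented for the example.

```python
import threading
from pathlib import Path
from typing import List, Optional


class FakeEngine:
    """Hypothetical engine; a real one would free and restore GPU memory."""

    def __init__(self) -> None:
        self.awake = False
        self.loaded_version: Optional[str] = None

    def wake_up(self) -> None:
        self.awake = True

    def sleep(self) -> None:
        self.awake = False

    def load_weights(self, path: Path) -> None:
        self.loaded_version = path.name

    def generate(self, prompts: List[str]) -> List[str]:
        return [f"{p} -> v={self.loaded_version}" for p in prompts]


class TimeSlicedSampler:
    """Wake, pick up new weights if any, sample, then yield the GPU slot."""

    def __init__(self, engine: FakeEngine, gpu_lock: threading.Lock,
                 weights_dir: Path) -> None:
        self.engine = engine
        self.gpu_lock = gpu_lock
        self.weights_dir = weights_dir  # e.g. a shared NFS mount

    def _latest_checkpoint(self) -> Optional[Path]:
        # Order by numeric step so step-10 sorts after step-2.
        found = sorted(self.weights_dir.glob("step-*.safetensors"),
                       key=lambda p: int(p.stem.split("-")[1]))
        return found[-1] if found else None

    def sample_batch(self, prompts: List[str]) -> List[str]:
        with self.gpu_lock:  # one engine is awake on the device at a time
            self.engine.wake_up()
            try:
                latest = self._latest_checkpoint()
                if latest and latest.name != self.engine.loaded_version:
                    self.engine.load_weights(latest)  # in-place reload
                return self.engine.generate(prompts)
            finally:
                self.engine.sleep()  # yield memory before releasing the slot
```

Two `TimeSlicedSampler` instances sharing one lock and one weights directory take turns on the device, and each picks up the newest checkpoint the next time it wakes. The `finally` block ensures an engine goes back to sleep even if generation raises.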


ShubyM

Left one small comment; otherwise looks good!

droot added 11 commits June 17, 2026 14:11

ci: fix runner disk space exhaustion and exclude scratch artifacts (36b0bcf)
feat(server): implement pull sampling queues and dynamic weight reloading (bdfe564)
feat(snapshot): implement Group Coordination and automated process tree memory discovery (4539449)
feat(training): coordinate SFT/vLLM time-slicing and add E2E verification benchmarks (dd180e9)
feat(k8s): add vLLM sampler worker node pool and dual pod template manifests (b5241cf)
docs(k8s): add step-by-step DRA experiment setup and smoke test guide (1bd07c4)
fix(k8s,sampler): resolve UnboundLocalError on shutdown sentinel and disable Kustomize ConfigMap hashing (c25ac9c)
tune(k8s): lower memory requests to 16Gi to support dual concurrent RL jobs on 48GiB nodes (e0c9181)
docs(fft): add comprehensive walkthrough of Kubernetes pod placement and time-slicing architecture (7171c5a)
chore: ignore docs/scratch directory in .gitignore (3badd1a)
docs(setup): document decoupled vLLM sampler architecture and allocate dual DRA nodes (abe45e4)

droot marked this pull request as ready for review June 18, 2026 18:49

droot requested a review from ShubyM June 18, 2026 18:49

ShubyM reviewed Jun 18, 2026


Comment thread k8s/deploy/distributed-fft-timeslice/04-gateway.yaml

droot merged commit 12d3fac into gke-labs:fft Jun 18, 2026
6 checks passed

droot merged 11 commits into gke-labs:fft from droot:fft-sampler-layered
