Full parameter fine-tuning sampler sub-system #129

Merged
droot merged 11 commits into gke-labs:fft from droot:fft-sampler-layered
Jun 18, 2026

@droot droot commented Jun 17, 2026

Collaborator

feat(k8s,vllm): introduce decoupled distributed reinforcement learning and vLLM time-slicing infrastructure

Overview

This PR introduces a horizontally scalable, decoupled microservices architecture for running distributed Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) workloads on Kubernetes. By combining Kubernetes Dynamic Resource Allocation (DRA) exact-allocation claims with cooperative VRAM yielding, the design virtualizes GPUs so that concurrent training and sampling experiments can share a hardware fleet at high density.

Key Architectural Highlights

  • Decoupled Workload Orchestration: Separates policy gradient calculation (Trainers) from rollout generation (Samplers). The central Gateway service generates an isolated Pod manifest for each runtime session on demand, so no static worker deployments are required.
  • Dynamic Resource Allocation (DRA) Pinning: Replaces standard exclusive GPU locks with exact-allocation device claims, so concurrent workloads co-schedule onto the same GPU up to a configured concurrency limit.
  • Strict Role Segregation: Introduces dedicated node affinity groups (trainers vs. samplers) to guarantee physical hardware isolation between PyTorch AdamW optimizers and vLLM KV caches, preventing CUDA out-of-memory contention.
  • Cooperative Time-Slicing & NFS Weight Sync: Shares active GPU VRAM across multi-tenant inference engines. Sampler engines cooperatively release VRAM (sleep) between batches and, when a training update is synchronized over managed shared NFS storage, reload safetensors weights in place in ~1.1 seconds.
  • Host Memory Oversubscription Tuning: Tunes worker CPU RAM requests (16 GiB) so that multiple concurrent distributed training jobs pack safely onto standard 48 GiB hosts without scheduling deadlocks.
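
To make the first, second, third, and last highlights concrete, here is a minimal Python sketch of how a gateway might assemble a per-session Pod manifest. It is an illustration under stated assumptions, not the code in this PR: the function names, the `example.com/role` node label, the claim template name, and the image are all hypothetical. The `resourceClaims` and `resources.claims` fields follow the Kubernetes DRA Pod API as found in recent releases, so check them against the cluster version in use.

```python
import json


def build_worker_pod(session_id: str, role: str, claim_template: str,
                     image: str, memory_request: str = "16Gi") -> dict:
    """Return a Pod manifest (plain dict) for one runtime session.

    Hypothetical helper: names, labels, and defaults are illustrative only.
    """
    if role not in ("trainer", "sampler"):
        raise ValueError(f"unknown role: {role!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{role}-{session_id}",
            "labels": {"session": session_id, "role": role},
        },
        "spec": {
            "restartPolicy": "Never",
            # Assumed node label that keeps trainers and samplers on
            # separate node groups.
            "nodeSelector": {"example.com/role": role},
            # Pod-level DRA claim created from a ResourceClaimTemplate
            # (template name is an assumption).
            "resourceClaims": [
                {"name": "gpu", "resourceClaimTemplateName": claim_template}
            ],
            "containers": [{
                "name": role,
                "image": image,
                "resources": {
                    # CPU RAM request used by the scheduler for bin packing.
                    "requests": {"memory": memory_request},
                    # The container consumes the Pod-level claim by name.
                    "claims": [{"name": "gpu"}],
                },
            }],
        },
    }


def max_jobs_per_host(host_gib: int, request_gib: int, reserved_gib: int = 0) -> int:
    """Upper bound on workers that fit by memory request alone."""
    return (host_gib - reserved_gib) // request_gib


pod = build_worker_pod("abc123", "sampler", "shared-gpu-template",
                       "example.com/sampler:latest")
print(json.dumps(pod, indent=2))
print(max_jobs_per_host(48, 16))
```

With 16 GiB requests, a 48 GiB host fits at most three workers by memory request alone. Any memory the node reserves for the system lowers that bound, which is why `reserved_gib` is exposed as a parameter.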

End-to-End Verifications Achieved

  • 10-Step Distributed RL Verification: Ran distributed reinforcement learning loops against live KRM clusters, with loss decreasing and the policy converging to 1.00 mean reward (100% accuracy).
  • Dual Concurrent Workload Multiplexing: Ran two concurrent RL experiments in parallel on shared physical GPUs, confirming stable CRIU process swapping, cooperative inference yield, and error-free cleanup on session teardown.
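
The cooperative-yield property checked in the dual-workload test can be pictured with a toy model. The sketch below is an assumption-laden simulation in plain Python, not the PR's mechanism: a lock stands in for the GPU's VRAM, and the `SharedGpu` and `run_sampler` names are hypothetical. It shows only the invariant being verified, namely that two samplers sharing one device are never resident at the same time.

```python
import threading


class SharedGpu:
    """Toy stand-in for one physical GPU that one sampler occupies at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.awake = 0       # samplers currently holding VRAM
        self.max_awake = 0   # highest concurrency ever observed

    def __enter__(self) -> "SharedGpu":
        # "Wake": block until the other tenant has yielded, then claim VRAM.
        self._lock.acquire()
        self.awake += 1
        self.max_awake = max(self.max_awake, self.awake)
        return self

    def __exit__(self, *exc) -> bool:
        # "Sleep": release VRAM so the other tenant can run.
        self.awake -= 1
        self._lock.release()
        return False


def run_sampler(name: str, gpu: SharedGpu, batches: int, log: list) -> None:
    for step in range(batches):
        with gpu:
            # One rollout batch is generated while this sampler holds the GPU.
            log.append((name, step))
        # Between batches the sampler holds no VRAM.


gpu = SharedGpu()
log: list = []
threads = [
    threading.Thread(target=run_sampler, args=(name, gpu, 3, log))
    for name in ("exp-a", "exp-b")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"batches completed: {len(log)}, peak residency: {gpu.max_awake}")
```

The interleaving of `exp-a` and `exp-b` varies from run to run, but peak residency stays at one and the device is left free after both experiments finish.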

@droot droot marked this pull request as ready for review June 18, 2026 18:49
@droot droot requested a review from ShubyM June 18, 2026 18:49

@ShubyM ShubyM left a comment

Collaborator

Left one small comment; otherwise looks good!

Comment thread k8s/deploy/distributed-fft-timeslice/04-gateway.yaml
@droot droot merged commit 12d3fac into gke-labs:fft Jun 18, 2026
6 checks passed

2 participants