feat(fine-tuning): add Prometheus monitoring and LoRA serving for fine-tuned models #369
Conversation
- Finalizer now always encrypts locally first (TEE backup)
- Then attempts upload to 0G Storage (unless skipStorageUpload=true)
- If storage upload fails, falls back to local hash gracefully
- Encrypted file is always retained for TEE download fallback
- Removes old encryptAndUploadModel (split into encrypt + upload)

Co-authored-by: Cursor <cursoragent@cursor.com>
Previously DownloadFromStorage always assumed ZIP format, causing "zip: not a valid zip file" when the CLI uploads raw JSONL to 0G Storage.

Changes:
- Detect file type via magic bytes before attempting unzip
- For non-ZIP files (raw JSONL), move directly to the target path
- Convert raw JSONL to HuggingFace DatasetDict format after download (required by the token counter and training executor)

Co-authored-by: Cursor <cursoragent@cursor.com>
- DownloadFromStorage: detect file type via magic bytes, handle non-ZIP
- Clean up existing target before rename on retry
- Convert raw JSONL to HF DatasetDict format after download
- Auto-calculate fee when fee=0 (broker-calculated fee mode)
- Verify signature with the original fee to avoid mismatch

Co-authored-by: Cursor <cursoragent@cursor.com>
- [HIGH] Use the passed ctx instead of context.Background() in the Download call for proper cancellation/timeout propagation
- [HIGH] Add deferred cleanup for the temp download file to prevent disk leaks on early error returns
- [MEDIUM] Improve isZipFile() to return errors on I/O failures instead of silently treating them as non-ZIP files

Co-authored-by: Cursor <cursoragent@cursor.com>
- Upgrade 0g-storage-client v1.2.1 → v1.2.2 to fix an ABI incompatibility with the testnet Flow contract (was causing `execution reverted; data: 0x`)
- Remove deprecated WithRoutines() calls (now set internally via UploaderConfig in v1.2.2)
- Fix uploadModel() to return the raw 32-byte hash (hash.Bytes()) instead of the hex string's bytes ([]byte(hash.Hex())), which caused double-encoding via hexutil.Encode and prevented the client from downloading the model via 0G Storage

Tested: full flow verified with Qwen2.5-0.5B and Qwen3-32B (1.1 GB model): broker upload, client download from 0G Storage, decrypt, LoRA verified.

Co-authored-by: Cursor <cursoragent@cursor.com>
…e-tuned models

Implements two features for the fine-tuning service:

1. Prometheus Monitoring (#368):
   - Add fine-tuning specific metrics: task lifecycle, storage operations, settlement, HTTP requests, and unique-user tracking
   - Configurable via `monitor.enable` in config
   - Instrument worker phases, storage client, settlement, and task handler

2. Serve Fine-tuned Models to End-users (#367):
   - Add LoRA serving module with vLLM integration for shared base-model serving with multiple concurrent LoRA adapters
   - Auto-discover finished tasks and register LoRA adapters for inference
   - Private key authentication with model ownership access control
   - GPU isolation via CUDA_VISIBLE_DEVICES for inference vs training
   - OpenAI-compatible API endpoints with streaming support
   - Health endpoint and automatic adapter lifecycle management

Co-authored-by: Cursor <cursoragent@cursor.com>
Claude finished @zeyuyuyu's task in 2m 29s

## Code Review Complete ✅

I've completed a comprehensive review of the Prometheus monitoring and LoRA serving features. Overall, this is a well-structured implementation with good separation of concerns. Below are my findings organized by severity.

## 🔴 Critical Issues

### 1. [CRITICAL] Global Environment Variable Mutation

File:

```go
if m.config.InferenceGPUIDs != "" {
	os.Setenv("CUDA_VISIBLE_DEVICES", m.config.InferenceGPUIDs)
}
```

Issue: setting `CUDA_VISIBLE_DEVICES` with `os.Setenv` mutates the environment of the entire broker process, not just the inference subprocess.

Fix: set environment variables only for the vLLM subprocess:

```go
cmd := exec.CommandContext(ctx, "vllm", args...)
cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES="+m.config.InferenceGPUIDs)
```

## 🟠 High Priority Issues

### 2. [HIGH] Goroutine Leak in Monitor

File:

```go
go processUniqueUsers()
```

Issue: the goroutine has no cancellation path and leaks on shutdown.

Fix: pass a context and handle cancellation:

```go
func Init(serverName string, ctx context.Context) {
	// ...
	go processUniqueUsers(ctx)
}

func processUniqueUsers(ctx context.Context) {
	// ...
	for {
		select {
		case <-ctx.Done():
			return
		case userAddress := <-uniqueUsersChan:
			// ... process user
		}
	}
}
```

### 3. [HIGH] Panic in Metrics Server

File:

```go
func StartMetricsServer(address string) {
	// ...
	if err := r.Run(address); err != nil {
		panic(err) // ❌ Should return error instead
	}
}
```

Issue: using `panic` here brings down the whole service on a listener error.

Recommendation: this function appears unused in the codebase (metrics are served via the main HTTP server). Consider removing it or properly returning errors.

### 4. [HIGH] vLLM Process Cleanup Missing

File:

```go
cmd := exec.CommandContext(ctx, "vllm", args...)
// ...
if err := cmd.Start(); err != nil {
	return // Process started but not tracked for cleanup
}
```

Issue: if the vLLM process starts but encounters errors later, it may not be properly killed.

Fix: add proper cleanup in Manager:

```go
func (m *Manager) Stop() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.vllmProcess != nil && m.vllmProcess.Process != nil {
		if err := m.vllmProcess.Process.Kill(); err != nil {
			return errors.Wrap(err, "kill vLLM process")
		}
	}
	return nil
}
```

Then call this from the service's shutdown path.

### 5. [HIGH] Unchecked Error in Proxy Response

File:

```go
respBody, _ := io.ReadAll(resp.Body) // Error ignored
c.Writer.Write(respBody)
```

Issue: ignoring read errors can cause incomplete responses to be sent to users.

Fix:

```go
respBody, err := io.ReadAll(resp.Body)
if err != nil {
	p.logger.Errorf("failed to read response: %v", err)
	c.JSON(http.StatusBadGateway, gin.H{"error": "Failed to read backend response"})
	return
}
c.Writer.Write(respBody)
```

### 6. [HIGH] Unchecked File Removal Errors

Files:

```go
os.Remove(destDir) // Error ignored
```

Issue: failing to remove symlinks can cause disk space issues and stale state.

Fix:

```go
if err := os.Remove(destDir); err != nil && !os.IsNotExist(err) {
	m.logger.Warnf("failed to remove model symlink %s: %v", destDir, err)
}
```

## 🟡 Medium Priority Issues

### 7. [MEDIUM] Missing Pagination for Serving Tasks

File:

```go
func (d *DB) GetFinishedTasksForServing() ([]Task, error) {
	var tasks []Task
	ret := d.db.Where("progress IN ?", servableStates).Order("created_at DESC").Find(&tasks)
	// No LIMIT clause
```

Issue: this query can return unlimited rows, causing memory issues as tasks accumulate.

Fix: add pagination or a reasonable limit:

```go
ret := d.db.Where("progress IN ?", servableStates).
	Order("created_at DESC").
	Limit(1000). // Or make configurable
	Find(&tasks)
```

### 8. [MEDIUM] Model Name Collisions Possible

File:

```go
func (m *Manager) makeModelName(baseModel string, taskID uuid.UUID) string {
	shortBase := baseModel
	if len(shortBase) > 16 {
		shortBase = shortBase[:16] // Truncation can cause collisions
	}
	// ...
	return fmt.Sprintf("ft-%s-%s", shortBase, taskID.String()[:8]) // Only 8 chars of UUID
}
```

Issue: using only 8 characters of the UUID significantly increases collision probability. Two tasks with the same base model prefix could collide.

Fix: use the full UUID or increase to 12-16 characters:

```go
return fmt.Sprintf("ft-%s-%s", shortBase, taskID.String()[:12])
```

### 9. [MEDIUM] Hardcoded HTTP Timeouts

Files:

Issue: the proxy's HTTP client has no timeout, which can cause goroutine leaks if the backend hangs.

Fix:

```go
client: &http.Client{
	Timeout: 5 * time.Minute, // Allow for long inference requests
	Transport: &http.Transport{
		MaxIdleConns:       100,
		IdleConnTimeout:    90 * time.Second,
		DisableCompression: false,
	},
},
```

### 10. [MEDIUM] Channel Buffer Size Arbitrary

File:

```go
uniqueUsersChan = make(chan string, 10000) // Why 10000?
```

Issue: hardcoded buffer size with no documentation or configurability.

Recommendation: either make it configurable via config or document the sizing rationale.

### 11. [MEDIUM] No Context Propagation in Health Check

File:

```go
func (m *Manager) waitForVLLMReady(ctx context.Context) {
	// ...
	resp, err := m.httpClient.Get(endpoint + "/health") // No context
```

Fix:

```go
req, _ := http.NewRequestWithContext(ctx, "GET", endpoint+"/health", nil)
resp, err := m.httpClient.Do(req)
```

## 🔵 Low Priority / Best Practices

### 12. [LOW] Registry Contract Registration is No-Op

File:

```go
func (r *Registry) registerOnContract(ctx context.Context, model *ServedModel) error {
	// ... logs only, no actual contract call
	return nil
}
```

Issue: the comment says "Full inference contract registration would require a separate inference contract instance, which is out of scope." This is confusing.

Fix: either document clearly that tracking is local-only for now, or remove the method until contract registration is implemented.

### 13. [LOW] Streaming Could Handle Errors Better

File:

```go
for {
	n, readErr := resp.Body.Read(buf)
	if n > 0 {
		if _, writeErr := c.Writer.Write(buf[:n]); writeErr != nil {
			p.logger.Warnf("stream write error: %v", writeErr)
			return // Abrupt termination
		}
```

Recommendation: consider sending an error SSE event before returning to inform the client of the failure.

### 14. [LOW] Task State Poller Not Tracked

File:

```go
go startTaskStatePoller(ctx, svc.db, logger)
```

Issue: goroutine is launched but not tracked in any sync.WaitGroup for graceful shutdown.

Recommendation: while context cancellation handles this, consider using a WaitGroup for cleaner shutdown:

```go
var wg sync.WaitGroup
wg.Add(1)
go func() {
	defer wg.Done()
	startTaskStatePoller(ctx, svc.db, logger)
}()
// ... on shutdown ...
wg.Wait()
```

## ⚪ Nitpicks

### 15. [nit] Missing godoc for Exported Functions

File: several exported functions lack documentation.

Fix: add godoc comments following Go conventions:

```go
// RecordTaskCreated increments the counter for total tasks created.
func RecordTaskCreated() {
	if TasksCreatedTotal != nil {
		TasksCreatedTotal.Inc()
	}
}
```

### 16. [nit] Inconsistent Error Wrapping

File: some functions use plain `errors.New` while others wrap errors with context.

Recommendation: be consistent with error wrapping for better debugging:

```go
return errors.New("LoRA serving is not enabled") // Instead of plain errors.New
```

## ✅ Good Practices Observed

Excellent work on these aspects:

## 📝 Summary

Overall Assessment: this is a solid implementation with good architecture. The main concerns are around resource management (goroutine leaks, process cleanup) and some edge cases in error handling.

Priority Recommendations:

Testing Recommendations:

Documentation Needed:

Todo List
…toring-and-serving

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	api/fine-tuning/config/config.go
#	api/fine-tuning/internal/handler/handler.go
#	api/fine-tuning/internal/storage/client.go
- Add context cancellation to monitor.processUniqueUsers goroutine to prevent leaks on shutdown
- Add Manager.Stop() method to gracefully terminate the vLLM process
- Use http.NewRequestWithContext for the health check to respect cancellation
- Add HTTP client timeout (5 min) to the proxy for inference requests
- Check os.RemoveAll errors when pruning/unregistering model directories
- Check c.Writer.Write error in non-streaming proxy responses
- Increase model name UUID suffix from 8 to 12 chars for uniqueness
- Add godoc comments to all exported monitor functions
- Remove unused promhttp import from monitor package
- Add missing ProviderStake and SupportedPredefinedModels config fields from origin/main merge

Co-authored-by: Cursor <cursoragent@cursor.com>
…anup duplicate Stop

- Add godoc comments to all exported types and functions in serving package
- Add .Limit(1000) to GetFinishedTasksForServing query to prevent unbounded results
- Remove duplicate Stop() method (keep the graceful signal-based one)
- Rewrite registerOnContract TODO comment for clarity

Co-authored-by: Cursor <cursoragent@cursor.com>
Claude finished @zeyuyuyu's task in 4m 9s

## Code Review Complete ✅

I've completed a thorough review of the Prometheus monitoring and LoRA serving implementation. This is a well-implemented feature with solid architecture and good attention to detail. I can see that many issues from a previous review were addressed (great job!). Below are my findings.

## 🟢 Excellent Improvements from Previous Review

I noticed many critical issues were fixed:

## 🔴 High Priority Issues

### 1. [HIGH] Temporary Directory for LoRA Modules May Not Persist

Files:

```go
// manager.go
if loraDir == "" {
	loraDir = "/tmp/lora-modules" // ❌ /tmp may be cleared on reboot
}

// config.go
LoraModulesDir: "/tmp/lora-modules", // ❌ Not persistent
```

Issue: using `/tmp` means registered adapter symlinks can disappear on reboot or tmp cleanup.

Fix: use a persistent directory:

```go
if loraDir == "" {
	loraDir = filepath.Join(utils.GetDataDir(), "lora-modules") // Persistent
}

// Or in config.go
LoraModulesDir: "./data/lora-modules",
```

Recommendation: document this in deployment docs and make it configurable with a sensible default outside `/tmp`.

### 2. [HIGH] GetVLLMModels Doesn't Accept Context

File:

```go
func (m *Manager) GetVLLMModels() ([]string, error) {
	resp, err := m.httpClient.Get(m.GetVLLMEndpoint() + "/v1/models") // ❌ No context
	if err != nil {
		return nil, err
	}
```

Issue: this function doesn't accept a context parameter, making it impossible to cancel the request or bound its lifetime.

Fix:

```go
func (m *Manager) GetVLLMModels(ctx context.Context) ([]string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, m.GetVLLMEndpoint()+"/v1/models", nil)
	if err != nil {
		return nil, err
	}
	resp, err := m.httpClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// ... rest of the code
}
```

## 🟡 Medium Priority Issues

### 3. [MEDIUM] Model Name Collision Risk Still Exists

File:

```go
func (m *Manager) makeModelName(baseModel string, taskID uuid.UUID) string {
	shortBase := baseModel
	if len(shortBase) > 16 {
		shortBase = shortBase[:16] // Truncation
	}
	// ...
	return fmt.Sprintf("ft-%s-%s", shortBase, taskID.String()[:12]) // Only 12 chars of UUID
}
```

Issue: while improved from 8 chars, using only 12 characters of the UUID still has a nonzero collision probability.

Fix: use more UUID characters or the full UUID:

```go
return fmt.Sprintf("ft-%s-%s", shortBase, taskID.String()[:16]) // 16 chars = 64 bits
// Or just use the full UUID (no collision risk):
return fmt.Sprintf("ft-%s-%s", shortBase, taskID.String())
```

Rationale: model names are internal identifiers, not user-facing. Length is less important than uniqueness.

### 4. [MEDIUM] Hardcoded Database Query Limit

File:

```go
ret := d.db.Where("progress IN ?", servableStates).
	Order("created_at DESC").
	Limit(1000). // ❌ Hardcoded
	Find(&tasks)
```

Issue: the 1000-task limit is hardcoded, which cannot be tuned per deployment.

Fix: make it configurable:

```go
// In config/config.go
type ServingConfig struct {
	// ... existing fields ...
	MaxDiscoverableTasks int `yaml:"maxDiscoverableTasks"`
}

// With default in GetConfig():
MaxDiscoverableTasks: 1000,

// In db/service.go:
func (d *DB) GetFinishedTasksForServing(limit int) ([]Task, error) {
	if limit <= 0 {
		limit = 1000 // Sensible default
	}
	ret := d.db.Where("progress IN ?", servableStates).
		Order("created_at DESC").
		Limit(limit).
		Find(&tasks)
	// ...
}
```

### 5. [MEDIUM] Streaming Error Handling Could Be More Graceful

File:

```go
for {
	n, readErr := resp.Body.Read(buf)
	if n > 0 {
		if _, writeErr := c.Writer.Write(buf[:n]); writeErr != nil {
			p.logger.Warnf("stream write error: %v", writeErr)
			return // ❌ Abrupt termination, client may not know why
		}
		c.Writer.Flush()
	}
	if readErr != nil {
		if readErr != io.EOF {
			p.logger.Warnf("stream read error: %v", readErr)
		}
		return
	}
}
```

Issue: when a streaming write fails, the connection is abruptly closed without sending an error event to the client.

Recommendation: for SSE streams, consider sending an error event before closing:

```go
if _, writeErr := c.Writer.Write(buf[:n]); writeErr != nil {
	p.logger.Warnf("stream write error: %v", writeErr)
	// Try to send error event (may fail if connection is broken, but worth trying)
	errorEvent := fmt.Sprintf("data: {\"error\": \"Stream interrupted\"}\n\n")
	c.Writer.Write([]byte(errorEvent))
	c.Writer.Flush()
	return
}
```

This is a nice-to-have rather than critical, as clients should handle unexpected disconnections anyway.

## 🔵 Low Priority / Best Practices

### 6. [LOW] Missing Godoc Comments for Exported Functions

File: exported functions lack documentation.

Fix: add godoc comments following Go conventions:

```go
// RecordTaskCreated increments the counter for total fine-tuning tasks created.
// This metric tracks task creation across all users and includes both successful
// and failed task initialization attempts.
func RecordTaskCreated() {
	if TasksCreatedTotal != nil {
		TasksCreatedTotal.Inc()
	}
}

// RecordTaskCompleted increments the counter for successfully completed fine-tuning tasks.
// A task is considered completed when it reaches the Finished state after successful settlement.
func RecordTaskCompleted() {
	if TasksCompletedTotal != nil {
		TasksCompletedTotal.Inc()
	}
}
```

### 7. [LOW] Arbitrary Channel Buffer Size

File:

```go
uniqueUsersChan = make(chan string, 10000) // Why 10000?
```

Issue: the buffer size of 10,000 has no documented rationale.

Recommendation: either make it configurable or document the sizing rationale.

### 8. [LOW] Registry Module is Essentially a No-Op

File:

```go
func (r *Registry) registerOnContract(ctx context.Context, model *ServedModel) error {
	r.logger.Infof("registering fine-tuned model on contract: name=%s, ...", model.ModelName)
	// TODO: When fine-tuning and inference brokers share a contract interface, this method
	// should call contract.AddOrUpdateService() to register the LoRA model...
	r.logger.Infof("model %s marked as registered for inference serving...", model.ModelName)
	return nil // ❌ Does nothing
}
```

Issue: the entire registration path only logs; no contract call is made.

Recommendation: either document clearly that tracking is local-only for now, or remove the method until contract registration is implemented.

### 9. [LOW] Task State Poller Not Tracked for Clean Shutdown

File:

```go
go startTaskStatePoller(ctx, svc.db, logger)
```

Issue: while the goroutine handles context cancellation, it's not tracked in a `sync.WaitGroup`.

Recommendation: use a WaitGroup for cleaner shutdown:

```go
var wg sync.WaitGroup
// ... in runApplication:
if cfg.Monitor.Enable {
	monitor.Init(cfg.Service.ServingUrl, ctx)
	engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
	engine.Use(monitor.TrackMetrics())
	wg.Add(1)
	go func() {
		defer wg.Done()
		startTaskStatePoller(ctx, svc.db, logger)
	}()
}

// ... before return in runApplication:
logger.Info("waiting for background tasks to finish...")
wg.Wait()
```

This ensures metrics are fully flushed before shutdown.

### 10. [nit] Inconsistent Error Wrapping

Examples:

Recommendation: be consistent with error wrapping for better debugging:

```go
return fmt.Errorf("LoRA serving is not enabled")
return fmt.Errorf("model not found: %s", modelName)
```

Or standardize on a single error-wrapping approach.

## ⚪ Nitpicks

### 11. [nit] Variable Naming in makeModelName

File:

```go
func (m *Manager) makeModelName(baseModel string, taskID uuid.UUID) string {
	shortBase := baseModel
	if len(shortBase) > 16 {
		shortBase = shortBase[:16]
	}
	shortBase = strings.Map(func(r rune) rune { // ❓ What does this do?
		if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') || r == '-' || r == '_' {
			return r
		}
		return '-'
	}, shortBase)
	return fmt.Sprintf("ft-%s-%s", shortBase, taskID.String()[:12])
}
```

Recommendation: add a comment explaining the sanitization:

```go
func (m *Manager) makeModelName(baseModel string, taskID uuid.UUID) string {
	shortBase := baseModel
	if len(shortBase) > 16 {
		shortBase = shortBase[:16]
	}
	// Sanitize model name: replace non-alphanumeric chars (except - and _) with hyphens.
	// This ensures vLLM-compatible model identifiers.
	shortBase = strings.Map(func(r rune) rune {
		if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') || r == '-' || r == '_' {
			return r
		}
		return '-'
	}, shortBase)
	return fmt.Sprintf("ft-%s-%s", shortBase, taskID.String()[:12])
}
```

## ✅ Excellent Practices Observed

Well Done! 👏

## 📝 Summary

Overall Assessment: this is a solid, production-ready implementation with good architecture and attention to detail. The monitoring integration is clean, and the serving module provides a valuable feature for users.

Priority Recommendations:

Testing Recommendations:

Documentation Needed:

Review Checklist

Recommendation: ✅ Approve with minor revisions. Address high-priority issues (#1, #2) before merging to production, but the code is solid enough for staging deployment.
- Add comment explaining uniqueUsersChan buffer sizing rationale
- Add godoc + sanitization comment to makeModelName
- Use http.NewRequestWithContext in GetVLLMModels for proper context propagation
- Use fmt.Errorf for string-formatted errors (consistent error style)
- Add startup log to Registry clarifying local-only tracking
- Wrap task state poller goroutine in sync.WaitGroup for clean shutdown

Co-authored-by: Cursor <cursoragent@cursor.com>
…e offload/restore

Adds a four-tier caching system for LoRA adapters:
- GPU (hot): vLLM native LRU via --max-loras
- CPU (warm): vLLM native via --max-cpu-loras
- Disk (cool): dynamic loading via the lora_filesystem_resolver plugin
- 0G Storage (cold): automatic offload of inactive adapters and on-demand restore

Key changes:
- New model_cache.go with ModelState enum, StorageDownloader interface, offload loop, and async restore logic
- Manager tracks LastAccessedAt/State/OutputRootHash per model
- vLLM startup includes --max-cpu-loras and filesystem-resolver env vars
- Proxy returns HTTP 202 with loading status for archived/loading models
- Health endpoint reports cache-tier statistics
- Config adds MaxCpuLoras, OffloadAfterMinutes, EnableColdStorage fields

Made-with: Cursor
…caching

17 tests covering:
- Model registration, state tracking, and access recording
- Offload logic (stale models, skip without hash, skip recent)
- Restore logic (async download, idempotent for active/loading models)
- Unregister, prune, ownership checks, model name generation
- Full offload→restore cycle end-to-end

Made-with: Cursor
Covers architecture, the multi-tier caching strategy (GPU→CPU→Disk→0G Storage), API reference, configuration guide, authentication, deployment, and testing results from end-to-end validation on an NVIDIA H20.

Made-with: Cursor
…ry leak

vLLM 0.16.0 uses a multi-process architecture (APIServer + EngineCore). Previously Stop() only sent SIGINT to the main process, leaving EngineCore as an orphan still holding GPU memory. This caused GPU OOM for other users.

Fix: start vLLM with Setpgid=true and kill the entire process group (-pid) with SIGTERM/SIGKILL on shutdown.

Also adds an E2E serving test program that validates the full flow: MySQL -> Manager -> vLLM -> Proxy -> authenticated inference requests. All 11 test cases passed on an H20 GPU with 3 concurrent LoRA adapters.

Made-with: Cursor
Instead of always returning HTTP 202 when a model is archived/loading, clients can now set wait_for_model=true in the request body to block until the model is ready, similar to ServerlessLLM's approach.

- Add WaitForModel with a channel-based notification mechanism
- Add configurable ModelLoadTimeoutSeconds (default 300s)
- Add E2E test cases for the wait_for_model feature
- Add 6 new unit tests covering wait scenarios

Made-with: Cursor
Allows configuring --gpu-memory-utilization for vLLM to coexist with other GPU processes on shared machines. Made-with: Cursor
Summary
- Multi-tier LoRA adapter caching (GPU → CPU → Disk → 0G Storage) with automatic offload and restore
- Clients can set `wait_for_model: true` to block until a cold-storage model finishes loading instead of polling HTTP 202 (inspired by ServerlessLLM)
- Adds a `gpuMemoryUtilization` parameter for vLLM to coexist with other GPU processes

Architecture Overview
System Components
Multi-tier Caching: How Each Tier Works
- GPU (hot): vLLM native LRU via `--max-loras`
- CPU (warm): vLLM native via `--max-cpu-loras`
- Disk (cool): dynamic loading via the `lora_filesystem_resolver` plugin
- 0G Storage (cold): automatic offload/restore managed by `model_cache.go`; requests for archived models get `"status": "loading"`

Key insight: the first three tiers (GPU, CPU, Disk) are entirely managed by vLLM's internal LRU cache and are invisible to the user -- requests always succeed immediately. Only the fourth tier (0G Storage) requires user-side retry logic.
Request Flow: LoRA on Disk (the common case)
When a LoRA adapter is on disk but not in GPU/CPU memory (state = `active`), vLLM's `lora_filesystem_resolver` loads it from disk transparently and the request succeeds immediately.

Request Flow: LoRA Archived in 0G Storage

When a LoRA adapter has been offloaded (state = `archived`), the proxy returns HTTP 202 with `"status": "loading"` and kicks off an async restore from 0G Storage; clients can instead set `wait_for_model: true` to block until the restore completes.

Automatic Model Discovery
The Manager runs a background poller (`pollFinishedTasks`) that:
- queries tasks with `progress = "Finished"` or `"UserAcknowledged"`
- creates a symlink in `loraModulesDir` pointing to the LoRA output
- marks the adapter `active`; vLLM's `filesystem_resolver` automatically discovers the new symlink

This means fine-tuning output is automatically available for inference with zero manual intervention.
vLLM Process Group Cleanup (Bug Fix)
vLLM 0.16.0+ uses a multi-process architecture (APIServer + EngineCore).

Previously, `Manager.Stop()` only sent SIGINT to the main process, leaving EngineCore alive as an orphan still holding GPU memory. This caused GPU OOM errors for other users sharing the machine.

Fix: start vLLM with `Setpgid: true` to create an independent process group, then kill the entire group (`syscall.Kill(-pid, SIGTERM)`) on shutdown.

Authentication
Users authenticate via EIP-191 personal signatures:
- sign the message `"0g-serving-inference-auth"` with their Ethereum private key
- send the result as `Authorization: Bearer 0x<signature>`; the proxy recovers the signer address and enforces model ownership

Configuration
Test Results
Unit Tests: 17/17 passed
Covers model registration, state tracking, access recording, offload/restore logic, ownership verification, and the full offload-restore cycle.
E2E Tests: 13/13 passed (NVIDIA H20 GPU)
Ran on server with Qwen2.5-0.5B-Instruct base model + 3 LoRA adapters:
- `wait_for_model=true` on an active model
- `model_load_timeout_sec` timeout behavior

Unit Tests: 23/23 passed

Including 6 new tests for `WaitForModel`:
- `TestWaitForModelAlreadyActive`: instant return for active models
- `TestWaitForModelRestoreCompletes`: blocks then returns after restore
- `TestWaitForModelTimeout`: respects the timeout deadline
- `TestWaitForModelContextCancelled`: respects context cancellation
- `TestWaitForModelMultipleWaiters`: 3 concurrent waiters all wake up
- `TestWaitForModelRestoreFails`: returns Archived state on failure

Key observations:
- `filesystem_resolver` dynamically loaded LoRA adapters on first request
- `wait_for_model=true` works transparently for already-active models

Files Changed
- serving/manager.go
- serving/proxy.go
- serving/model_cache.go
- serving/registry.go
- serving/serving_test.go
- cmd/e2e-serving-test/main.go
- db/test_helpers.go
- config/config.go
- cmd/server/main.go
- monitor/
- docs/LORA_SERVING.md