LoRA serving: Observability — Prometheus metrics and logging #390

@zeyuyuyu

Context

Parent issue: #367 (Serve fine-tuned models to end-users)
Code review on PR #379 identified missing observability as a medium-priority gap.

Scope

Add Prometheus metrics and structured logging for the LoRA adapter lifecycle:

Metrics

  • lora_adapters_total (gauge): Number of adapters by state (active/loading/offloaded/failed)
  • lora_adapter_deploy_duration_seconds (histogram): Time to download, decrypt, and deploy an adapter
  • lora_adapter_operations_total (counter): Operations by type (deploy/offload/restore/fail)
  • lora_event_watcher_lag_blocks (gauge): current_block - last_processed_block
  • lora_storage_download_bytes_total (counter): Bytes downloaded from 0G Storage
  • lora_inference_requests_total (counter): Requests by adapter name
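
To make the metric list concrete, here is a dependency-free Python sketch of what the /metrics output could look like for three of the metrics above. The `Registry` class is a hypothetical stand-in written only for this example. A real implementation would presumably use an official Prometheus client library, and the label names `state` and `type`, the sample values, and the block heights are all assumptions. Histograms and label-value escaping are left out for brevity.

```python
from collections import defaultdict


class Registry:
    """Tiny in-memory stand-in for a Prometheus client registry (illustration only)."""

    def __init__(self):
        self._meta = {}                    # metric name -> (type, help text)
        self._samples = defaultdict(dict)  # metric name -> {label tuple: value}

    def declare(self, name, mtype, help_text):
        self._meta[name] = (mtype, help_text)

    def set(self, name, value, **labels):
        """Set a gauge to an absolute value."""
        self._samples[name][tuple(sorted(labels.items()))] = float(value)

    def inc(self, name, amount=1.0, **labels):
        """Increase a counter; counters must never go down."""
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted(labels.items()))
        self._samples[name][key] = self._samples[name].get(key, 0.0) + amount

    def render(self):
        """Render all declared metrics in the Prometheus text exposition format."""
        lines = []
        for name, (mtype, help_text) in self._meta.items():
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} {mtype}")
            for key, value in sorted(self._samples[name].items()):
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = "{" + label_str + "}" if label_str else ""
                lines.append(f"{name}{suffix} {value}")
        return "\n".join(lines) + "\n"


registry = Registry()
registry.declare("lora_adapters_total", "gauge", "Number of adapters by state")
registry.declare("lora_adapter_operations_total", "counter", "Operations by type")
registry.declare("lora_event_watcher_lag_blocks", "gauge",
                 "current_block - last_processed_block")

# Gauge: current adapter count per state (made-up values).
registry.set("lora_adapters_total", 3, state="active")
registry.set("lora_adapters_total", 1, state="loading")

# Counter: one increment per completed operation.
registry.inc("lora_adapter_operations_total", type="deploy")
registry.inc("lora_adapter_operations_total", type="deploy")

# Gauge: event watcher lag, computed as described in the metric list.
current_block, last_processed_block = 1_000_120, 1_000_100  # made-up heights
registry.set("lora_event_watcher_lag_blocks", current_block - last_processed_block)

metrics_text = registry.render()
print(metrics_text)
```

Keeping state and operation type as labels on one metric, rather than one metric per state, lets a dashboard sum or filter across them with a single query.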

Logging

  • Structured fields for adapter operations (taskID, userAddress, adapterName)
  • Request tracing through proxy → ownership check → backend
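
As a rough illustration of the two logging items above, the Python sketch below emits one JSON log line per stage of a request (proxy, ownership check, backend), each carrying the same structured fields. The `requestID` and `stage` fields, the logger name, and the sample values are assumptions added for the example and are not part of this issue. Output is captured in memory here only so it can be inspected; a service would normally write to stderr or a log collector.

```python
import io
import json
import logging
import uuid


class JSONFormatter(logging.Formatter):
    """Format each record as a single JSON object with the structured fields."""

    # taskID, userAddress, adapterName come from the issue;
    # requestID and stage are hypothetical additions for tracing.
    FIELDS = ("requestID", "stage", "taskID", "userAddress", "adapterName")

    def format(self, record):
        entry = {"level": record.levelname, "msg": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


stream = io.StringIO()  # in-memory sink, used here instead of stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(JSONFormatter())

logger = logging.getLogger("lora.serving")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(task_id, user_address, adapter_name):
    """Log one line per stage, all sharing a single request ID."""
    ctx = {
        "requestID": uuid.uuid4().hex,
        "taskID": task_id,
        "userAddress": user_address,
        "adapterName": adapter_name,
    }
    for stage in ("proxy", "ownership_check", "backend"):
        logger.info("request stage reached", extra={**ctx, "stage": stage})
    return ctx["requestID"]


request_id = handle_request("task-123", "0xabc", "my-adapter")  # made-up values
log_lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(stream.getvalue())
```

Filtering the logs on a single `requestID` then reconstructs the path of one request through all three stages.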

Acceptance Criteria

  • Metrics exposed on /metrics endpoint
  • Grafana dashboard template for LoRA serving
  • Alert rules for failed deployments and high event watcher lag
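
For the alert-rule criterion, a possible starting point is sketched below as a Prometheus rule file. It assumes the operation type is exposed as a label named `type` with the value `fail`, and the 15-minute window, 100-block threshold, 5-minute hold, group name, alert names, and severity labels are all placeholders to be tuned, not values from this issue.

```yaml
groups:
  - name: lora-serving            # hypothetical group name
    rules:
      - alert: LoRAAdapterDeployFailed
        # Assumes the operation type is a label named "type".
        expr: increase(lora_adapter_operations_total{type="fail"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "One or more LoRA adapter operations failed in the last 15 minutes"
      - alert: LoRAEventWatcherLagHigh
        # Threshold and hold duration are placeholders.
        expr: lora_event_watcher_lag_blocks > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Event watcher is more than 100 blocks behind the chain head"
```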
