Feature: Improve LoRA Training Graph with Validation-Aware Signals and Best-Checkpoint Selection #421

@1larity

Description

Note: This is a personal design / future-work note, not an immediate priority.
Current maintenance and stability issues take precedence.
This document is intended as a reference for improving training UX once the core pipeline is solid.


Problem

The current LoRA training UI plots only raw training loss (step vs loss).

This is useful for checking optimizer health, but it is not enough to judge output quality or decide when to stop, because:

  • Training loss can keep decreasing while output quality plateaus or the model overfits
  • Users cannot easily identify the best checkpoint
  • There is no visual signal for when further training has diminishing returns

Requested Improvements

Graph Metrics Expansion

  • Keep raw training loss
  • Add smoothed training loss (EMA or moving average)
  • Add validation loss (points or curve) at configurable intervals
  • Optionally add learning rate curve
    • Secondary y-axis, or
    • Separate subplot
  • Add best-checkpoint marker
    • Vertical line and/or highlighted point at best validation step or epoch
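
As a rough illustration of the smoothed-loss item above, an exponential moving average can be computed directly from the logged loss values. This is only a sketch: the function name `ema_smooth` and the default `alpha` are assumptions, not part of the existing code.

```python
def ema_smooth(values, alpha=0.1):
    """Return the exponential moving average of a loss series.

    alpha is the weight given to the newest value; a smaller alpha
    smooths more heavily. The first output equals the first input.
    """
    smoothed = []
    current = None
    for value in values:
        if current is None:
            current = value  # seed the average with the first sample
        else:
            current = alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed
```

The output has the same length as the input, so it can be plotted against the same step axis as the raw loss.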

Run Summary Annotations

Display key stats near or overlaid on the graph:

  • Latest smoothed training loss
  • Best validation loss + step / epoch
  • Elapsed time and optional ETA
  • Optional run metadata:
    • LoRA rank
    • Learning rate
    • Batch size
    • Dataset name
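
One possible way to assemble the summary line is sketched below. The helper names (`format_duration`, `format_run_summary`) and the linear ETA estimate are assumptions for illustration; the real UI may expose these values differently.

```python
def format_duration(seconds):
    """Format a duration in seconds as H:MM:SS."""
    seconds = int(round(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"


def format_run_summary(step, total_steps, elapsed_s, smoothed_loss, best_val=None):
    """Build a one-line run summary for display near the graph.

    best_val is an optional (loss, step) tuple; it is left out until
    the first validation pass has produced a result.
    """
    parts = [f"step {step}/{total_steps}", f"smoothed loss {smoothed_loss:.4f}"]
    if best_val is not None:
        parts.append(f"best val {best_val[0]:.4f} @ step {best_val[1]}")
    parts.append(f"elapsed {format_duration(elapsed_s)}")
    if 0 < step < total_steps:
        # Linear extrapolation: assumes every step takes the same time on average.
        eta_s = elapsed_s / step * (total_steps - step)
        parts.append(f"ETA {format_duration(eta_s)}")
    return " | ".join(parts)
```

The ETA is a simple extrapolation from average step time, so it will drift if validation passes or checkpoint saves take a significant share of the run.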

Best Checkpoint Handling

  • Track best validation metric during training
  • Save or copy the best checkpoint to a stable path
    • Example: output/checkpoints/best/
  • Continue saving:
    • Periodic epoch checkpoints (epoch_N)
    • Final checkpoint
  • Update export flow to allow explicit selection:
    • best
    • latest
    • final
    • manual epoch path
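
A minimal sketch of the tracking and export selection described above follows. It assumes each checkpoint is a directory and that `best`, `final`, and `epoch_N` all live under one checkpoints folder; the class and function names are hypothetical, and "latest" is interpreted here as the highest-numbered `epoch_N` directory.

```python
import shutil
from pathlib import Path


class BestCheckpointTracker:
    """Keep a copy of the checkpoint with the lowest validation loss."""

    def __init__(self, best_dir):
        self.best_dir = Path(best_dir)
        self.best_loss = float("inf")
        self.best_step = None

    def update(self, val_loss, step, checkpoint_dir):
        """Copy checkpoint_dir to best_dir if val_loss improved.

        Returns True when the best checkpoint was replaced.
        """
        if val_loss >= self.best_loss:
            return False
        self.best_loss = val_loss
        self.best_step = step
        if self.best_dir.exists():
            shutil.rmtree(self.best_dir)  # drop the previous best copy
        shutil.copytree(checkpoint_dir, self.best_dir)
        return True


def resolve_export_source(choice, checkpoints_dir, manual_path=None):
    """Map an export choice (best / latest / final / manual) to a path."""
    root = Path(checkpoints_dir)
    if choice in ("best", "final"):
        return root / choice
    if choice == "latest":
        prefix = "epoch_"
        epochs = [p for p in root.glob(prefix + "*") if p.name[len(prefix):].isdigit()]
        if not epochs:
            raise FileNotFoundError(f"no epoch checkpoints under {root}")
        # Compare numerically so epoch_10 sorts after epoch_2.
        return max(epochs, key=lambda p: int(p.name[len(prefix):]))
    if choice == "manual":
        if manual_path is None:
            raise ValueError("manual export requires manual_path")
        return Path(manual_path)
    raise ValueError(f"unknown export choice: {choice!r}")
```

Copying rather than moving keeps the periodic `epoch_N` checkpoints intact, at the cost of extra disk space for the duplicate.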

Auto-Stop Support (Optional)

  • Add early stopping based on validation metric:
    • patience
    • min_delta
    • eval_interval
  • Keep disabled by default
  • Make fully configurable in UI
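
The `patience` and `min_delta` options could behave roughly as in the sketch below. The class name, defaults, and `reason` attribute are assumptions; `eval_interval` is not handled here because it would belong to the training loop, which decides how often validation runs.

```python
class EarlyStopping:
    """Stop training when validation loss stops improving.

    patience:  number of consecutive evaluations without improvement
               that are tolerated before stopping.
    min_delta: minimum decrease in validation loss that counts as an
               improvement.
    enabled:   off by default, so existing runs behave as before.
    """

    def __init__(self, patience=3, min_delta=0.0, enabled=False):
        self.patience = patience
        self.min_delta = min_delta
        self.enabled = enabled
        self.best = float("inf")
        self.bad_evals = 0
        self.reason = None

    def step(self, val_loss):
        """Record one validation result; return True if training should stop."""
        if not self.enabled:
            return False
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
            return False
        self.bad_evals += 1
        if self.bad_evals >= self.patience:
            # Keep a human-readable reason so the caller can log it.
            self.reason = (
                f"no improvement of at least {self.min_delta} "
                f"for {self.patience} evaluations (best {self.best:.4f})"
            )
            return True
        return False
```

A training loop would call `step()` after each validation pass, for example every `eval_interval` steps, and log `reason` before breaking out.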

Implementation Notes

  • Current code already provides:
    • Step / loss
    • Time tracking
    • Periodic checkpointing
  • Validation metrics and best-checkpoint tracking are not yet implemented
  • For richer overlays, markers, and annotations:
    • Consider migrating from gr.LinePlot to gr.Plot
    • Use Plotly or Matplotlib for multi-layer graphs
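
If Plotly is chosen, the layered graph could be described as a figure specification along these lines. This sketch builds a plain dict in Plotly's figure format (`data` plus `layout`) so it has no dependencies; the function name and trace labels are assumptions, and the marker is drawn as a layout shape spanning the full plot height.

```python
def build_loss_figure(steps, raw_loss, smoothed_loss, val_steps, val_loss, best_step=None):
    """Return a Plotly-style figure spec (plain dict) with layered loss traces."""
    data = [
        {"type": "scatter", "mode": "lines", "name": "train loss (raw)",
         "x": list(steps), "y": list(raw_loss), "opacity": 0.35},
        {"type": "scatter", "mode": "lines", "name": "train loss (smoothed)",
         "x": list(steps), "y": list(smoothed_loss)},
        {"type": "scatter", "mode": "lines+markers", "name": "validation loss",
         "x": list(val_steps), "y": list(val_loss)},
    ]
    layout = {
        "xaxis": {"title": {"text": "step"}},
        "yaxis": {"title": {"text": "loss"}},
        "shapes": [],
    }
    if best_step is not None:
        # Vertical dashed line at the best validation step; yref="paper"
        # makes it span the whole plot regardless of the loss range.
        layout["shapes"].append({
            "type": "line", "xref": "x", "yref": "paper",
            "x0": best_step, "x1": best_step, "y0": 0, "y1": 1,
            "line": {"dash": "dash"},
        })
    return {"data": data, "layout": layout}
```

To my understanding, `plotly.graph_objects.Figure` accepts a dict of this shape and `gr.Plot` can display the resulting figure, but that wiring should be checked against the installed Plotly and Gradio versions.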

Acceptance Criteria

  • Graph displays:
    • Raw training loss
    • Smoothed training loss
    • Validation loss
  • Best checkpoint is:
    • Clearly indicated on the graph
    • Saved to a dedicated best path
  • User can export or select the best checkpoint directly from the UI
  • Early stopping:
    • Works when enabled
    • Logs the reason for stopping
  • Existing training flow remains fully backward compatible when new options are disabled
