Note: This is a personal design / future-work note, not an immediate priority.
Current maintenance and stability issues take precedence.
This document is intended as a reference for improving training UX once the core pipeline is solid.
Problem
The current LoRA training UI plots only raw training loss (step vs loss).
This is useful for checking optimizer health, but it is insufficient for quality decisions, because:
- Training loss can continue decreasing while model quality plateaus or overfits
- Users cannot easily identify the best checkpoint
- There is no visual signal for when further training has diminishing returns
Requested Improvements
Graph Metrics Expansion
- Keep raw training loss
- Add smoothed training loss (EMA or moving average)
- Add validation loss (points or curve) at configurable intervals
- Optionally add learning rate curve
  - Secondary y-axis, or
  - Separate subplot
- Add best-checkpoint marker
  - Vertical line and/or highlighted point at best validation step or epoch
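
To illustrate the smoothed training loss item, here is a minimal sketch of an exponential moving average over logged loss values. The function name `ema_smooth` and the default `alpha` are assumptions made for this note, not part of the existing code.

```python
from typing import Iterable, List


def ema_smooth(values: Iterable[float], alpha: float = 0.1) -> List[float]:
    """Return an exponentially smoothed copy of `values`.

    `alpha` is the weight given to the newest sample; smaller values
    smooth more. The first sample seeds the average, so the output has
    the same length as the input.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    current = None
    for value in values:
        # Seed with the first value, then blend each new sample in.
        current = value if current is None else alpha * value + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed
```

Seeding with the first sample avoids the artificial ramp-up from zero at the start of the curve. A plain moving average would work as well; the EMA is shown only because it needs no window buffer.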
Run Summary Annotations
Display key stats near or overlaid on the graph:
- Latest smoothed training loss
- Best validation loss + step / epoch
- Elapsed time and optional ETA
- Optional run metadata:
  - LoRA rank
  - Learning rate
  - Batch size
  - Dataset name
Best Checkpoint Handling
- Track best validation metric during training
- Save or copy the best checkpoint to a stable path
  - Example: `output/checkpoints/best/`
- Continue saving:
  - Periodic epoch checkpoints (`epoch_N`)
  - Final checkpoint
- Update export flow to allow explicit selection:
  - best
  - latest
  - final
  - manual epoch path
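
One possible shape for the best-checkpoint tracking described above, as a rough sketch. `BestCheckpointTracker` and its method names are hypothetical and do not exist in the current code. It assumes the validation metric is a loss (lower is better) and that a checkpoint is either a single file or a directory.

```python
import shutil
from pathlib import Path
from typing import Optional


class BestCheckpointTracker:
    """Keep a copy of the best checkpoint at a stable path (hypothetical helper)."""

    def __init__(self, best_dir) -> None:
        self.best_dir = Path(best_dir)  # e.g. output/checkpoints/best/
        self.best_loss: Optional[float] = None
        self.best_step: Optional[int] = None

    def update(self, step: int, val_loss: float, checkpoint_path) -> bool:
        """Copy `checkpoint_path` into `best_dir` if `val_loss` is a new best.

        Returns True when the best checkpoint was replaced.
        """
        if self.best_loss is not None and val_loss >= self.best_loss:
            return False
        src = Path(checkpoint_path)
        # Clear the previous best so stale files never mix with new ones.
        if self.best_dir.exists():
            shutil.rmtree(self.best_dir)
        self.best_dir.mkdir(parents=True)
        if src.is_dir():
            shutil.copytree(src, self.best_dir, dirs_exist_ok=True)
        else:
            shutil.copy2(src, self.best_dir / src.name)
        self.best_loss = val_loss
        self.best_step = step
        return True
```

Copying rather than moving keeps the periodic epoch checkpoints intact, so the export flow can still offer the best, latest, final, and manual options side by side.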
Auto-Stop Support (Optional)
- Add early stopping based on validation metric:
  - `patience`
  - `min_delta`
  - `eval_interval`
- Keep disabled by default
- Make fully configurable in UI
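
A minimal sketch of how `patience` and `min_delta` could interact, assuming the validation metric is a loss and that the check runs once per evaluation (that is, every `eval_interval` steps). The `EarlyStopping` class is hypothetical and only illustrates the logic.

```python
from typing import Optional


class EarlyStopping:
    """Signal a stop after `patience` evaluations without sufficient improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best: Optional[float] = None
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation result; return True if training should stop."""
        if self.best is None or val_loss < self.best - self.min_delta:
            # Improvement of more than min_delta: reset the counter.
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

    def reason(self) -> str:
        """Human-readable message for the training log."""
        return (
            f"early stop: no improvement greater than {self.min_delta} "
            f"in {self.bad_evals} evaluations (best val loss {self.best})"
        )
```

In a training loop this would be called only on evaluation steps, for example `if step % eval_interval == 0 and stopper.step(val_loss): log(stopper.reason()); break`, where `log` stands in for whatever logger the trainer uses.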
Implementation Notes
- Current code already provides:
  - Step / loss
  - Time tracking
  - Periodic checkpointing
- Validation metrics and best-checkpoint tracking are not yet implemented
- For richer overlays, markers, and annotations:
  - Consider migrating from `gr.LinePlot` to `gr.Plot`
  - Use Plotly or Matplotlib for multi-layer graphs
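
As a rough idea of what a multi-layer graph could look like with Matplotlib, the sketch below draws raw loss, smoothed loss, validation points, and a best-step marker on one axis. The function name `build_loss_figure` and its argument layout are assumptions for illustration. The figure is built with `matplotlib.figure.Figure` directly so no global pyplot state accumulates in a long-running UI.

```python
from matplotlib.figure import Figure


def build_loss_figure(steps, raw_loss, smoothed_loss, val_steps, val_loss):
    """Build a figure overlaying training and validation curves."""
    fig = Figure(figsize=(8, 4))
    ax = fig.subplots()
    # Raw loss is drawn faintly so the smoothed curve stays readable.
    ax.plot(steps, raw_loss, alpha=0.3, label="train loss (raw)")
    ax.plot(steps, smoothed_loss, label="train loss (smoothed)")
    ax.plot(val_steps, val_loss, marker="o", label="validation loss")
    if len(val_loss) > 0:
        # Mark the step with the lowest validation loss.
        best_idx = min(range(len(val_loss)), key=lambda i: val_loss[i])
        best_step = val_steps[best_idx]
        ax.axvline(best_step, linestyle="--", color="gray",
                   label=f"best (step {best_step})")
    ax.set_xlabel("step")
    ax.set_ylabel("loss")
    ax.legend()
    return fig
```

The returned figure could then be passed as the value of a `gr.Plot` component, which accepts Matplotlib figures. A Plotly version would follow the same structure with one trace per layer.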
Acceptance Criteria
- Graph displays:
  - Raw training loss
  - Smoothed training loss
  - Validation loss
- Best checkpoint is:
  - Clearly indicated on the graph
  - Saved to a dedicated `best` path
- User can export or select the best checkpoint directly from the UI
- Early stopping:
  - Works when enabled
  - Logs the reason for stopping
- Existing training flow remains fully backward compatible when new options are disabled