Experiment tracking #86 (Merged)
Commits (7, all by AymenFJA):

- `4464421` "This commit adds:"
- `caac48a` "This commit adds:"
- `cd05d38` "This commit:"
- `e75046f` "This commit:"
- `1279d14` "consider per iter config in tracker"
- `fac96eb` "This commit:"
- `811862b` "This commit improve stop reason for each tracker"
# ClearML Integration
ROSE ships a plug-and-play `ClearMLTracker` that wires ClearML into any learner with a single line. For parallel learners, each sub-learner's metrics appear as separate series inside the same task, directly overlaid in the ClearML UI for convergence comparison.

```bash
pip install rose[clearml]
```
---

## Quick start
```python
from rose.integrations.clearml_tracker import ClearMLTracker

learner.add_tracker(
    ClearMLTracker(
        project_name="ROSE-Materials-UQ",
        task_name="ensemble-run-01",
    )
)

async for state in learner.start(learner_names=["A", "B"], max_iter=15):
    print(f"[{state.learner_id}] iter {state.iteration}: mse={state.metric_value:.4f}")
    # tracking is fully automatic; no clearml calls here
```
Open the ClearML web UI, navigate to project `ROSE-Materials-UQ`, and select the task `ensemble-run-01`. The **Scalars** tab shows overlaid curves per learner.

A complete runnable example is at `examples/integrations/tracking/clearml/run_me.py`.

---
## What gets logged automatically

### Hyperparameters — logged once in `on_start`

The entire pipeline manifest is connected to the ClearML task without any user annotation:
| ClearML hyperparameter | Source |
|---|---|
| `learner_type` | Learner class name |
| `criterion/metric_name` | `as_stop_criterion(metric_name=...)` |
| `criterion/threshold` | `as_stop_criterion(threshold=...)` |
| `criterion/operator` | `as_stop_criterion(operator=...)` |
| `task/<name>/as_executable` | Per registered task |
| `task/<name>/<key>` | Explicit `log_params` dict declared in task decorator |
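The mapping in the table can be pictured as a simple flattening step. The sketch below is illustrative only (the manifest shape and field names are assumptions, not ROSE's actual internal structure):

```python
def flatten_manifest(manifest: dict) -> dict:
    """Flatten a hypothetical pipeline manifest into slash-separated names."""
    params = {"learner_type": manifest["learner_type"]}
    for key, value in manifest.get("criterion", {}).items():
        params[f"criterion/{key}"] = value
    for name, task in manifest.get("tasks", {}).items():
        params[f"task/{name}/as_executable"] = task["as_executable"]
        for key, value in task.get("log_params", {}).items():
            params[f"task/{name}/{key}"] = value
    return params

# Hypothetical manifest matching the table above
manifest = {
    "learner_type": "SequentialActiveLearner",
    "criterion": {"metric_name": "mean_squared_error_mse",
                  "threshold": 0.01, "operator": "<"},
    "tasks": {"training": {"as_executable": False,
                           "log_params": {"kernel": "rbf"}}},
}
print(flatten_manifest(manifest)["task/training/kernel"])  # rbf
```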
### Scalars — logged per iteration in `on_iteration`

| ClearML scalar | Source |
|---|---|
| `<metric_name>` (e.g. `mean_squared_error_mse`) | Stop criterion value |
| Any numeric key in `state.state` | Auto-extracted from task `dict` returns |
For **parallel learners**, `state.learner_id` is included automatically. The tracker logs each state as a separate `series` inside the same scalar title, making per-learner curves directly comparable without any user code.
### Task tags — logged in `on_stop`

| ClearML tag | Value |
|---|---|
| `stop:<reason>` | `stop:criterion_met` / `stop:max_iter_reached` / `stop:stopped` / `stop:error` |
| `final_iter:<n>` | Last completed iteration number |
Tags make it easy to filter tasks in the ClearML UI by outcome.
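As a toy illustration of outcome filtering (task names and tag sets below are hypothetical, and real filtering would happen in the ClearML UI or API rather than in plain Python):

```python
# Hypothetical finished tasks as (task_name, tags) pairs
tasks = [
    ("ensemble-run-01", {"stop:criterion_met", "final_iter:7"}),
    ("ensemble-run-02", {"stop:max_iter_reached", "final_iter:15"}),
    ("ensemble-run-03", {"stop:error", "final_iter:3"}),
]

# Keep only runs that converged before hitting the iteration cap
converged = [name for name, tags in tasks if "stop:criterion_met" in tags]
print(converged)  # ['ensemble-run-01']
```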
---
## Parallel learner comparison

The ClearML tracker is designed with parallel learners in mind. Each `on_iteration` call carries `state.learner_id`; the tracker logs each learner as a separate scalar series under the same title:
```
Scalars tab in ClearML UI:
┌─ mean_squared_error_mse ──────────────────────────────┐
│ ensemble-A ───────────\                               │
│ ensemble-B ────────────\──────────────────────────    │
└───────────────────────────────────────────────────────┘
```
No user code is required to achieve this: `state.learner_id` is already set by the parallel learner framework.
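A minimal sketch of the title/series bookkeeping described above, assuming nothing about ROSE's or ClearML's internals (class and method names here are invented for illustration):

```python
from collections import defaultdict

class ToySeriesLogger:
    """Collects scalars the way a title/series logger might."""

    def __init__(self):
        # title -> series -> {iteration: value}
        self.scalars = defaultdict(dict)

    def on_iteration(self, learner_id, iteration, metric_name, value):
        # learner_id becomes the series name, so per-learner curves
        # overlay under a single scalar title
        self.scalars[metric_name].setdefault(learner_id, {})[iteration] = value

logger = ToySeriesLogger()
logger.on_iteration("ensemble-A", 1, "mean_squared_error_mse", 0.42)
logger.on_iteration("ensemble-B", 1, "mean_squared_error_mse", 0.39)
print(sorted(logger.scalars["mean_squared_error_mse"]))  # ['ensemble-A', 'ensemble-B']
```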
---
## Multiple trackers

Attach ClearML alongside other trackers; they are independent observers:
```python
from rose.integrations.clearml_tracker import ClearMLTracker

learner.add_tracker(HPC_FileTracker("run.jsonl"))  # safety net on HPC
learner.add_tracker(ClearMLTracker(project_name="x", task_name="y"))
```
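The "independent observers" idea can be sketched as follows. This is a toy dispatcher, not ROSE's implementation; the point is that a failure in one tracker should not silence the others:

```python
class RecordingTracker:
    """Toy tracker that just records every state it sees."""
    def __init__(self, name):
        self.name, self.seen = name, []
    def on_iteration(self, state):
        self.seen.append(state)

class FailingTracker(RecordingTracker):
    """Toy tracker whose backend is unreachable."""
    def on_iteration(self, state):
        raise RuntimeError("tracker backend unavailable")

def notify_all(trackers, state):
    # Errors in one observer are isolated so the rest still run
    for tracker in trackers:
        try:
            tracker.on_iteration(state)
        except Exception:
            pass  # a real dispatcher would log the failure instead

good, bad = RecordingTracker("file"), FailingTracker("remote")
notify_all([bad, good], {"iteration": 1, "mse": 0.1})
print(len(good.seen))  # 1
```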
---

## Extending `ClearMLTracker`

To log additional artifacts (model checkpoints, prediction plots), override `on_stop`:
```python
from rose.integrations.clearml_tracker import ClearMLTracker

class ClearMLCheckpointTracker(ClearMLTracker):
    def on_stop(self, final_state, reason: str) -> None:
        # Log model checkpoint as a ClearML artifact before closing the task
        if final_state and self._task:
            checkpoint = final_state.get("model_checkpoint")
            if checkpoint:
                self._task.upload_artifact(
                    name="best_model",
                    artifact_object=checkpoint,
                )
        super().on_stop(final_state, reason)
```
---
# MLflow Integration

ROSE ships a plug-and-play `MLflowTracker` that wires MLflow into any learner with a single line. No MLflow calls belong inside your `async for` loop.
```bash
pip install rose[mlflow]
```

---
## Quick start
```python
from rose.integrations.mlflow_tracker import MLflowTracker

# Register tasks before attaching the tracker
@learner.training_task(as_executable=False, log_params={"kernel": "rbf"})
async def train(*args, **kwargs): ...

# add_tracker fires on_start(manifest) immediately, so tasks must already be registered
learner.add_tracker(
    MLflowTracker(
        experiment_name="surrogate-v1",
        run_name="gp-adaptive-kernel",  # optional
    )
)

async for state in learner.start(max_iter=30):
    print(f"iter {state.iteration}: mse={state.metric_value:.4f}")
    # tracking is fully automatic; no mlflow calls here
```
View results:

```bash
mlflow ui --port 5000
# Open http://localhost:5000 → experiment "surrogate-v1"
```
A complete runnable example is at `examples/integrations/tracking/mlflow/run_me_tracker.py`.

---
## What gets logged automatically

### Parameters — logged once in `on_start`
The entire pipeline manifest is logged as MLflow parameters without any user annotation:
| MLflow param | Source |
|---|---|
| `learner_type` | Learner class name |
| `criterion/metric_name` | `as_stop_criterion(metric_name=...)` |
| `criterion/threshold` | `as_stop_criterion(threshold=...)` |
| `criterion/operator` | `as_stop_criterion(operator=...)` |
| `task.<name>.as_executable` | Per registered task |
| `task.<name>.<key>` | Explicit `log_params` dict declared in task decorator |
### Metrics — logged per iteration in `on_iteration`

| MLflow metric | Source |
|---|---|
| `<metric_name>` (e.g. `mean_squared_error_mse`) | Stop criterion value |
| Any scalar in `state.state` | Auto-extracted from task `dict` returns |
Every key returned in a task's `dict` result appears as a metric; zero annotation required.
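The auto-extraction rule above can be sketched as filtering numeric values out of a task's return dict. This is an illustrative sketch of the idea, not ROSE's exact behaviour (the function name and the bool exclusion are assumptions):

```python
def extract_scalars(result: dict) -> dict:
    """Keep only int/float values, the kind that map to per-iteration metrics."""
    return {
        key: value
        for key, value in result.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }

# Hypothetical task return value: numeric keys become metrics, the rest are skipped
result = {"mse": 0.031, "labeled_count": 120, "model_path": "ckpt.pt", "converged": False}
print(extract_scalars(result))  # {'mse': 0.031, 'labeled_count': 120}
```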
### Tags — logged in `on_stop`
| MLflow tag | Value |
|---|---|
| `stop_reason` | `"criterion_met"` / `"max_iter_reached"` / `"stopped"` / `"error"` |
| `final_iteration` | Last completed iteration number |

---
## Adaptive config changes
When you call `learner.set_next_config(config)` to change hyperparameters between iterations, the new config appears in the next `IterationState.current_config`. MLflow captures this automatically in `on_iteration`; no manual `log_params()` call is needed.
```python
configs = {
    0: LearnerConfig(training=TaskConfig(kwargs={"--lr": 3e-4})),
    10: LearnerConfig(training=TaskConfig(kwargs={"--lr": 1e-4})),
    20: LearnerConfig(training=TaskConfig(kwargs={"--lr": 3e-5})),
}

async for state in learner.start(max_iter=30):
    next_iter = state.iteration + 1
    if next_iter in configs:
        learner.set_next_config(configs[next_iter])
        # MLflow records the config change; no manual call needed
```
---

## Multiple trackers

Attach MLflow alongside other trackers; they are independent observers:
```python
from rose.integrations.mlflow_tracker import MLflowTracker

learner.add_tracker(HPC_FileTracker("run.jsonl"))        # safety net
learner.add_tracker(MLflowTracker(experiment_name="x"))  # experiment comparison
```
---

## `MLflowTracker` vs manual wiring
The previous ROSE documentation showed a manual pattern where MLflow calls were placed inside the `async for` loop. That approach is now deprecated in favour of `add_tracker()`.

| | `MLflowTracker` | Manual wiring |
|---|---|---|
| Pipeline manifest as params | Automatic | Must write `log_params(...)` manually |
| Metrics per iteration | Automatic | Must call `log_metric(...)` inside loop |
| Stop reason tag | Automatic | Requires try/finally |
| MLflow code in control loop | None | Yes |
!!! tip
    If you need to log model artifacts (e.g. `mlflow.sklearn.log_model`) or custom plots,
    add that logic to a subclass of `MLflowTracker` by overriding `on_stop`:

    ```python
    class MLflowArtifactTracker(MLflowTracker):
        def on_stop(self, final_state, reason: str) -> None:
            # Log artifacts while the MLflow run is still active; the base
            # on_stop calls mlflow.end_run(), so it must run last
            if final_state and reason in ("criterion_met", "max_iter_reached"):
                model = load_model(final_state.get("checkpoint_path"))
                mlflow.sklearn.log_model(model, artifact_path="surrogate_model")
            super().on_stop(final_state, reason)
    ```
Review comment:

> The example for extending `MLflowTracker` has a small bug: it calls `super().on_stop()` before logging the model artifact. The base `on_stop` method calls `mlflow.end_run()`, which terminates the MLflow run, so any subsequent calls to log artifacts will either fail or start a new, separate run. To ensure all logging happens within the same active run, the `super().on_stop()` call should be moved to the end of the method, after the artifact has been logged. This pattern is correctly used in the `ClearMLTracker` extension example.