1 change: 1 addition & 0 deletions .gitignore
@@ -159,5 +159,6 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.

# rose specific
*.db
*.pkl
asyncflow.session.*
4 changes: 4 additions & 0 deletions docs/index.md
@@ -11,6 +11,10 @@ ROSE leverages [**RADICAL-Cybertools**](https://radical-cybertools.github.io), a
ROSE allows you to enable, scale, and accelerate your learning workflows across thousands of CPU cores and GPUs effectively and efficiently with just a few lines of code.
ROSE is built on [**RADICAL-AsyncFlow**](https://radical-cybertools.github.io/radical.asyncflow/) and the [**RADICAL-Pilot**](https://github.com/radical-cybertools/radical.pilot) runtime system, a powerful execution stack that enables the effortless distributed execution of millions of scientific tasks and applications such as executables, functions, and containers.

**Preemption-safe on HPC.** Every completed iteration is written to disk before the next one starts. If the job is killed mid-run, all completed iterations are already on disk — inspect them, resume from the last checkpoint, and never rerun a finished iteration.

**Clean separation of control and observability.** Your `async for` loop contains only decisions — `break`, `set_next_config()`, application logic. Tracking (MLflow, ClearML, file-based) is wired once with `learner.add_tracker(...)` and fires automatically at every lifecycle point. No tracking code belongs in the control loop.


<figure markdown="span" style="position: relative;">
<img src="assets/rose_mind_flow.png" alt="">
125 changes: 125 additions & 0 deletions docs/integrations/clearml.md
@@ -0,0 +1,125 @@
# ClearML Integration

ROSE ships a plug-and-play `ClearMLTracker` that wires ClearML into any learner with a
single line. For parallel learners, each sub-learner's metrics appear as separate series
inside the same task — directly overlaid in the ClearML UI for convergence comparison.

```bash
pip install rose[clearml]
```

---

## Quick start

```python
from rose.integrations.clearml_tracker import ClearMLTracker

learner.add_tracker(
    ClearMLTracker(
        project_name="ROSE-Materials-UQ",
        task_name="ensemble-run-01",
    )
)

async for state in learner.start(learner_names=["A", "B"], max_iter=15):
    print(f"[{state.learner_id}] iter {state.iteration}: mse={state.metric_value:.4f}")
    # tracking is fully automatic — no clearml calls here
```

Open the ClearML web UI, navigate to project `ROSE-Materials-UQ`, and select the task
`ensemble-run-01`. The **Scalars** tab shows overlaid curves per learner.

A complete runnable example is at
`examples/integrations/tracking/clearml/run_me.py`.

---

## What gets logged automatically

### Hyperparameters — logged once in `on_start`

The entire pipeline manifest is attached to the ClearML task as hyperparameters, without any user annotation:

| ClearML hyperparameter | Source |
|---|---|
| `learner_type` | Learner class name |
| `criterion/metric_name` | `as_stop_criterion(metric_name=...)` |
| `criterion/threshold` | `as_stop_criterion(threshold=...)` |
| `criterion/operator` | `as_stop_criterion(operator=...)` |
| `task/<name>/as_executable` | Per registered task |
| `task/<name>/<key>` | Explicit `log_params` dict declared in task decorator |
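
As an illustration of this flattening scheme, here is a plain-Python sketch of the key layout in the table above. The manifest field names (`criterion`, `tasks`, `log_params`) are assumptions for the example, not the tracker's actual internals:

```python
# Hypothetical manifest shape; the generated keys follow the table above.
manifest = {
    "learner_type": "SequentialActiveLearner",
    "criterion": {"metric_name": "mean_squared_error_mse",
                  "threshold": 0.01, "operator": "<="},
    "tasks": {"train": {"as_executable": False,
                        "log_params": {"kernel": "rbf"}}},
}

def flatten(manifest: dict) -> dict:
    """Map a manifest onto ClearML-style hyperparameter keys."""
    out = {"learner_type": manifest["learner_type"]}
    for key, value in manifest["criterion"].items():
        out[f"criterion/{key}"] = value
    for name, spec in manifest["tasks"].items():
        out[f"task/{name}/as_executable"] = spec["as_executable"]
        for key, value in spec.get("log_params", {}).items():
            out[f"task/{name}/{key}"] = value
    return out

params = flatten(manifest)
# params now contains e.g. "criterion/threshold" and "task/train/kernel"
```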

### Scalars — logged per iteration in `on_iteration`

| ClearML scalar | Source |
|---|---|
| `<metric_name>` (e.g. `mean_squared_error_mse`) | Stop criterion value |
| Any numeric key in `state.state` | Auto-extracted from task `dict` returns |

For **parallel learners**, `state.learner_id` is included automatically. The tracker logs
each state as a separate `series` inside the same scalar title, making per-learner curves
directly comparable without any user code.

### Task tags — logged in `on_stop`

| ClearML tag | Value |
|---|---|
| `stop:<reason>` | `stop:criterion_met` / `stop:max_iter_reached` / `stop:stopped` / `stop:error` |
| `final_iter:<n>` | Last completed iteration number |

Tags make it easy to filter tasks in the ClearML UI by outcome.
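
The tag strings follow a simple `prefix:value` scheme. A sketch of how they are assembled (illustrative only, not the tracker's source):

```python
# Tag scheme from the table above: "stop:<reason>" and "final_iter:<n>"
def stop_tags(reason: str, final_iter: int) -> list[str]:
    return [f"stop:{reason}", f"final_iter:{final_iter}"]

tags = stop_tags("criterion_met", 12)
# tags == ["stop:criterion_met", "final_iter:12"]
```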

---

## Parallel learner comparison

The ClearML tracker is designed with parallel learners in mind. Each `on_iteration` call
carries `state.learner_id` — the tracker logs each learner as a separate scalar series
under the same title:

```
Scalars tab in ClearML UI:
┌─ mean_squared_error_mse ──────────────────────────────┐
│ ensemble-A ───────────\ │
│ ensemble-B ────────────\────────────────────────── │
└───────────────────────────────────────────────────────┘
```

No user code is required to achieve this — `state.learner_id` is already set by the
parallel learner framework.

---

## Multiple trackers

Attach ClearML alongside other trackers — they are independent observers:

```python
from rose.integrations.clearml_tracker import ClearMLTracker

learner.add_tracker(HPC_FileTracker("run.jsonl")) # safety net on HPC
learner.add_tracker(ClearMLTracker(project_name="x", task_name="y"))
```
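
The tracker contract is deliberately small: any object implementing the `on_start`, `on_iteration`, and `on_stop` callbacks can be passed to `add_tracker`. A minimal standalone sketch of a file-based tracker (hypothetical, for illustration — not the shipped `HPC_FileTracker`):

```python
import json

class JSONLTracker:
    """Append one JSON line per lifecycle event (illustrative sketch)."""

    def __init__(self, path: str):
        self.path = path

    def _write(self, record: dict) -> None:
        # One line per event keeps a partially written file valid after preemption
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def on_start(self, manifest) -> None:
        self._write({"event": "start", "manifest": manifest})

    def on_iteration(self, state) -> None:
        self._write({"event": "iteration",
                     "iteration": state.iteration,
                     "metric": state.metric_value})

    def on_stop(self, final_state, reason: str) -> None:
        self._write({"event": "stop", "reason": reason})
```

Attach it exactly like the shipped trackers: `learner.add_tracker(JSONLTracker("run.jsonl"))`.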

---

## Extending `ClearMLTracker`

To log additional artifacts (model checkpoints, prediction plots) override `on_stop`:

```python
from rose.integrations.clearml_tracker import ClearMLTracker

class ClearMLCheckpointTracker(ClearMLTracker):
    def on_stop(self, final_state, reason: str) -> None:
        # Log model checkpoint as a ClearML artifact before closing the task
        if final_state and self._task:
            checkpoint = final_state.get("model_checkpoint")
            if checkpoint:
                self._task.upload_artifact(
                    name="best_model",
                    artifact_object=checkpoint,
                )
        super().on_stop(final_state, reason)
```
192 changes: 102 additions & 90 deletions docs/integrations/mlflow.md
@@ -1,125 +1,137 @@
# MLflow Integration

ROSE ships a plug-and-play `MLflowTracker` that wires MLflow into any learner with a single
line. No MLflow calls belong inside your `async for` loop.

```bash
pip install rose[mlflow]
```

---

## Quick start

```python
from rose.integrations.mlflow_tracker import MLflowTracker

# Register tasks before attaching the tracker
@learner.training_task(as_executable=False, log_params={"kernel": "rbf"})
async def train(*args, **kwargs): ...

# add_tracker fires on_start(manifest) immediately — tasks must already be registered
learner.add_tracker(
    MLflowTracker(
        experiment_name="surrogate-v1",
        run_name="gp-adaptive-kernel",  # optional
    )
)

async for state in learner.start(max_iter=30):
    print(f"iter {state.iteration}: mse={state.metric_value:.4f}")
    # tracking is fully automatic — no mlflow calls here
```

View results:

```bash
mlflow ui --port 5000
# Open http://localhost:5000 → experiment "surrogate-v1"
```

A complete runnable example is at
`examples/integrations/tracking/mlflow/run_me_tracker.py`.

---

## What gets logged automatically

### Parameters — logged once in `on_start`

The entire pipeline manifest is logged as MLflow parameters without any user annotation:

| MLflow param | Source |
|---|---|
| `learner_type` | Learner class name |
| `criterion/metric_name` | `as_stop_criterion(metric_name=...)` |
| `criterion/threshold` | `as_stop_criterion(threshold=...)` |
| `criterion/operator` | `as_stop_criterion(operator=...)` |
| `task.<name>.as_executable` | Per registered task |
| `task.<name>.<key>` | Explicit `log_params` dict declared in task decorator |

### Metrics — logged per iteration in `on_iteration`

| MLflow metric | Source |
|---|---|
| `<metric_name>` (e.g. `mean_squared_error_mse`) | Stop criterion value |
| Any scalar in `state.state` | Auto-extracted from task `dict` returns |

Every key returned in a task's `dict` result appears as a metric — zero annotation required.
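
A plain-Python sketch of that extraction rule (illustrative only — `extract_scalars` is a hypothetical helper, not the tracker's real code):

```python
# Hypothetical task result: every numeric value becomes an MLflow metric
result = {
    "mean_squared_error_mse": 0.042,  # stop-criterion metric
    "labeled_count": 120,             # extra scalar, logged automatically
    "checkpoint_path": "model.pkl",   # non-numeric: skipped
}

def extract_scalars(state: dict) -> dict:
    """Keep int/float values only, mirroring the auto-extraction rule."""
    return {k: v for k, v in state.items()
            if isinstance(v, (int, float)) and not isinstance(v, bool)}

metrics = extract_scalars(result)
# metrics == {"mean_squared_error_mse": 0.042, "labeled_count": 120}
```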

### Tags — logged in `on_stop`

| MLflow tag | Value |
|---|---|
| `stop_reason` | `"criterion_met"` / `"max_iter_reached"` / `"stopped"` / `"error"` |
| `final_iteration` | Last completed iteration number |

---

## Adaptive config changes

When you call `learner.set_next_config(config)` to change hyperparameters between iterations,
the new config appears in the next `IterationState.current_config`. MLflow captures this
automatically in `on_iteration` — no manual `log_params()` call needed.

```python
configs = {
    0: LearnerConfig(training=TaskConfig(kwargs={"--lr": 3e-4})),
    10: LearnerConfig(training=TaskConfig(kwargs={"--lr": 1e-4})),
    20: LearnerConfig(training=TaskConfig(kwargs={"--lr": 3e-5})),
}

async for state in learner.start(max_iter=30):
    next_iter = state.iteration + 1
    if next_iter in configs:
        learner.set_next_config(configs[next_iter])
        # MLflow records the config change — no manual call needed
```

---

## Multiple trackers

Attach MLflow alongside other trackers — they are independent observers:

```python
from rose.integrations.mlflow_tracker import MLflowTracker

learner.add_tracker(HPC_FileTracker("run.jsonl"))        # safety net
learner.add_tracker(MLflowTracker(experiment_name="x"))  # experiment comparison
```

---

## `MLflowTracker` vs manual wiring

The previous ROSE documentation showed a manual pattern where MLflow calls were placed
inside the `async for` loop. That approach is now deprecated in favour of `add_tracker()`.

| | `MLflowTracker` | Manual wiring |
|---|---|---|
| Pipeline manifest as params | Automatic | Must write `log_params(...)` manually |
| Metrics per iteration | Automatic | Must call `log_metric(...)` inside loop |
| Stop reason tag | Automatic | Requires try/finally |
| MLflow code in control loop | None | Yes |

!!! tip
    If you need to log model artifacts (e.g. `mlflow.sklearn.log_model`) or custom plots,
    add that logic to a subclass of `MLflowTracker` by overriding `on_stop`:

    ```python
    class MLflowArtifactTracker(MLflowTracker):
        def on_stop(self, final_state, reason: str) -> None:
            # Log artifacts first: the base on_stop ends the MLflow run,
            # so super() must be called last. load_model is a user-provided helper.
            if final_state and reason in ("criterion_met", "max_iter_reached"):
                model = load_model(final_state.get("checkpoint_path"))
                mlflow.sklearn.log_model(model, artifact_path="surrogate_model")
            super().on_stop(final_state, reason)
    ```