Experiment tracking #86 (Merged)
Commits (7, all by AymenFJA):

- `4464421` "This commit adds:"
- `caac48a` "This commit adds:"
- `cd05d38` "This commit:"
- `e75046f` "This commit:"
- `1279d14` "consider per iter config in tracker"
- `fac96eb` "This commit:"
- `811862b` "This commit improve stop reason for each tracker"
# ClearML Integration
ROSE ships a plug-and-play `ClearMLTracker` that wires ClearML into any learner with a single line. For parallel learners, each sub-learner's metrics appear as separate series inside the same task, directly overlaid in the ClearML UI for convergence comparison.

```bash
pip install rose[clearml]
```
---

## Quick start
```python
from rose.integrations.clearml_tracker import ClearMLTracker

learner.add_tracker(
    ClearMLTracker(
        project_name="ROSE-Materials-UQ",
        task_name="ensemble-run-01",
    )
)

async for state in learner.start(learner_names=["A", "B"], max_iter=15):
    print(f"[{state.learner_id}] iter {state.iteration}: mse={state.metric_value:.4f}")
    # tracking is fully automatic; no clearml calls here
```
Open the ClearML web UI, navigate to project `ROSE-Materials-UQ`, and select the task `ensemble-run-01`. The **Scalars** tab shows overlaid curves per learner.

A complete runnable example is at `examples/integrations/tracking/clearml/run_me.py`.

---
## What gets logged automatically

### Hyperparameters — logged once in `on_start`

The entire pipeline manifest is connected to the ClearML task without any user annotation:
| ClearML hyperparameter | Source |
|---|---|
| `learner_type` | Learner class name |
| `criterion/metric_name` | `as_stop_criterion(metric_name=...)` |
| `criterion/threshold` | `as_stop_criterion(threshold=...)` |
| `criterion/operator` | `as_stop_criterion(operator=...)` |
| `task/<name>/as_executable` | Per registered task |
| `task/<name>/<key>` | Explicit `log_params` dict declared in task decorator |
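The mapping in the table can be pictured as a simple flattening step. The sketch below is illustrative only (the manifest shape and field names are assumptions, not ROSE's actual internal structure):

```python
def flatten_manifest(manifest: dict) -> dict:
    """Flatten a hypothetical pipeline manifest into slash-separated names."""
    params = {"learner_type": manifest["learner_type"]}
    for key, value in manifest.get("criterion", {}).items():
        params[f"criterion/{key}"] = value
    for name, task in manifest.get("tasks", {}).items():
        params[f"task/{name}/as_executable"] = task["as_executable"]
        for key, value in task.get("log_params", {}).items():
            params[f"task/{name}/{key}"] = value
    return params

# Hypothetical manifest matching the table above
manifest = {
    "learner_type": "SequentialActiveLearner",
    "criterion": {"metric_name": "mean_squared_error_mse",
                  "threshold": 0.01, "operator": "<"},
    "tasks": {"training": {"as_executable": False,
                           "log_params": {"kernel": "rbf"}}},
}
print(flatten_manifest(manifest)["task/training/kernel"])  # rbf
```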
### Scalars — logged per iteration in `on_iteration`

| ClearML scalar | Source |
|---|---|
| `<metric_name>` (e.g. `mean_squared_error_mse`) | Stop criterion value |
| Any numeric key in `state.state` | Auto-extracted from task `dict` returns |
For **parallel learners**, `state.learner_id` is included automatically. The tracker logs each state as a separate `series` inside the same scalar title, making per-learner curves directly comparable without any user code.
### Task tags — logged in `on_stop`

| ClearML tag | Value |
|---|---|
| `stop:<reason>` | `stop:criterion_met` / `stop:max_iter_reached` / `stop:stopped` / `stop:error` |
| `final_iter:<n>` | Last completed iteration number |
Tags make it easy to filter tasks in the ClearML UI by outcome.
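As a toy illustration of outcome filtering (task names and tag sets below are hypothetical, and real filtering would happen in the ClearML UI or API rather than in plain Python):

```python
# Hypothetical finished tasks as (task_name, tags) pairs
tasks = [
    ("ensemble-run-01", {"stop:criterion_met", "final_iter:7"}),
    ("ensemble-run-02", {"stop:max_iter_reached", "final_iter:15"}),
    ("ensemble-run-03", {"stop:error", "final_iter:3"}),
]

# Keep only runs that converged before hitting the iteration cap
converged = [name for name, tags in tasks if "stop:criterion_met" in tags]
print(converged)  # ['ensemble-run-01']
```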
---
## Parallel learner comparison

The ClearML tracker is designed with parallel learners in mind. Each `on_iteration` call carries `state.learner_id`; the tracker logs each learner as a separate scalar series under the same title:
```
Scalars tab in ClearML UI:
┌─ mean_squared_error_mse ──────────────────────────────┐
│ ensemble-A ───────────\                               │
│ ensemble-B ────────────\──────────────────────────    │
└───────────────────────────────────────────────────────┘
```
No user code is required to achieve this: `state.learner_id` is already set by the parallel learner framework.
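A minimal sketch of the title/series bookkeeping described above, assuming nothing about ROSE's or ClearML's internals (class and method names here are invented for illustration):

```python
from collections import defaultdict

class ToySeriesLogger:
    """Collects scalars the way a title/series logger might."""

    def __init__(self):
        # title -> series -> {iteration: value}
        self.scalars = defaultdict(dict)

    def on_iteration(self, learner_id, iteration, metric_name, value):
        # learner_id becomes the series name, so per-learner curves
        # overlay under a single scalar title
        self.scalars[metric_name].setdefault(learner_id, {})[iteration] = value

logger = ToySeriesLogger()
logger.on_iteration("ensemble-A", 1, "mean_squared_error_mse", 0.42)
logger.on_iteration("ensemble-B", 1, "mean_squared_error_mse", 0.39)
print(sorted(logger.scalars["mean_squared_error_mse"]))  # ['ensemble-A', 'ensemble-B']
```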
---
## Multiple trackers

Attach ClearML alongside other trackers; they are independent observers:
```python
from rose.integrations.clearml_tracker import ClearMLTracker

learner.add_tracker(HPC_FileTracker("run.jsonl"))  # safety net on HPC
learner.add_tracker(ClearMLTracker(project_name="x", task_name="y"))
```
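The "independent observers" idea can be sketched as follows. This is a toy dispatcher, not ROSE's implementation; the point is that a failure in one tracker should not silence the others:

```python
class RecordingTracker:
    """Toy tracker that just records every state it sees."""
    def __init__(self, name):
        self.name, self.seen = name, []
    def on_iteration(self, state):
        self.seen.append(state)

class FailingTracker(RecordingTracker):
    """Toy tracker whose backend is unreachable."""
    def on_iteration(self, state):
        raise RuntimeError("tracker backend unavailable")

def notify_all(trackers, state):
    # Errors in one observer are isolated so the rest still run
    for tracker in trackers:
        try:
            tracker.on_iteration(state)
        except Exception:
            pass  # a real dispatcher would log the failure instead

good, bad = RecordingTracker("file"), FailingTracker("remote")
notify_all([bad, good], {"iteration": 1, "mse": 0.1})
print(len(good.seen))  # 1
```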
---

## Extending `ClearMLTracker`

To log additional artifacts (model checkpoints, prediction plots), override `on_stop`:
```python
from rose.integrations.clearml_tracker import ClearMLTracker

class ClearMLCheckpointTracker(ClearMLTracker):
    def on_stop(self, final_state, reason: str) -> None:
        # Log model checkpoint as a ClearML artifact before closing the task
        if final_state and self._task:
            checkpoint = final_state.get("model_checkpoint")
            if checkpoint:
                self._task.upload_artifact(
                    name="best_model",
                    artifact_object=checkpoint,
                )
        super().on_stop(final_state, reason)
```
---
# MLflow Integration

ROSE ships a plug-and-play `MLflowTracker` that wires MLflow into any learner with a single line. No MLflow calls belong inside your `async for` loop.
```bash
pip install rose[mlflow]
```

---
## Quick start
```python
from rose.integrations.mlflow_tracker import MLflowTracker

# Register tasks before attaching the tracker
@learner.training_task(as_executable=False, log_params={"kernel": "rbf"})
async def train(*args, **kwargs): ...

# add_tracker fires on_start(manifest) immediately, so tasks must already be registered
learner.add_tracker(
    MLflowTracker(
        experiment_name="surrogate-v1",
        run_name="gp-adaptive-kernel",  # optional
    )
)

async for state in learner.start(max_iter=30):
    print(f"iter {state.iteration}: mse={state.metric_value:.4f}")
    # tracking is fully automatic; no mlflow calls here
```
View results:

```bash
mlflow ui --port 5000
# Open http://localhost:5000 → experiment "surrogate-v1"
```
A complete runnable example is at `examples/integrations/tracking/mlflow/run_me_tracker.py`.

---
## What gets logged automatically

### Parameters — logged once in `on_start`
The entire pipeline manifest is logged as MLflow parameters without any user annotation:
| MLflow param | Source |
|---|---|
| `learner_type` | Learner class name |
| `criterion/metric_name` | `as_stop_criterion(metric_name=...)` |
| `criterion/threshold` | `as_stop_criterion(threshold=...)` |
| `criterion/operator` | `as_stop_criterion(operator=...)` |
| `task.<name>.as_executable` | Per registered task |
| `task.<name>.<key>` | Explicit `log_params` dict declared in task decorator |
### Metrics — logged per iteration in `on_iteration`

| MLflow metric | Source |
|---|---|
| `<metric_name>` (e.g. `mean_squared_error_mse`) | Stop criterion value |
| Any scalar in `state.state` | Auto-extracted from task `dict` returns |
Every key returned in a task's `dict` result appears as a metric; zero annotation required.
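The auto-extraction rule above can be sketched as filtering numeric values out of a task's return dict. This is an illustrative sketch of the idea, not ROSE's exact behaviour (the function name and the bool exclusion are assumptions):

```python
def extract_scalars(result: dict) -> dict:
    """Keep only int/float values, the kind that map to per-iteration metrics."""
    return {
        key: value
        for key, value in result.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }

# Hypothetical task return value: numeric keys become metrics, the rest are skipped
result = {"mse": 0.031, "labeled_count": 120, "model_path": "ckpt.pt", "converged": False}
print(extract_scalars(result))  # {'mse': 0.031, 'labeled_count': 120}
```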
### Tags — logged in `on_stop`
| MLflow tag | Value |
|---|---|
| `stop_reason` | `"criterion_met"` / `"max_iter_reached"` / `"stopped"` / `"error"` |
| `final_iteration` | Last completed iteration number |

---
## Adaptive config changes
When you call `learner.set_next_config(config)` to change hyperparameters between iterations, the new config appears in the next `IterationState.current_config`. MLflow captures this automatically in `on_iteration`; no manual `log_params()` call is needed.
```python
configs = {
    0: LearnerConfig(training=TaskConfig(kwargs={"--lr": 3e-4})),
    10: LearnerConfig(training=TaskConfig(kwargs={"--lr": 1e-4})),
    20: LearnerConfig(training=TaskConfig(kwargs={"--lr": 3e-5})),
}

async for state in learner.start(max_iter=30):
    next_iter = state.iteration + 1
    if next_iter in configs:
        learner.set_next_config(configs[next_iter])
        # MLflow records the config change; no manual call needed
```
---

## Multiple trackers

Attach MLflow alongside other trackers; they are independent observers:
```python
from rose.integrations.mlflow_tracker import MLflowTracker

learner.add_tracker(HPC_FileTracker("run.jsonl"))        # safety net
learner.add_tracker(MLflowTracker(experiment_name="x"))  # experiment comparison
```
---

## `MLflowTracker` vs manual wiring
The previous ROSE documentation showed a manual pattern where MLflow calls were placed inside the `async for` loop. That approach is now deprecated in favour of `add_tracker()`.

| | `MLflowTracker` | Manual wiring |
|---|---|---|
| Pipeline manifest as params | Automatic | Must write `log_params(...)` manually |
| Metrics per iteration | Automatic | Must call `log_metric(...)` inside loop |
| Stop reason tag | Automatic | Requires try/finally |
| MLflow code in control loop | None | Yes |
!!! tip
    If you need to log model artifacts (e.g. `mlflow.sklearn.log_model`) or custom plots,
    add that logic to a subclass of `MLflowTracker` by overriding `on_stop`:

    ```python
    class MLflowArtifactTracker(MLflowTracker):
        def on_stop(self, final_state, reason: str) -> None:
            # Log artifacts while the MLflow run is still active; the base
            # on_stop calls mlflow.end_run(), so it must run last
            if final_state and reason in ("criterion_met", "max_iter_reached"):
                model = load_model(final_state.get("checkpoint_path"))
                mlflow.sklearn.log_model(model, artifact_path="surrogate_model")
            super().on_stop(final_state, reason)
    ```
Review comment:

> The example for extending `MLflowTracker` has a small bug: it calls `super().on_stop()` before logging the model artifact. The base `on_stop` method calls `mlflow.end_run()`, which terminates the MLflow run, so any subsequent calls to log artifacts will either fail or start a new, separate run. To ensure all logging happens within the same active run, the `super().on_stop()` call should be moved to the end of the method, after the artifact has been logged. This pattern is correctly used in the `ClearMLTracker` extension example.