A privacy-first benchmark framework for evaluating ML/AI methods on ICU prediction tasks across 17+ hospital sites using the CLIF (Common Longitudinal ICU Format) data standard.
In the ICU, clinical decisions happen in real time and directly affect patient survival. Rich data (diagnoses, laboratory values, vital signs, and outcomes) can drive better care, and AI offers new ways to support clinical reasoning in these high-stakes environments. Yet building trustworthy, generalizable AI requires diverse datasets that reflect varied patient populations and clinical practices.
Public ICU datasets like MIMIC and eICU have enabled significant research progress, but they represent only a handful of institutions. Meanwhile, the vast majority of ICU data remains locked in private hospital silos, inaccessible for multi-site validation due to privacy regulations. This fragmentation limits our ability to develop AI that generalizes beyond the institutions where it was trained.
CLIF and FLAIR bridge this gap. The Common Longitudinal ICU Format (CLIF) provides a shared data standard that harmonizes ICU data across institutions. FLAIR builds on this foundation, enabling federated model evaluation on real-world private data from 17+ hospitals, without patient information ever leaving each site.
Existing ICU benchmarks like YAIB and HiRID-ICU-Benchmark have done excellent work harmonizing public datasets (MIMIC, eICU, HiRID, AUMCdb). However, models validated solely on public data face a critical limitation:
Models that perform well on public benchmarks may not generalize to real-world clinical settings.
Public datasets represent a handful of institutions with specific patient populations, workflows, and documentation practices. Real-world deployment requires validation across diverse hospitals.
Multi-site validation is essential, but patient data is protected by strict privacy regulations and institutional oversight (HIPAA, IRB review). Raw clinical data cannot be shared or centralized for traditional benchmark evaluation.
FLAIR enables researchers to validate their methods on private ICU datasets from 17+ US hospitals without the data ever leaving each site.
Result: Your model gets evaluated on real-world private clinical data from diverse institutions, enabling robust generalization assessment.
FLAIR is built on the Common Longitudinal ICU Format (CLIF) data standard, maintained by a consortium of 17+ academic medical centers across the United States.
| Metric | Value |
|---|---|
| Participating Sites | 17+ |
| Geographic Coverage | Nationwide (US) |
| Combined ICU Beds | 2,000+ |
| Data Standard | CLIF v2.1 |
The consortium includes major academic medical centers, community hospitals, and health systems, providing diverse patient populations, clinical workflows, and documentation practices for robust model validation.
Since CLIF consortium data is private, how do you develop your method? MIMIC-CLIF is the answer:
| | MIMIC-CLIF (Public) | CLIF Consortium (Private) |
|---|---|---|
| Source | PhysioNet | 17+ sites |
| Size | ~70k ICU stays | 500k+ ICU stays |
| Population | Single site | Diverse populations |
| Role | Development | Evaluation |

Both datasets use the same schema.
Because both datasets share the CLIF schema, code developed on MIMIC-CLIF should run on consortium data without modification.
FLAIR provides 7 clinically relevant prediction tasks:
Binary classification tasks:

| Task | Name | Description | Cohort Filter |
|---|---|---|---|
| 1 | Discharged Home | Predict if patient will be discharged directly home | All ICU patients |
| 2 | Discharged to LTACH | Predict if patient will go to long-term acute care | All ICU patients |
| 6 | Hospital Mortality | Predict in-hospital death (first 24hr ICU data) | 1st ICU stay ≥ 24hr |
| 7 | Unplanned ICU Readmission | Predict unplanned return to ICU (entire 1st ICU stay data) | 1st ICU stay ≥ 24hr |

Multiclass classification task:

| Task | Name | Description | Cohort Filter |
|---|---|---|---|
| 3 | 72-Hour Respiratory Outcome | Predict ventilator status at 72hr (on/off/expired) | IMV at 24hr only |

Regression tasks:

| Task | Name | Description | Cohort Filter |
|---|---|---|---|
| 4 | Hypoxic Proportion | Predict fraction of hypoxic hours (24-72hr window) | IMV at 24hr only |
| 5 | ICU Length of Stay | Predict 1st ICU stay duration (first 24hr data) | 1st ICU stay ≥ 24hr |
Each task defines a time window for data extraction:
- `window_start`: Beginning of the data collection period
- `window_end`: The prediction time, i.e. when the model makes its prediction
Critical: You can use ALL data points within the window (window_start to window_end), but you CANNOT use any data after window_end. Using data beyond the prediction time would be data leakage.
The window definition varies by task:
- Tasks 1-6: Window is the first 24 hours from ICU admission (`first_icu_start_time` to `+24hr`)
- Task 7: Window is the entire first ICU stay (`first_icu_start_time` to `first_icu_end_time`)
The window is task-specific and not always aligned with ICU admission/discharge times.
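The leakage rule can be made concrete with a small filter. The sketch below is plain Python and is not part of the FLAIR API: the `Observation` record and the `clip_to_window` helper are hypothetical names used only for illustration. It keeps measurements timestamped between `window_start` and `window_end` (inclusive) and drops everything else.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Observation:
    """Hypothetical record for one timestamped measurement."""
    recorded_dttm: datetime
    name: str
    value: float


def clip_to_window(observations, window_start, window_end):
    """Keep only observations with window_start <= time <= window_end."""
    return [
        obs for obs in observations
        if window_start <= obs.recorded_dttm <= window_end
    ]


# Example: a Tasks 1-6 style window (first 24 hours after ICU admission).
icu_start = datetime(2023, 1, 1, 8, 0)
window_end = icu_start + timedelta(hours=24)

observations = [
    Observation(icu_start - timedelta(hours=2), "heart_rate", 110.0),  # before the window
    Observation(icu_start + timedelta(hours=3), "heart_rate", 96.0),   # inside the window
    Observation(icu_start + timedelta(hours=30), "heart_rate", 88.0),  # after prediction time
]

usable = clip_to_window(observations, icu_start, window_end)
print([obs.value for obs in usable])
```

Applying a filter like this to every input table before feature engineering is one way to make sure nothing recorded after the prediction time reaches the model.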
Community-Driven: These tasks are driven by community needs. Have a cool prediction task idea? Open a PR or Issue! We're actively working on adding more tasks.
Note: Each task has its own cohort size (N) based on task-specific filters. All tasks share the same base criterion: hospitalizations with at least one ICU stay.
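To illustrate how a task-specific filter narrows the base cohort, here is a minimal sketch of the "1st ICU stay ≥ 24hr" filter. The `Stay` record and the `first_stay_at_least` function are hypothetical and are not FLAIR's implementation; only the field names `first_icu_start_time` and `first_icu_end_time` come from the window definitions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Stay:
    """Hypothetical per-hospitalization summary of the first ICU stay."""
    hospitalization_id: str
    first_icu_start_time: datetime
    first_icu_end_time: datetime


def first_stay_at_least(stays, min_hours=24):
    """Return stays whose first ICU stay lasted at least `min_hours`."""
    threshold = timedelta(hours=min_hours)
    return [
        s for s in stays
        if s.first_icu_end_time - s.first_icu_start_time >= threshold
    ]


start = datetime(2023, 3, 1, 12, 0)
base_cohort = [
    Stay("A", start, start + timedelta(hours=10)),  # too short, excluded
    Stay("B", start, start + timedelta(hours=24)),  # exactly 24hr, included
    Stay("C", start, start + timedelta(hours=90)),  # included
]

cohort = first_stay_at_least(base_cohort)
print(f"Base N: {len(base_cohort)}, task N: {len(cohort)}")
```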
FLAIR is a Python library that generates task-specific datasets for ICU prediction benchmarks:
| FLAIR Provides | You Provide |
|---|---|
| Task-specific cohort filtering | Your feature engineering |
| Consistent label extraction | Your model architecture |
| Temporal train/test splits | Your training pipeline |
| Demographics & time windows | Your evaluation |
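A temporal split assigns each hospitalization to train or test by date rather than at random, so the test period comes after the training period. The sketch below shows the idea only and is not FLAIR's code: `assign_split` is a hypothetical helper, and splitting on the admission date with inclusive bounds is an assumption.

```python
from datetime import date


def assign_split(admission_date, train_start, train_end, test_start, test_end):
    """Return "train", "test", or None for dates outside both ranges.

    All bounds are inclusive. Splitting on admission date is an assumption.
    """
    if train_start <= admission_date <= train_end:
        return "train"
    if test_start <= admission_date <= test_end:
        return "test"
    return None


bounds = dict(
    train_start=date(2020, 1, 1),
    train_end=date(2022, 12, 31),
    test_start=date(2023, 1, 1),
    test_end=date(2023, 12, 31),
)

for d in (date(2021, 6, 15), date(2023, 2, 1), date(2019, 5, 5)):
    print(d, assign_split(d, **bounds))
```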
Each task outputs a single parquet file with all required columns:
| Column | Type | Description |
|---|---|---|
| `hospitalization_id` | str | Unique identifier |
| `admission_dttm` | datetime | Hospital admission time |
| `discharge_dttm` | datetime | Hospital discharge time |
| `window_start` | datetime | ICU start time (input window start) |
| `window_end` | datetime | Prediction time (+24hr from ICU start for Tasks 1-6; end of first ICU stay for Task 7) |
| `{task_label}` | int/float | Task-specific label |
| `split` | str | "train" or "test" |
| `age_at_admission` | int | Patient age |
| `sex_category` | str | Patient sex |
| `race_category` | str | Patient race |
| `ethnicity_category` | str | Patient ethnicity |
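Before training, it can help to confirm that a generated file has the columns listed above. The check below is a small sketch in plain Python. `missing_columns` is a hypothetical helper, not a FLAIR function, and the label column name is passed in because it differs by task (for example `icu_los_hours` for the ICU LOS task).

```python
SHARED_COLUMNS = (
    "hospitalization_id",
    "admission_dttm",
    "discharge_dttm",
    "window_start",
    "window_end",
    "split",
    "age_at_admission",
    "sex_category",
    "race_category",
    "ethnicity_category",
)


def missing_columns(columns, label_column):
    """Return the expected columns that are absent, in a stable order."""
    present = set(columns)
    expected = SHARED_COLUMNS + (label_column,)
    return [name for name in expected if name not in present]


# With polars, `columns` could be `df.columns` after reading the parquet file.
example_columns = [c for c in SHARED_COLUMNS if c != "split"] + ["icu_los_hours"]
print(missing_columns(example_columns, "icu_los_hours"))
```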
# Install with pip
pip install flair-benchmark
# Or from source
git clone https://github.com/clif-consortium/FLAIR.git
cd FLAIR
pip install -e .
Requirements: Python 3.10+, clifpy
cp clif_config_template.json clif_config.json
Edit clif_config.json to set your data path and timezone.
from flair_benchmark import generate_task_dataset, TASK_REGISTRY
# View available tasks
print(list(TASK_REGISTRY.keys()))
# ['task1_discharged_home', 'task2_discharged_ltach', 'task3_outcome_72hr',
# 'task4_hypoxic_proportion', 'task5_icu_los', 'task6_hospital_mortality',
# 'task7_icu_readmission']
# Generate dataset for ICU LOS task with temporal split
df = generate_task_dataset(
config_path="clif_config.json",
task_name="task5_icu_los",
train_start="2020-01-01",
train_end="2022-12-31",
test_start="2023-01-01",
test_end="2023-12-31",
output_path="task5_icu_los.parquet"
)
print(f"Total N: {len(df)}")
print(f"Train: {len(df.filter(df['split'] == 'train'))}")
print(f"Test: {len(df.filter(df['split'] == 'test'))}")
import polars as pl
# Load dataset
df = pl.read_parquet("task5_icu_los.parquet")
# Split into train/test
train = df.filter(pl.col("split") == "train")
test = df.filter(pl.col("split") == "test")
# Access labels
y_train = train["icu_los_hours"]
y_test = test["icu_los_hours"]
# Access demographics for subgroup analysis
demographics = train.select(["age_at_admission", "sex_category", "race_category"])
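One way to use the demographic columns is to report a metric per subgroup. The sketch below computes mean absolute error by group in plain Python. The `mae_by_group` helper and the example values are illustrative only; the predictions stand in for your own model's output, and with polars the inputs could come from calls such as `test["icu_los_hours"].to_list()`.

```python
from collections import defaultdict


def mae_by_group(y_true, y_pred, groups):
    """Mean absolute error per group label."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("inputs must have the same length")
    errors = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, groups):
        errors[group].append(abs(truth - pred))
    return {group: sum(vals) / len(vals) for group, vals in errors.items()}


# Made-up ICU length-of-stay values (hours), for illustration only.
y_true = [48.0, 72.0, 30.0, 120.0]
y_pred = [50.0, 60.0, 36.0, 100.0]
groups = ["Female", "Male", "Female", "Male"]

result = mae_by_group(y_true, y_pred, groups)
print(result)
```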
FLAIR PRIVACY POLICY

1. No network requests
   - All network access is blocked at the Python socket level
   - Packages like requests, urllib3, and httpx are banned
   - Violation = immediate submission rejection
2. PHI protection
   - All outputs are scanned for PHI patterns
   - Cell counts < 10 are suppressed (HIPAA safe harbor)
   - Individual-level data never leaves the site
3. Review process
   - PIs at each site review code before execution
   - PIs have final say; they are not required to run your code
   - Code is inspected for data exfiltration attempts
4. Consequences
   - If data exfiltration is found during review:
     - The submitter is banned from FLAIR
     - The incident is reported to the submitter's institution

PIs are doing you a favor by running your code on their data. Respect their trust and protect patient privacy.
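The small-cell rule in item 2 can be illustrated with a short sketch. This is not FLAIR's output scanner: `suppress_small_cells` is a hypothetical helper that replaces any count below 10 with a placeholder before aggregate results are shared.

```python
def suppress_small_cells(counts, threshold=10, placeholder="<10"):
    """Replace counts below `threshold` with a placeholder string."""
    return {
        key: (count if count >= threshold else placeholder)
        for key, count in counts.items()
    }


# Illustrative aggregate counts only, not real data.
counts = {"train": 1250, "test": 310, "rare_subgroup": 4}
suppressed = suppress_small_cells(counts)
print(suppressed)
```

Totals that would let a reader back-calculate a suppressed cell may need additional care; this sketch does not handle that case.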
FLAIR/
├── flair/                      # Main package
│   ├── __init__.py             # Main API: generate_task_dataset()
│   ├── cohort/                 # Cohort builder (clifpy integration)
│   ├── config/                 # Configuration management
│   ├── datasets/               # Dataset builder
│   ├── helpers/                # table1, metrics, tripod_ai
│   └── tasks/                  # Task definitions (7 tasks)
│       ├── base.py             # BaseTask with build_task_dataset()
│       ├── task1_discharged_home.py
│       ├── task2_discharged_ltach.py
│       ├── task3_outcome_72hr.py
│       ├── task4_hypoxic_proportion.py
│       ├── task5_icu_los.py
│       ├── task6_hospital_mortality.py
│       └── task7_icu_readmission.py
├── clif_config_template.json   # CLIF data configuration template
├── pyproject.toml              # Package configuration
└── tests/                      # Test suite
| Project | Description |
|---|---|
| clifpy | Python library for CLIF data manipulation |
| MIMIC-CLIF | CLIF-formatted MIMIC-IV (development entry point) |
| CLIF Consortium | Official CLIF consortium website |
If you use FLAIR in your research, please cite:
@software{flair2024,
title = {FLAIR: Federated Learning Assessment for ICU Research},
author = {CLIF Consortium},
year = {2024},
url = {https://github.com/clif-consortium/FLAIR}
}
This source code is released under the Apache 2.0 license. See LICENSE for details.
We do not own any of the clinical datasets used with this benchmark. Access to CLIF consortium data requires approval from each participating institution.
- Website: clif-icu.com
- Email: clif_consortium@uchicago.edu
- Issues: GitHub Issues
