GastUJB is a large-scale, internally curated gastric whole slide image dataset collected at the Catholic University of Korea Uijeongbu St. Mary’s Hospital between 2014 and 2023.
It is designed to serve as a primary cohort for defining diagnostic taxonomies and evaluating fine-grained computational pathology tasks. The dataset features comprehensive coverage of gastric diagnostic categories with hierarchical labeling.
- Scale: Contains a total of 12,079 Whole Slide Images (WSIs).
- Real-World Data: Reflects routine clinical practice, covering a wide spectrum of diagnostic categories.
- Hierarchical Labels: Supports multi-level diagnostic tasks (fine-grained evaluation).
- Robust Evaluation: The train/test split strictly ensures patient-level independence to prevent data leakage.
The dataset is partitioned into training and testing sets. Class ratios are not forced to be equal, so the split reflects natural clinical prevalence; it is nonetheless designed to avoid excessive class imbalance while preserving the cohort's overall distribution.
| Dataset Split | Slide Count | Description |
|---|---|---|
| GastUJB-Train | | Primary training set for model development. |
| GastUJB-Test | | Held-out test set for validation and benchmarking. |
| Total | 12,079 | All de-identified slides. |
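The exact procedure used to build the GastUJB split is not described here. As an illustration only, a patient-level split can be sketched by grouping slides by patient and assigning whole patients to one side. The `(slide_id, patient_id)` pairs, the `test_fraction` value, and the function name are assumptions made for this example, not part of the dataset's tooling.

```python
import random
from collections import defaultdict


def patient_level_split(slides, test_fraction=0.2, seed=0):
    """Split (slide_id, patient_id) pairs so that no patient spans both sets.

    Hypothetical helper: the real GastUJB split may have been built differently.
    """
    # Group slide IDs under the patient they belong to.
    by_patient = defaultdict(list)
    for slide_id, patient_id in slides:
        by_patient[patient_id].append(slide_id)

    # Shuffle patients (not slides) so each patient lands in exactly one set.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [s for p in patients if p not in test_patients for s in by_patient[p]]
    test = [s for p in patients if p in test_patients for s in by_patient[p]]
    return train, test, test_patients
```

Because assignment happens per patient, the resulting slide ratio only approximates `test_fraction`; class balance would need a separate check.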
GastUJB covers a comprehensive range of gastric pathology categories and serves as the primary cohort for defining the study's diagnostic taxonomy. The labels are organized hierarchically and support three coarse-level, eleven fine-level, and five grade-level tasks.
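One way to work with hierarchical labels is a lookup from each fine-level label to its coarse-level parent. The sketch below assumes that fine-level labels nest under coarse-level ones; the label names are placeholders, since the actual GastUJB category names are not listed in this section.

```python
# Placeholder names only: the real GastUJB category names are not given here.
FINE_TO_COARSE = {
    "fine_01": "coarse_A",
    "fine_02": "coarse_A",
    "fine_03": "coarse_B",
    "fine_04": "coarse_C",
}


def to_coarse(fine_labels):
    """Map a sequence of fine-level labels to their coarse-level parents."""
    return [FINE_TO_COARSE[label] for label in fine_labels]
```

An unknown label raises `KeyError`, which surfaces taxonomy mismatches early rather than silently dropping slides.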
Access to the GastUJB dataset is available upon request. To obtain the dataset, researchers must sign a data usage agreement.
Please send an email to the contact listed below with the following details:
- Subject: Request for GastUJB Dataset Access
- Body: Please include your name, institution, and a brief description of how the data will be used.
Once access is granted, the dataset is organized as follows:
GastUJB/
├── 2014/
│   ├── 2014_process
│   │   ├── coordinates
│   │   ├── mask
│   │   └── ...
│   └── 2014_raw
│       ├── slide_1
│       │   ├── data_1.dat
│       │   ├── data_2.dat
│       │   └── ...
│       └── ...
├── ...
├── 2023/
│   ├── 2023_process
│   │   ├── coordinates
│   │   ├── mask
│   │   └── ...
│   └── 2023_raw
│       ├── slide_1
│       │   ├── data_1.dat
│       │   ├── data_2.dat
│       │   └── ...
│       └── ...
└── clinical_metadata.csv
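A minimal sketch for enumerating raw slides per year is shown here. It assumes that each `<year>/<year>_raw` folder contains one sub-directory per slide, as the layout above suggests; the function name `list_raw_slides` is hypothetical and not part of any released tooling.

```python
from pathlib import Path


def list_raw_slides(root):
    """Return {year: [slide directory paths]} for every <year>/<year>_raw folder."""
    slides = {}
    for year_dir in sorted(Path(root).iterdir()):
        # Skip clinical_metadata.csv and anything that is not a year folder.
        if not (year_dir.is_dir() and year_dir.name.isdigit()):
            continue
        raw_dir = year_dir / f"{year_dir.name}_raw"
        if raw_dir.is_dir():
            # Assumption: each slide is stored as a directory of .dat files.
            slides[year_dir.name] = sorted(p for p in raw_dir.iterdir() if p.is_dir())
    return slides
```

For example, `list_raw_slides("GastUJB")` would return a dictionary keyed by year strings such as `"2014"`, each holding the slide folders found under that year's `_raw` directory.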
Email: [email protected]
