Commit f29eb14

Merge pull request #77 from NSAPH/mahimakaur/issue76
incorporated lego data model .md
2 parents 3cb24b5 + 4ddd55a commit f29eb14

8 files changed

Lines changed: 132 additions & 0 deletions

handbook/_toc.yml

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ parts:
 chapters:
 - file: data
 - file: analytic
+- file : lego_data_model
 - file: mcbs
 - url: https://docs.google.com/document/d/1Nu6YG0NTazTW-jgWZgIlszdL1BUbPNCs5wXrPfK5XjM/edit?usp=sharing
   title: Data Mangement Plan Template

handbook/imgs/lego.jpg

23.5 KB

handbook/imgs/lego_domains.png

184 KB
260 KB
325 KB
287 KB

handbook/imgs/lego_system.png

74.9 KB

handbook/lego_data_model.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# LEGO Data Model

#### Overview

The LEGO Data Model is a structured framework designed to standardize data storage, structure, and management across projects. By adopting a modular approach, similar to LEGO building blocks, we ensure consistency, reproducibility, and scalability in data organization.

Access the **LEGO Catalog** at [🔗 https://lego-catalog.netlify.app/](https://lego-catalog.netlify.app/)

```{figure} imgs/lego.jpg
---
scale: 40%
align: right
---
```

#### Why the LEGO Data Model?

The LEGO Data Model is inspired by the modularity and standardization of LEGO blocks. Its key principles include:
- Modular Structure: Data is organized into well-defined, reusable components.
- Standardized Formats: Ensures interoperability between datasets.
- Hierarchical Organization: Data is structured by domains, subdomains, and time resolutions.
- Predefined Schema: Every dataset follows a standard schema with linking elements like county IDs.

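As a rough illustration of what a shared linking element makes possible, the sketch below joins two toy tables on a county identifier and year. The column names (`county_id`, `year`, `pm25`, `deaths`) and all values are hypothetical placeholders, not taken from any LEGO dataset.

```python
def link_on_keys(left, right, keys):
    """Inner-join two lists of row dicts on the given key columns."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


# Toy rows with made-up identifiers and values (assumptions for illustration).
exposure = [
    {"county_id": "00001", "year": 2015, "pm25": 7.9},
    {"county_id": "00002", "year": 2015, "pm25": 8.4},
]
outcomes = [
    {"county_id": "00001", "year": 2015, "deaths": 120},
    {"county_id": "00003", "year": 2015, "deaths": 310},
]

linked = link_on_keys(exposure, outcomes, keys=("county_id", "year"))
print(linked)
```

Only rows whose key values appear in both tables survive the join, which is why consistent identifiers across datasets matter.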
```{figure} imgs/lego_system.png
---
scale: 40%
align: right
---
```

#### Data Standards in the LEGO Data Model

##### File Formats
To optimize storage and processing, the LEGO Data Model supports:
- Parquet: Columnar storage format for tabular data.
- Shapefile (SHP): Used for spatial data.

##### Folder Structure & Naming Conventions
All datasets in our lab follow a structured hierarchy to ensure logical arrangement and easy access.

##### Folder Hierarchy Example
```
<main lab folder>/lego
├── <domain>
│   ├── <subdomain>__<data_source>
│   │   ├── <geo_resolution>__<time_resolution>
│   │   │   ├── <filename>_yyyy.parquet
```

##### Key Folder Components

- Domain: Broad research category (e.g., health, environment, social).
- Subdomain: Specific dataset type (e.g., hospitalization, demographics, air pollution).
- Data Source: The origin of the dataset (e.g., Medicare, Medicaid).
- Geographic Resolution: The spatial granularity (e.g., county, state, ZCTA).
- Time Resolution: The temporal frequency (e.g., annual, monthly, daily).
- File Naming Convention: Files are named `<filename>_yyyy.parquet`, where `yyyy` is the four-digit year, so that datasets can be identified consistently.

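The folder components above can be combined mechanically. The following is a minimal sketch, assuming the layout shown in the hierarchy example; the helper names `lego_path` and `parse_lego_path`, the root `/lab/lego`, and the example component values are all hypothetical.

```python
from pathlib import PurePosixPath


def lego_path(root, domain, subdomain, data_source,
              geo_resolution, time_resolution, filename, year):
    """Build <root>/<domain>/<subdomain>__<data_source>/<geo>__<time>/<filename>_yyyy.parquet."""
    folder = (
        PurePosixPath(root)
        / domain
        / f"{subdomain}__{data_source}"
        / f"{geo_resolution}__{time_resolution}"
    )
    return folder / f"{filename}_{year:04d}.parquet"


def parse_lego_path(path, root):
    """Split a path that follows the layout back into its components."""
    rel = PurePosixPath(path).relative_to(root)
    # Unpacking raises ValueError if the path does not have exactly four levels.
    domain, source_dir, resolution_dir, file_name = rel.parts
    subdomain, data_source = source_dir.split("__", 1)
    geo_resolution, time_resolution = resolution_dir.split("__", 1)
    stem, year = PurePosixPath(file_name).stem.rsplit("_", 1)
    return {
        "domain": domain,
        "subdomain": subdomain,
        "data_source": data_source,
        "geo_resolution": geo_resolution,
        "time_resolution": time_resolution,
        "filename": stem,
        "year": int(year),
    }


# Hypothetical component values, for illustration only.
example = lego_path("/lab/lego", "environmental", "air_pollution",
                    "example_source", "county", "annual", "pm25", 2015)
print(example)
print(parse_lego_path(example, "/lab/lego"))
```

Splitting on the double underscore keeps single underscores available inside subdomain and file names, which appears to be the reason the convention uses `__` as the separator.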
##### Notes
- Files are stored one per year, with the year (`yyyy`) in the filename.
- All files that share a common datapath and filename stem should have identical variables (columns).

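One way to check the second note is to compare the column lists of the yearly files in a folder. The sketch below assumes the column names have already been collected into a dictionary (for instance with `pyarrow.parquet.read_schema(path).names`, if PyArrow is available); the function name and the example columns are hypothetical.

```python
def find_column_mismatches(columns_by_file):
    """Compare every file's column list against the first file (by sorted name).

    Returns a dict mapping each mismatching file to the columns it is
    missing and the extra columns it carries, relative to the reference.
    """
    names = sorted(columns_by_file)
    if not names:
        return {}
    reference = list(columns_by_file[names[0]])

    mismatches = {}
    for name in names[1:]:
        columns = list(columns_by_file[name])
        if columns != reference:  # a different column order also counts
            mismatches[name] = {
                "missing": sorted(set(reference) - set(columns)),
                "extra": sorted(set(columns) - set(reference)),
            }
    return mismatches


# Hypothetical yearly files and columns, for illustration only.
columns_by_file = {
    "counts_2014.parquet": ["zcta", "year", "n"],
    "counts_2015.parquet": ["zcta", "year", "n"],
    "counts_2016.parquet": ["zcta", "year", "count"],
}
print(find_column_mismatches(columns_by_file))
```

An empty result means every file matches the reference; a file listed with empty `missing` and `extra` lists has the same columns in a different order.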
#### Navigating the LEGO Data Model

The LEGO Data Model includes five domains, accessible at the following locations:

* **FASSE** `/n/dominici_nsaph_l3/Lab/lego`
* **CANNON** `/n/dominici_lab/lab/lego` (excluding health)

##### Domains

On the LEGO Catalog home page, both the **Content** and **Subdatasets** tabs list the LEGO domains:

- `medicare` – Core datasets related to Medicare beneficiaries, covering health outcomes such as mortality, hospital admissions, and conditions including cardiovascular diseases, respiratory diseases, cancer, asthma, and Alzheimer's disease and related dementias, among others. The end-to-end preprocessing is fully reproducible and extensible.
- `legacy_medicare` – Datasets related to Medicare derived from preprocessed denominator and admissions data. They are labeled "legacy" because the processing steps from the raw source are not fully reproducible, so the product cannot be continued or extended.
- `environmental` – Datasets capturing environmental exposures, including climate-related factors (temperature, cyclones, humidity, heat alerts, and heat waves) and pollution data.
- `geoboundaries` – Geographical datasets containing shapefiles, crosswalks, and unique geospatial identifiers sourced from the U.S. Census Bureau. These datasets are essential for linking health and environmental data to geographic areas. Within the LEGO Data Model they form the "geographical backbone" of the data.
- `social` – Datasets providing demographic and socioeconomic information such as population distribution by age group, ethnic composition, housing statistics, and other key social variables.

```{figure} imgs/lego_domains.png
---
scale: 40%
align: center
---
```

##### Dataset Details

Each dataset page provides:
- Description – Overview of the dataset, including Keywords and Properties.
- Datapaths – Folder location for dataset access.
- Content or Subdatasets – List of files within the dataset.

Each file page includes:
- Description – Overview of the file contents.
- Data Dictionary – Comprehensive list of variables and their descriptions.

#### Example: Navigating the Medicare Core Datasets

- Access the Dataset Overview: On the catalog home page, open the Content tab and select `medicare`. View the description, metadata, file path, and keywords.

```{figure} imgs/lego_medicare_datapath.png
---
scale: 22%
align: center
---
```

- View the Content or Subdatasets: Click the Content or Subdatasets tab to explore related files (e.g., mortality, admissions, outcomes).

```{figure} imgs/lego_medicare_content.png
---
scale: 25%
align: center
---
```

- Explore Specific Files: Click on a file to view its data dictionary (e.g., `zcta_yearly → counts_yyyy.parquet`).

```{figure} imgs/lego_medicare_data_dictionary.png
---
scale: 25%
align: center
---
```

#### Leveraging the LEGO Data Model for Data Requests

The LEGO Data Model supports consistency, collaboration, and reproducibility. Researchers can search for datapaths, data dictionaries, and the associated data pipelines on GitHub, and collaborate through a shared, centralized dataset structure.
