Understanding Risk Factors in Chronic Kidney Disease

This project analyses a comprehensive dataset of factors that could play a role in the development of chronic kidney disease (CKD) in the population the dataset describes.

CKD has an extremely high socioeconomic impact. Direct medical costs are substantial: CKD affects ~7.2 million people at a cost of ~£6.4 billion per year in the UK (ckdexplained.co.uk), and 35.5 million people at a cost of $130 billion per year in the US (Medicare data 2022). There are also indirect costs, such as loss of working capacity and premature mortality, and severe quality-of-life costs.

Dataset Content

The Chronic Kidney Disease Dataset is publicly available from Kaggle.
It contains 1659 rows (i.e. patients) and 54 columns (i.e. variables).
Variables can be broadly divided into the following groups (this is not an exhaustive list):
- Descriptors: age; sex; gender; ethnicity; body-mass index (BMI)
- Biomarkers: blood pressure; cholesterol; HbA1c
- Modifiable environmental factors: smoking; alcohol consumption; diet; exercise
- Non (or less) modifiable environmental factors: heavy metal or chemical exposure; water quality
- Prior medical history: acute kidney disease; urinary/renal infection
- Family history: CKD; hypertension
- Symptoms of CKD: electrolyte imbalance; oedema; fatigue; nausea
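
As a rough illustration of the first look at the data, the sketch below loads a CSV with Pandas and checks its shape, missing values and data types. The four-row table and its column names (PatientID, Age, BMI, Smoking, Diagnosis) are made-up stand-ins, not the real file; with the actual Kaggle CSV the shape check would be expected to give 1659 rows and 54 columns.

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle CSV; column names are assumptions.
csv_text = """PatientID,Age,BMI,Smoking,Diagnosis
1,71,31.1,1,1
2,34,29.7,0,1
3,80,37.4,0,0
4,40,31.3,1,1
"""

# In the project this would read the downloaded file instead,
# e.g. pd.read_csv(<path to the Kaggle CSV>).
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape              # patients x variables
missing_per_column = df.isna().sum()   # count of missing values per column
dtype_counts = df.dtypes.astype(str).value_counts()  # how many columns per dtype

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(dtype_counts)
```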

Business Requirements

The business requirement is to develop a better understanding of the risk factors for CKD. This will allow government healthcare agencies, academic researchers and biotech companies to devise more effective monitoring, prevention and treatment strategies aimed at reducing morbidity and mortality from CKD.

Questions, Hypotheses and how to validate?

Questions

What risk factors most strongly associate with CKD?
What factors separate patients with diabetes and/or hypertension who have CKD from those who don't, i.e. what factors might be protective against CKD?

Hypotheses

Diabetes and hypertension are two of the biggest known risk factors for CKD. These conditions are thought to result from 'Metabolic Syndrome', which is characterised by abdominal obesity, hypertriglyceridemia, reduced high-density lipoprotein (HDL) cholesterol, and hyperglycemia.
I therefore predict that these markers of Metabolic Syndrome are also risk factors for CKD.
- Validation will be by individual scatter plots grouped by CKD patient or control.
Following on from the above, I predict that physical exercise is protective against CKD.
- Validation will be by KDE plot grouped by CKD patient or control.
Lastly, I predict that medical checkup frequency, medical adherence and health literacy are protective against CKD.
- Validation will be by KDE plot grouped by CKD patient or control.

After basic analysis, patients with a family history of kidney disease appeared overrepresented among CKD cases, suggesting a genetic component.

I therefore predict that CKD patients with a family history of kidney disease will be younger than CKD patients without family history.
- Validation will be by box plots using Plotly.

Project Plan

Data acquisition and ETL: Pandas, Feature Engine
Data exploration: simple analysis and visualisation; Matplotlib
Data analysis and visualisation to answer hypotheses: Seaborn, Plotly

A new CSV file will be created at the end of ETL and read into a new notebook for further visualisations.
I used stacked bar plots for categorical data because they make proportions easy to see.
I used KDE plots to visualise the distribution of continuous numerical data, as they also make it easy to compare groups.
I used scatter plots to look for correlations between the continuous numeric variables.
I used box plots to compare grouped numeric data.
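
As an illustration of the stacked bar plots for categorical data, the sketch below builds row-wise proportions with pd.crosstab and plots them with the Pandas plotting interface (Matplotlib underneath). The eight rows and the column names (Smoking, Diagnosis) are assumptions made up for the example. Normalising by row means every bar has the same height, so the split between CKD and control is what stands out.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented rows for illustration; column names are assumptions.
df = pd.DataFrame({
    "Smoking": [0, 0, 0, 1, 1, 1, 1, 0],
    "Diagnosis": [1, 0, 1, 1, 1, 1, 0, 1],
})

# Share of each Diagnosis value within each Smoking category (rows sum to 1).
proportions = pd.crosstab(df["Smoking"], df["Diagnosis"], normalize="index")

ax = proportions.plot(kind="bar", stacked=True, figsize=(5, 4))
ax.set_ylabel("Proportion of patients")
ax.set_title("Diagnosis by smoking status (illustrative data)")
plt.tight_layout()
```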

The rationale to map the business requirements to the Data Visualisations

My analysis allowed me to gain a deeper understanding of the factors associated with CKD in this population. Some of my hypotheses were not supported by the data, which highlights the need for, and usefulness of, the analysis: it will inform downstream stakeholders how and where to place resources.

Analysis techniques used

I imported my data from a CSV file; looked at basic information and statistics; changed a data type to help with basic visualisation; carried out basic visualisations of all appropriate variables; transformed the DataFrame by adding summary columns; and visualised and answered my hypotheses using Seaborn and Plotly.
I structured the data analysis techniques according to the ETL process outlined in the Project Plan. I did basic visualisation first to understand which transformations I wanted to make. When ETL was complete, I moved on to answering my questions and hypotheses.
I found the lack of correlations in the data a little limiting.
I used ChatGPT, in particular to help optimise plots, titles and legends.
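
The "adding summary columns" step mentioned above could be sketched as follows. How the project actually computed its summary columns is not stated here, so this is only one plausible approach: each component is min-max scaled to the 0 to 1 range and the row-wise mean is stored as a new column. The component column names (MedicalCheckupsFrequency, MedicationAdherence, HealthLiteracy), the choice of these three as inputs to 'HealthManagement', and all values are assumptions for illustration.

```python
import pandas as pd

# Invented values; component names and their link to 'HealthManagement'
# are assumptions, not the project's confirmed definition.
df = pd.DataFrame({
    "MedicalCheckupsFrequency": [0.5, 2.0, 3.5, 4.0],
    "MedicationAdherence": [2.0, 5.0, 9.0, 10.0],
    "HealthLiteracy": [1.0, 4.0, 6.0, 9.0],
})
components = ["MedicalCheckupsFrequency", "MedicationAdherence", "HealthLiteracy"]

# Min-max scale each component so different measurement ranges contribute equally.
col_min = df[components].min()
col_max = df[components].max()
scaled = (df[components] - col_min) / (col_max - col_min)

# Summary column: mean of the scaled components for each patient.
df["HealthManagement"] = scaled.mean(axis=1)
print(df)
```

Scaling before averaging is a design choice: without it, a component measured on a wider range would dominate the summary score.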

Ethical considerations

In this case, there are no ethical considerations, as the dataset was already anonymised.

Unfixed Bugs

I wanted to repeat a KDE plot of 'HealthManagement'. It used the same data as the earlier plot, but almost identical code gave me a different plot. In the end I deleted the repeat plot and referred the reader back to the earlier one.

Development Roadmap

I understood what we had learned and could remember what was possible, but I spent quite a bit of time looking back for code.
However, I am happy that a lot of basic code feels familiar.

Main Data Analysis Libraries

The main libraries used in this project were:
NumPy
Pandas
Matplotlib
Seaborn
Plotly

Credits

I would like to acknowledge and thank Rabie El Kharoua for making this dataset available, and Kaggle for hosting it.
I would like to acknowledge the use of ChatGPT and Copilot in generating and debugging my code.
- Copilot suggested adding a .gitattributes file to deal with the Line Feed warning from Git

Acknowledgements (optional)

I'd like to thank Vasi for his encouragement and understanding.

