
Heart Failure Prediction

Source of the Dataset

Dataset: "Heart Failure Prediction" by fedesoriano, available from Kaggle.

Licensed under the Open Data Commons Open Database License (ODbL) v1.0.

Accessible at the following link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

Description

The dataset contains 11 features and a binary target variable indicating whether a patient has heart disease (1) or not (0). The features cover demographic information, clinical measurements, and laboratory test results. The goal is to compare three machine learning models on predicting heart disease from these features.

Features

| Variable | Description |
| --- | --- |
| Age | Age of the patient in years. |
| Sex | Sex of the patient: M = Male, F = Female. |
| ChestPainType | Type of chest pain: TA = Typical Angina, ATA = Atypical Angina, NAP = Non-Anginal Pain, ASY = Asymptomatic. |
| RestingBP | Resting blood pressure (in mmHg). Values > 120 mmHg are considered elevated. |
| Cholesterol | Serum cholesterol in mg/dL. High values may indicate increased risk of heart disease. |
| FastingBS | Fasting blood sugar > 120 mg/dL: 1 = True, 0 = False. High levels may suggest diabetes. |
| RestingECG | Resting electrocardiogram results: Normal = Normal ECG, ST = ST-T wave abnormality, LVH = Left Ventricular Hypertrophy. |
| MaxHR | Maximum heart rate achieved during an exercise stress test (in bpm). |
| ExerciseAngina | Exercise-induced angina: Y = Yes, N = No. |
| Oldpeak | ST depression induced by exercise relative to rest. Indicates possible myocardial ischemia. |
| ST_Slope | Slope of the peak exercise ST segment: Up = Upsloping, Flat = Flat, Down = Downsloping. |
| HeartDisease | Target variable: 1 = presence of heart disease, 0 = absence of heart disease. |
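Several of these features are categorical codes rather than numbers. As an illustrative sketch (the repository itself uses R; the tiny inline DataFrame below is made-up sample data), they could be one-hot encoded with pandas before fitting a model such as logistic regression:

```python
import pandas as pd

# Made-up rows following the feature table above (not the real dataset).
df = pd.DataFrame({
    "Age": [40, 49],
    "ChestPainType": ["ATA", "NAP"],
    "ExerciseAngina": ["N", "Y"],
    "ST_Slope": ["Up", "Flat"],
    "HeartDisease": [0, 1],
})

# One-hot encode the categorical columns so models that expect numeric
# inputs (e.g. logistic regression) can consume them.
encoded = pd.get_dummies(df, columns=["ChestPainType", "ExerciseAngina", "ST_Slope"])
print(sorted(encoded.columns))
```

Each categorical column becomes one indicator column per observed level, while numeric columns pass through unchanged.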

Feature Plots

Numeric Variables

(Figures num_1–num_5: distribution plots of the numeric features.)

Categorical Variables

(Figures cat_1–cat_4: distribution plots of the categorical features.)

Data Cleaning

  • Feature exclusion:
    The Sex feature was removed because its distribution was skewed and not representative.

  • Missing / invalid values:
    About 20% of cholesterol values are recorded as 0, which is biologically impossible.
    → Solution: rows with cholesterol = 0 were removed since these cases were mostly labeled as “diseased” and would bias the model.
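The two cleaning steps above can be sketched in a few lines. This is a pandas illustration under assumptions (the repository uses R; column names follow the feature table, and the inline data is made up):

```python
import pandas as pd

# Made-up sample rows mirroring the dataset's schema.
df = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "Cholesterol": [289, 0, 180],
    "HeartDisease": [0, 1, 1],
})

# 1. Drop the skewed, unrepresentative Sex feature entirely.
df = df.drop(columns=["Sex"])

# 2. Remove rows with the biologically impossible Cholesterol = 0.
df = df[df["Cholesterol"] > 0]

print(len(df))  # rows remaining after cleaning
```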


Models

Applied models:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest

Validation method:
10-fold cross-validation was performed to ensure model robustness.
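The modelling setup can be sketched as follows. scikit-learn stands in for the repository's R code, and synthetic data stands in for the cleaned dataset, so treat this as a shape-of-the-workflow illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned heart-disease data.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    # 10-fold cross-validated accuracy, as in the validation method above.
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f}")
```

Reporting the mean over 10 folds gives a more robust estimate than a single train/test split.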


Results

Logistic Regression

Confusion Matrix:

|                | Actual: No | Actual: Yes |
| -------------- | ---------- | ----------- |
| Predicted: No  | 103        | 14          |
| Predicted: Yes | 14         | 92          |
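The confusion matrices in this section can be reduced to summary metrics. A small sketch using the logistic-regression counts above (note the orientation: rows are predictions, columns are actual labels):

```python
# Counts from the logistic-regression confusion matrix above.
tn, fn = 103, 14   # "Predicted: No" row  -> Actual: No, Actual: Yes
fp, tp = 14, 92    # "Predicted: Yes" row -> Actual: No, Actual: Yes

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall on diseased patients
specificity = tn / (tn + fp)   # recall on healthy patients

print(f"accuracy={accuracy:.3f} "
      f"sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f}")
```

Sensitivity is the metric most directly tied to false negatives, which matters for the model-selection criterion used later.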

Decision Tree

Confusion Matrix:

|                | Actual: No | Actual: Yes |
| -------------- | ---------- | ----------- |
| Predicted: No  | 89         | 11          |
| Predicted: Yes | 28         | 95          |

(Figure: decision tree diagram.)

Random Forest

Confusion Matrix:

|                | Actual: No | Actual: Yes |
| -------------- | ---------- | ----------- |
| Predicted: No  | 101        | 16          |
| Predicted: Yes | 16         | 90          |

(Figure: random forest diagram.)

Dimensionality Reduction

Based on the Mean Decrease Accuracy from the Random Forest model, the following features were excluded:

  • Cholesterol
  • FastingBS

Result:
Excluding these features had little to no impact on model performance, so they were safely removed.
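Mean Decrease Accuracy is a permutation-based importance: each feature is shuffled in turn and the resulting drop in accuracy is measured. A hedged sketch of the same idea with scikit-learn's `permutation_importance` (the closest analogue to R's `randomForest` output; the data here is synthetic, not the heart dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a mix of informative and noise features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=0)

# Features whose shuffling barely changes accuracy are candidates for
# exclusion, as was done with Cholesterol and FastingBS above.
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:.3f}")
```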

Logistic Regression (After Feature Reduction)

Confusion Matrix:

|                | Actual: No | Actual: Yes |
| -------------- | ---------- | ----------- |
| Predicted: No  | 103        | 15          |
| Predicted: Yes | 14         | 91          |

Decision Tree (After Feature Reduction)

Confusion Matrix:

|                | Actual: No | Actual: Yes |
| -------------- | ---------- | ----------- |
| Predicted: No  | 89         | 11          |
| Predicted: Yes | 28         | 95          |

Random Forest (After Feature Reduction)

Confusion Matrix:

|                | Actual: No | Actual: Yes |
| -------------- | ---------- | ----------- |
| Predicted: No  | 100        | 15          |
| Predicted: Yes | 17         | 91          |

Interpretation of Results

Most influential features:

  • ST_Slope
  • Oldpeak
  • ExerciseAngina
  • ChestPainType

These are medically meaningful:

  • Changes in ST_Slope and Oldpeak indicate ischemic changes in the heart.
  • ExerciseAngina and ChestPainType are direct indicators of coronary insufficiency.

Cholesterol, although generally relevant to heart disease, lost its predictive power here due to faulty data (values = 0).


Model Selection

Since false negatives (predicting a sick patient as healthy) are more critical than false positives, the best model was chosen based on minimizing false negatives.

➡️ Best model: Decision Tree
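The selection rule above can be written out explicitly. False negatives are the "Predicted: No, Actual: Yes" cells of the post-reduction confusion matrices (the Decision Tree's matrix is identical before and after reduction):

```python
# False-negative counts read off the confusion matrices above.
false_negatives = {
    "Logistic Regression": 15,
    "Decision Tree": 11,
    "Random Forest": 15,
}

# Pick the model that misses the fewest diseased patients.
best = min(false_negatives, key=false_negatives.get)
print(best)  # Decision Tree
```

This criterion deliberately accepts the Decision Tree's higher false-positive count (28) in exchange for its lower false-negative count.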


Improvement Suggestions

  • Collect more laboratory data (e.g., lipid profiles, inflammation markers) to improve prediction quality.
  • Improve data quality control to prevent biologically impossible entries.
  • Consider testing advanced models such as Gradient Boosting or Neural Networks for further optimization.
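As a starting point for the last suggestion, a minimal scikit-learn sketch of gradient boosting under the same 10-fold validation scheme (synthetic stand-in data; the repository's own pipeline is in R):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned heart-disease data.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)

gb = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(gb, X, y, cv=10)
print(f"mean 10-fold CV accuracy: {scores.mean():.3f}")
```

Comparing this score (and its confusion matrix) against the three existing models would show whether the added complexity pays off.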
