Heart Failure Prediction

Source of the Dataset

Dataset: "Heart Failure Prediction" by fedesoriano, available from Kaggle.

Licensed under the Open Data Commons Open Database License (ODbL) v1.0.

Accessible at the following link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

Description

The dataset contains 11 features and a binary target variable, `HeartDisease`, indicating whether a patient has heart disease (1) or not (0). The features cover demographic information, clinical measurements, and laboratory test results. The goal is to compare how well three machine learning models predict heart disease from these features.

Features

Variable	Description
`Age`	Age of the patient in years.
`Sex`	Sex of the patient: `M` = Male, `F` = Female.
`ChestPainType`	Type of chest pain: • `TA` = Typical Angina • `ATA` = Atypical Angina • `NAP` = Non-Anginal Pain • `ASY` = Asymptomatic
`RestingBP`	Resting blood pressure (in mmHg). Values >120 mmHg are considered elevated.
`Cholesterol`	Serum cholesterol in mg/dL. High values may indicate increased risk of heart disease.
`FastingBS`	Fasting blood sugar > 120 mg/dL: `1` = True, `0` = False. High levels may suggest diabetes.
`RestingECG`	Resting electrocardiogram results: • `Normal` = Normal ECG • `ST` = ST-T wave abnormality • `LVH` = Left Ventricular Hypertrophy
`MaxHR`	Maximum heart rate achieved during exercise stress test (in bpm).
`ExerciseAngina`	Exercise-induced angina: `Y` = Yes, `N` = No.
`Oldpeak`	ST depression induced by exercise relative to rest. Indicates possible myocardial ischemia.
`ST_Slope`	Slope of the peak exercise ST segment: • `Up` = Upsloping • `Flat` = Flat • `Down` = Downsloping.
`HeartDisease`	Target variable: `1` = Presence of heart disease, `0` = Absence of heart disease.

Feature Plots

Numeric Variables

Categorical Variables

Data Cleaning

Feature exclusion:
The Sex feature was removed because its distribution was skewed and not representative.
Missing / invalid values:
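About 20% of cholesterol values are recorded as 0 mg/dL, which is biologically impossible and most likely stands in for a missing measurement.
→ Solution: rows with cholesterol = 0 were removed. Most of these cases were labeled as “diseased”, so keeping them would let the placeholder value act as a spurious predictor and bias the model.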
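
The two cleaning steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's own code: the `clean_rows` helper and the sample rows are hypothetical, and only the column names come from the feature table above.

```python
def clean_rows(rows):
    """Drop the Sex column and discard rows whose Cholesterol is recorded as 0."""
    cleaned = []
    for row in rows:
        if float(row["Cholesterol"]) == 0:
            continue  # 0 mg/dL is treated as an invalid entry, not a measurement
        cleaned.append({key: value for key, value in row.items() if key != "Sex"})
    return cleaned


# Hypothetical sample rows for illustration only (not taken from the dataset).
sample = [
    {"Age": 54, "Sex": "M", "Cholesterol": 230, "HeartDisease": 0},
    {"Age": 61, "Sex": "F", "Cholesterol": 0, "HeartDisease": 1},
]
cleaned = clean_rows(sample)
print(cleaned)
```

Filtering before any train/test split keeps the invalid rows out of both training and evaluation.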

Models

Applied models:

Logistic Regression
Decision Tree
Random Forest

Validation method:
10-fold cross-validation was used to obtain a more stable estimate of model performance than a single train/test split would give.
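
As a rough sketch of what a 10-fold split looks like, the snippet below partitions sample indices into ten folds and yields one train/test pair per fold. The README does not say how the folds were built (for example, whether they were stratified by `HeartDisease`), so the `k_fold_indices` helper, the shuffling, and the seed are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for plain, non-stratified k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(25, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so every observation is used for evaluation once and for training nine times.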

Results

Logistic Regression

Confusion Matrix:

	Actual: No	Actual: Yes
Predicted: No	103	14
Predicted: Yes	14	92

Decision Tree

Confusion Matrix:

	Actual: No	Actual: Yes
Predicted: No	89	11
Predicted: Yes	28	95

Random Forest

Confusion Matrix:

	Actual: No	Actual: Yes
Predicted: No	101	16
Predicted: Yes	16	90
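
The confusion matrices above can be condensed into accuracy, sensitivity, and specificity. The sketch below assumes the layout shown in the tables (rows are predictions, columns are actual labels, and "Yes" is the positive class); the `summarize` helper is illustrative and not part of the project's code.

```python
# Counts copied from the confusion matrices above, in the order
# (true negatives, false negatives, false positives, true positives).
matrices = {
    "Logistic Regression": (103, 14, 14, 92),
    "Decision Tree": (89, 11, 28, 95),
    "Random Forest": (101, 16, 16, 90),
}


def summarize(tn, fn, fp, tp):
    """Return accuracy, sensitivity, and specificity for one confusion matrix."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),  # share of diseased patients detected
        "specificity": tn / (tn + fp),  # share of healthy patients cleared
    }


for name, counts in matrices.items():
    metrics = summarize(*counts)
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under this reading, the Decision Tree has the highest sensitivity (95 of 106 diseased patients detected) and the lowest specificity (89 of 117 healthy patients correctly cleared).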

Dimensionality Reduction

Based on their low Mean Decrease Accuracy in the Random Forest model, the following features were excluded:

Cholesterol
FastingBS

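Result:
Excluding these features had little to no impact on model performance: the Decision Tree confusion matrix is unchanged, Logistic Regression gains one false negative (14 to 15), and Random Forest trades one false negative for one false positive (false negatives 16 to 15, false positives 16 to 17). Both features were therefore removed.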
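
Mean Decrease Accuracy is a permutation-based importance measure: a feature's values are shuffled and the resulting drop in accuracy is recorded. The sketch below applies that idea to a toy model on synthetic data. It is not the project's code; `permutation_importance`, `toy_predict`, and the data are hypothetical, and a Random Forest implementation would typically compute the measure on out-of-bag samples rather than on a fixed set as done here.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy stand-in model: it uses feature 0 and ignores feature 1 entirely.
def toy_predict(row):
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

A feature the model never uses gets an importance of zero, which is the pattern that would justify dropping it.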

Logistic Regression (After Feature Reduction)

Confusion Matrix:

	Actual: No	Actual: Yes
Predicted: No	103	15
Predicted: Yes	14	91

Decision Tree (After Feature Reduction)

Confusion Matrix:

	Actual: No	Actual: Yes
Predicted: No	89	11
Predicted: Yes	28	95

Random Forest (After Feature Reduction)

Confusion Matrix:

	Actual: No	Actual: Yes
Predicted: No	100	15
Predicted: Yes	17	91

Interpretation of Results

Most influential features:

ST_Slope
Oldpeak
ExerciseAngina
ChestPainType

These are medically meaningful:

Changes in ST_Slope and Oldpeak indicate ischemic changes in the heart.
ExerciseAngina and ChestPainType are direct indicators of coronary insufficiency.

Cholesterol, although generally relevant to heart disease, contributed little predictive power here, most likely because of the faulty data (values recorded as 0).

Model Selection

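Since false negatives (predicting a sick patient as healthy) are more critical than false positives, the best model was chosen by minimizing false negatives.

➡️ Best model: Decision Tree

It produces the fewest false negatives (11, versus 14 to 15 for Logistic Regression and 15 to 16 for Random Forest), at the cost of the most false positives (28).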
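
The selection rule can be written down explicitly, which also shows how sensitive the choice is to the weight placed on false positives. This is a sketch under stated assumptions: the counts are read from the confusion matrices after feature reduction, while the `pick_model` helper and its cost weights are hypothetical and do not appear in the original analysis.

```python
# (false negatives, false positives) per model, read from the
# confusion matrices reported after feature reduction.
errors = {
    "Logistic Regression": (15, 14),
    "Decision Tree": (11, 28),
    "Random Forest": (15, 17),
}


def pick_model(errors, fn_cost=1.0, fp_cost=0.0):
    """Return the model with the lowest weighted error cost.

    The defaults count false negatives only, matching the rule used above.
    """
    return min(errors, key=lambda m: fn_cost * errors[m][0] + fp_cost * errors[m][1])


print(pick_model(errors))                        # false negatives only
print(pick_model(errors, fn_cost=1, fp_cost=1))  # equal weights
```

With these counts, the Decision Tree is preferred only when a false negative is weighted more than 3.5 times as heavily as a false positive; with equal weights, Logistic Regression has the lowest total error.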

Improvement Suggestions

Collect more laboratory data (e.g., lipid profiles, inflammation markers) to improve prediction quality.
Improve data quality control to prevent biologically impossible entries.
Consider testing advanced models such as Gradient Boosting or Neural Networks for further optimization.
