Dataset: "Heart Failure Prediction" by fedesoriano, available from Kaggle.
Licensed under the Open Data Commons Open Database License (ODbL) v1.0.
Accessible at the following link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
The dataset contains 11 features and a binary target variable, HeartDisease, indicating whether a patient has heart disease (1) or not (0). The features include demographic information, clinical measurements, and laboratory test results. The goal is to compare how well three machine learning models predict heart disease from these features.
| Variable | Description |
|---|---|
| Age | Age of the patient in years. |
| Sex | Sex of the patient: M = Male, F = Female. |
| ChestPainType | Type of chest pain: TA = Typical Angina, ATA = Atypical Angina, NAP = Non-Anginal Pain, ASY = Asymptomatic. |
| RestingBP | Resting blood pressure (in mmHg). Values >120 mmHg are considered elevated. |
| Cholesterol | Serum cholesterol in mg/dL. High values may indicate increased risk of heart disease. |
| FastingBS | Fasting blood sugar > 120 mg/dL: 1 = True, 0 = False. High levels may suggest diabetes. |
| RestingECG | Resting electrocardiogram results: Normal = Normal ECG, ST = ST-T wave abnormality, LVH = Left Ventricular Hypertrophy. |
| MaxHR | Maximum heart rate achieved during exercise stress test (in bpm). |
| ExerciseAngina | Exercise-induced angina: Y = Yes, N = No. |
| Oldpeak | ST depression induced by exercise relative to rest. Indicates possible myocardial ischemia. |
| ST_Slope | Slope of the peak exercise ST segment: Up = Upsloping, Flat = Flat, Down = Downsloping. |
| HeartDisease | Target variable: 1 = Presence of heart disease, 0 = Absence of heart disease. |
Feature exclusion:
The `Sex` feature was removed because its distribution was skewed and not representative.
Missing / invalid values:
About 20% of cholesterol values are recorded as `0`, which is biologically impossible.
→ Solution: rows with `Cholesterol = 0` were removed since these cases were mostly labeled as “diseased” and would bias the model.
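The two cleaning steps above can be sketched in a few lines. This is only an illustration using the standard library, not the project's actual preprocessing code: the three CSV rows are made-up values in the dataset's column layout, and the helper name `clean_rows` is hypothetical.

```python
import csv
import io

# ASSUMPTION: tiny made-up sample in the dataset's column layout (illustrative values only).
SAMPLE_CSV = """Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
65,M,ASY,115,0,0,Normal,93,Y,0.0,Flat,1
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
"""


def clean_rows(rows):
    """Drop the Sex column and discard rows whose Cholesterol is recorded as 0."""
    cleaned = []
    for row in rows:
        if float(row["Cholesterol"]) == 0:
            continue  # biologically impossible value, treated as invalid
        cleaned.append({key: value for key, value in row.items() if key != "Sex"})
    return cleaned


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_rows(rows)
print(len(rows), "rows before cleaning,", len(cleaned), "rows after")
```

Dropping rows rather than imputing a value follows the reasoning given above; an imputation strategy would be a different design choice with its own trade-offs.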
Applied models:
- Logistic Regression
- Decision Tree
- Random Forest
Validation method:
10-fold cross-validation was performed to estimate how well each model generalizes to unseen data.
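For readers unfamiliar with the procedure, the following is a minimal sketch of how 10-fold cross-validation partitions the data and averages a score over the folds. The function names (`k_fold_indices`, `cross_validate`) and the majority-class toy model are assumptions made for illustration; the project itself presumably relied on a library implementation.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(fit, score, X, y, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, and return the mean score."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Toy "model" (hypothetical): always predicts the majority class of the training labels.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(label == model for label in y_test) / len(y_test)


X = [[i] for i in range(25)]
y = [1] * 15 + [0] * 10
mean_accuracy = cross_validate(fit_majority, accuracy, X, y, k=10)
print("mean accuracy over 10 folds:", round(mean_accuracy, 3))
```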
Confusion Matrix:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 103 | 14 |
| Predicted: Yes | 14 | 92 |
Confusion Matrix:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 89 | 11 |
| Predicted: Yes | 28 | 95 |
Confusion Matrix:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 101 | 16 |
| Predicted: Yes | 16 | 90 |
Based on the Mean Decrease Accuracy from the Random Forest model, the following features were excluded:
- Cholesterol
- FastingBS
Result:
Excluding these features had little to no impact on model performance, so they were safely removed.
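Mean Decrease Accuracy measures how much a model's accuracy drops when the values of one feature are randomly permuted. The sketch below illustrates that idea on a toy predictor. It is a simplification and not the Random Forest procedure itself, which evaluates the drop on each tree's out-of-bag samples; the function name `mean_decrease_accuracy` and the toy data are assumptions for illustration.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(predict, X, y, n_repeats=10, seed=0):
    """For each column, return the mean accuracy drop after shuffling that column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the data with only this column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:] for row, value in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy setup (hypothetical): the label depends only on column 0; column 1 is noise.
X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [row[0] for row in X]


def toy_predict(row):
    return row[0]


importances = mean_decrease_accuracy(toy_predict, X, y)
print("importance per column:", importances)
```

A feature whose permutation leaves accuracy unchanged, like column 1 here, is a candidate for exclusion, which is the reasoning applied to the features listed above.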
Confusion Matrix:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 103 | 15 |
| Predicted: Yes | 14 | 91 |
Confusion Matrix:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 89 | 11 |
| Predicted: Yes | 28 | 95 |
Confusion Matrix:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | 100 | 15 |
| Predicted: Yes | 17 | 91 |
Most important features:
- ST_Slope
- Oldpeak
- ExerciseAngina
- ChestPainType
These are medically meaningful:
- Changes in ST_Slope and Oldpeak indicate ischemic changes in the heart.
- ExerciseAngina and ChestPainType are direct indicators of coronary insufficiency.
Cholesterol, although generally relevant to heart disease, lost its predictive power here due to faulty data (values = 0).
Since false negatives (predicting a sick patient as healthy) are more critical than false positives, the best model was chosen based on minimizing false negatives.
➡️ Best model: Decision Tree
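The selection rule can be checked directly against the confusion matrices reported after feature exclusion. The sketch below assumes the three matrices appear in the same order as the list of applied models (Logistic Regression, Decision Tree, Random Forest); that mapping is not stated explicitly and is an assumption, as is the helper name `summarize`.

```python
# Counts from the confusion matrices after feature exclusion
# (rows: predicted class, columns: actual class).
# ASSUMPTION: the matrices are listed in the same order as the applied models.
matrices = {
    "Logistic Regression": {"tn": 103, "fn": 15, "fp": 14, "tp": 91},
    "Decision Tree": {"tn": 89, "fn": 11, "fp": 28, "tp": 95},
    "Random Forest": {"tn": 100, "fn": 15, "fp": 17, "tp": 91},
}


def summarize(m):
    """Derive common metrics from one confusion matrix."""
    total = m["tn"] + m["fn"] + m["fp"] + m["tp"]
    return {
        "accuracy": (m["tp"] + m["tn"]) / total,
        "sensitivity": m["tp"] / (m["tp"] + m["fn"]),  # share of sick patients detected
        "specificity": m["tn"] / (m["tn"] + m["fp"]),  # share of healthy patients cleared
        "false_negatives": m["fn"],
    }


summaries = {name: summarize(m) for name, m in matrices.items()}

# Selection rule from the text: prefer the model with the fewest false negatives.
best = min(summaries, key=lambda name: summaries[name]["false_negatives"])

for name, s in summaries.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, "
          f"sensitivity={s['sensitivity']:.3f}, FN={s['false_negatives']}")
print("Fewest false negatives:", best)
```

Under this mapping the Decision Tree has the fewest false negatives (11) and the highest sensitivity (95/106), while its overall accuracy (184/223) is the lowest of the three, so the choice trades accuracy for fewer missed cases.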
- Collect more laboratory data (e.g., lipid profiles, inflammation markers) to improve prediction quality.
- Improve data quality control to prevent biologically impossible entries.
- Consider testing advanced models such as Gradient Boosting or Neural Networks for further optimization.