This project investigates how machine learning models can predict skin cancer status (benign vs. malignant) using demographic, environmental, biological, and behavioral features.
Using a supervised learning framework, we:
- Analyze relationships between risk factors and cancer status
- Engineer meaningful features using domain knowledge
- Compare multiple models to optimize predictive performance
Guiding research questions:
- Which factors are most predictive of skin cancer risk?
- How do different modeling approaches compare in performance?
- Does increasing model complexity improve generalization?
We use a Kaggle dataset consisting of:
- 50,000 training observations
- 20,000 test observations
- ~50 predictors
Each observation includes:
- Demographic information (age, etc.)
- Environmental exposure (UV levels)
- Biological traits (skin tone, immunosuppression)
- Behavioral factors (sun protection habits)
- Medical history (family history, lesions)
- Identified substantial missing data (~196k missing values)
- Missingness pattern consistent with MCAR
- Avoided row deletion due to high data loss
- Used MissForest (Random Forest–based imputation)
- Captures nonlinear relationships
- Outperformed MICE, KNN, and median imputation in test performance
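The project's imputation was done with the `missForest` package in R; a rough Python analogue, for illustration only, uses scikit-learn's `IterativeImputer` with a random-forest regressor so that each feature with missing values is predicted from the others. The toy data below is synthetic, not the Kaggle dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy numeric matrix with ~10% values missing completely at random (MCAR).
X = rng.normal(size=(200, 4))
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Random-forest-based iterative imputation: each column with missing
# entries is regressed on the remaining columns, in the spirit of missForest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```

Unlike median imputation, this approach can capture the nonlinear relationships among predictors noted above.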
- Log transformations for skewed variables
- Square-root transformation for UV exposure
- Polynomial features (e.g., age²)
- Interaction terms (e.g., UV × skin sensitivity)
- Noise variable filtering using correlation + PCA
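The transformations above can be sketched in a few lines of pandas. The column names here (`age`, `uv_exposure`, `lesion_count`, `skin_sensitivity`) are hypothetical stand-ins, since the actual predictor names are not listed in this summary.

```python
import numpy as np
import pandas as pd

# Hypothetical columns for illustration only.
df = pd.DataFrame({
    "age": [25, 40, 60, 72],
    "uv_exposure": [1.2, 3.4, 8.1, 5.5],
    "lesion_count": [0, 2, 5, 1],
    "skin_sensitivity": [0.3, 0.7, 0.9, 0.5],
})

# Log transform for a skewed count variable (log1p handles zeros).
df["log_lesion_count"] = np.log1p(df["lesion_count"])

# Square-root transform for UV exposure.
df["sqrt_uv"] = np.sqrt(df["uv_exposure"])

# Polynomial feature: age squared.
df["age_sq"] = df["age"] ** 2

# Interaction term: UV exposure x skin sensitivity.
df["uv_x_sensitivity"] = df["uv_exposure"] * df["skin_sensitivity"]
```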
- Missing data were distributed uniformly across all predictors, with no evidence of systematic clustering
- Dataset is balanced (~50% benign / 50% malignant)
- Age positively correlates with cancer risk
- Fair skin tones have higher malignancy rates
- Immunosuppression is a strong predictor
- Family history significantly increases risk
We trained and compared multiple classification models:
- Logistic Regression
- LASSO / Ridge
- Random Forest
- XGBoost
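A cross-validated comparison of this kind can be sketched with scikit-learn (the project itself used R). The snippet below compares plain, LASSO-penalized, and ridge-penalized logistic regression against a random forest on synthetic data; XGBoost is omitted here to keep the example self-contained, and all data and scores are illustrative, not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (unavailable) Kaggle data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "lasso": make_pipeline(StandardScaler(),
                           LogisticRegression(penalty="l1",
                                              solver="liblinear", C=0.5)),
    "ridge": make_pipeline(StandardScaler(),
                           LogisticRegression(penalty="l2", C=0.5,
                                              max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
```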
Logistic Regression with:
- Engineered nonlinear features
- Interaction terms
- Threshold tuning (0.495)
- Bagging (100 iterations)
- Baseline: 0.60420
- MissForest: 0.60485
- Feature Engineering: 0.60520
- Threshold Tuning: 0.60565
- Bagging: 0.60620
- Negative result: complex models did NOT outperform simpler ones
- Logistic regression generalized best
- Performance plateau (~60%) suggests high noise in dataset
- Dataset likely synthetic (balanced classes, uniform missingness)
- Missing key clinical variables (e.g., images, genetic markers)
- High noise limits achievable accuracy
- Model evaluated on a single test distribution
- Language: R
- Packages: `tidyverse`, `missForest`, `glmnet`, `randomForest`
- Methods:
  - Logistic Regression
  - Ensemble Learning (Bagging)