- 1. Business Problem
- 2. Dataset
- 3. Solution Strategy
- 4. Mind Map Hypothesis
- 5. Top 3 Data Insights
- 6. Machine Learning Modelling
- 7. Machine Learning Performance
- 8. Business Performance
- 9. Deploy
- 10. Next Steps
Disclaimer: This is a fictional business case. The company, context, CEO and business questions exist only in my imagination.
Insurance All is a company that has provided health insurance to its customers for a long time. Now the sales team is interested in offering a new service: car insurance.
As with the health insurance, customers of the vehicle insurance plan pay Insurance All an annual premium in order to be covered in case of an incident.
Last year, the marketing team surveyed 380,000 people, asking whether they would be interested in a new car insurance plan. The responses, whether or not the client showed interest in purchasing auto insurance, were saved in a database along with other attributes.
The product team selected 127,000 new customers who did not respond to the survey to participate in a campaign, in which they will receive an offer of the new car insurance product. The offer will be made by the sales team through phone calls.
However, the sales team only has the capacity to make 20,000 calls within the campaign period.
As a Data Scientist, your goal is to build a model that scores a list of customers, indicating how likely each one is to purchase car insurance. With the solution, the sales team hopes to prioritize the people with the greatest interest in the new product and thus optimize the campaign by contacting only the customers most likely to make a purchase. In addition to the model, a report will be delivered containing the following information:
- Insights with the most relevant attributes of customers interested in purchasing auto insurance.
- What percentage of interested customers will the sales team be able to contact by making 20,000 calls?
- By increasing the capacity to 40,000 calls, what percentage of interested customers will the sales team be able to contact?
- How many calls does the sales team need to make to reach 80% of the customers most likely to purchase the car insurance?
This dataset is available here. However, this project used a copy of the dataset hosted on the AWS platform. Each row represents a customer, and each column contains a customer attribute or the survey response.
Data fields
- id: Unique ID for the customer
- gender: Gender of the customer
- age: Age of the customer
- driving_license: 0 : Customer does not have DL, 1 : Customer already has DL
- region_code: Unique code for the region of the customer
- previously_insured: 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance
- vehicle_age: Age of the Vehicle
- vehicle_damage: 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past
- annual_premium: The amount customer needs to pay as premium in the year
- policy_sales_channel: Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
- vintage: Number of Days, Customer has been associated with the company
- response: 1 : Customer is interested, 0 : Customer is not interested
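As a rough illustration of the fields above, one row could be represented as a typed record. This is only a sketch: the Python types are assumptions (the field list does not say how `region_code`, `policy_sales_channel` or `vehicle_age` are stored), and `CustomerRecord` and `parse_row` are hypothetical names, not part of the project code.

```python
from dataclasses import dataclass
from typing import Optional

# Fields documented as 0/1 flags in the data dictionary above.
BINARY_FIELDS = ("driving_license", "previously_insured", "vehicle_damage")


@dataclass
class CustomerRecord:
    id: int
    gender: str
    age: int
    driving_license: int
    region_code: str           # assumption: kept as an opaque code
    previously_insured: int
    vehicle_age: str           # assumption: stored as a category label
    vehicle_damage: int
    annual_premium: float
    policy_sales_channel: str  # assumption: kept as an opaque code
    vintage: int
    response: Optional[int] = None  # missing for customers not surveyed


def parse_row(row: dict) -> CustomerRecord:
    """Convert a dict of strings (e.g. from csv.DictReader) to a record."""
    raw_response = row.get("response")
    record = CustomerRecord(
        id=int(row["id"]),
        gender=row["gender"],
        age=int(row["age"]),
        driving_license=int(row["driving_license"]),
        region_code=row["region_code"],
        previously_insured=int(row["previously_insured"]),
        vehicle_age=row["vehicle_age"],
        vehicle_damage=int(row["vehicle_damage"]),
        annual_premium=float(row["annual_premium"]),
        policy_sales_channel=row["policy_sales_channel"],
        vintage=int(row["vintage"]),
        response=int(raw_response) if raw_response not in (None, "") else None,
    )
    # Reject flag values other than 0 or 1.
    for name in BINARY_FIELDS:
        if getattr(record, name) not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1")
    return record


# Made-up example row, for illustration only.
example = parse_row({
    "id": "1", "gender": "Male", "age": "44", "driving_license": "1",
    "region_code": "28", "previously_insured": "0", "vehicle_age": "1-2 years",
    "vehicle_damage": "1", "annual_premium": "40454.0",
    "policy_sales_channel": "26", "vintage": "217",
})
```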
- Predict whether the customer will be interested in the car insurance;
- Database with customers' answers and info;
- Customers interested in car insurance;
- Performance of the model;
- Percentage of the customers reached with calls;
- Report;
- Which customer will be interested in the insurance?
- Which customers have a DL?
- What is the vehicle age range with the most interested customers?
- Cumulative Gain;
- Lift Curve;
- Precision at K;
- Recall at K;
- Phone calls only to the customers with the highest probability of taking out the insurance;
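The last item, calling only the highest-probability customers, comes down to sorting by score and cutting the list at the call capacity. A minimal sketch, assuming the model's scores are available as a mapping from customer id to probability (`build_call_list` and the sample scores are hypothetical):

```python
def build_call_list(scores: dict, capacity: int) -> list:
    """Return customer ids sorted by descending score, cut at `capacity`."""
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:capacity]]


# Made-up scores for five customers; the team can only make two calls.
scores = {101: 0.12, 102: 0.87, 103: 0.45, 104: 0.91, 105: 0.05}
call_list = build_call_list(scores, capacity=2)
print(call_list)
```

In the business case, `capacity` would be 20,000 (or 40,000 in the expanded scenario).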
Step 01. Data Description: Use descriptive statistics to measure the data distribution.
Step 02. Feature Engineering: Create features to describe the phenomenon.
Step 03. Exploratory Data Analysis: Find insights that better describe the phenomenon and break false assumptions.
Step 04. Data Preparation: Select the most important features and prepare the data for the following steps.
Step 05. Machine Learning Modelling: Machine Learning model selection and training.
Step 06. Hyperparameter Fine-Tuning: Find the best values for each parameter of the model.
Step 07. Business Translation: Convert the machine learning performance into business results.
Step 08. Deploy Model: Create a way to deliver the model results to other teams.
Insight 01: Interested customers are between 35 and 55 years old.
Insight 02: Customers with vehicles between 1 and 2 years old show much more interest (75.52%) than customers with vehicles less than 1 year old.
Insight 03: Customers who paid between $25,000 and $44,999 represent 57.13% of all interested customers.
All machine learning algorithms were trained using cross-validation on the training data. The tested models were:
- Logistic Regression
- Random Forest Classifier
- Extra Trees Classifier
- Gaussian NB
- XGB Classifier
- CatBoost Classifier
- AdaBoost Classifier
- LGBM Classifier
- KNeighbors Classifier
- Gradient Boosting Classifier
This dataset has almost no features that are strongly relevant for prediction. However, in this specific problem the interest is not in whether a person will buy the product or not; the main objective is to rank customers by a score so that those most likely to accept car insurance come first. What matters is each customer's position in the probability list, which makes this a learning-to-rank problem.
Given this, metrics like precision, recall and F1-score do not capture the nature of the problem. We are not interested in how good the model is at making hard classifications; the real goal is to know how good it is at sorting a list. For that reason, learning-to-rank problems are evaluated with ranking metrics such as precision at k, recall at k, cumulative gain and the lift curve.
Precision at K: Of the top k customers ranked by the model, how many are actually interested? This metric counts how many predictions were correct up to k and divides by the number of predictions made by the model up to k.
Recall at K: Of all the customers who are actually interested, how many appear in the top k? This metric counts how many predictions were correct up to k and divides by the total number of true examples.
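The two definitions above can be sketched in a few lines of plain Python. This is an illustrative version, not the project's own implementation; `precision_recall_at_k` and the toy labels and scores are made up for the example.

```python
def precision_recall_at_k(y_true, y_score, k):
    """Compute precision@k and recall@k for binary labels (1 = interested)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank customer indices from highest to lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    top_k = order[:k]
    hits = sum(y_true[i] for i in top_k)      # interested customers in the top k
    total_positives = sum(y_true)             # all interested customers
    precision = hits / len(top_k)
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Toy data: 8 customers, 3 of them interested.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
precision, recall = precision_recall_at_k(y_true, y_score, k=4)
print(f"precision@4={precision:.3f} recall@4={recall:.3f}")
```

Here the top 4 contain 2 of the 3 interested customers, so precision at 4 is 2/4 and recall at 4 is 2/3.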
Cumulative Gain: X% of the customer base, ordered by probability of purchase, contains Y% of everyone interested in the new product.
Lift Curve: How many times better is the ML model than a baseline model at each percentage of the base?
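A minimal sketch of both quantities at a single point of the curve, assuming the baseline is random selection (contacting X% of the base at random reaches, on average, X% of the interested customers). The function name and toy data are illustrative only.

```python
def gain_and_lift(y_true, y_score, fraction):
    """Cumulative gain and lift after contacting the top `fraction` of the base."""
    n = len(y_true)
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("at least one interested customer is required")
    k = max(1, round(n * fraction))           # number of customers contacted
    order = sorted(range(n), key=lambda i: y_score[i], reverse=True)
    hits = sum(y_true[i] for i in order[:k])
    gain = hits / total_positives             # share of interested customers reached
    lift = gain / (k / n)                     # ratio to the random baseline
    return gain, lift


# Toy data: 10 customers already listed from highest to lowest score.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
gain_20, lift_20 = gain_and_lift(y_true, y_score, fraction=0.2)
gain_50, lift_50 = gain_and_lift(y_true, y_score, fraction=0.5)
print(f"top 20%: gain={gain_20:.2f}, lift={lift_20:.2f}")
print(f"top 50%: gain={gain_50:.2f}, lift={lift_50:.2f}")
```

Evaluating the function over a grid of fractions from 0 to 1 gives the points of the two curves.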
| Model | Precision @K Mean | Recall @K Mean | ROC AUC |
|---|---|---|---|
| Logistic Regression | 0.79 ± 0.001 | 0.355 ± 0.001 | 0.807 |
| Random Forest Classifier | 0.895 ± 0.105 | 0.402 ± 0.047 | 0.816 |
| Extra Trees Classifier | 0.93 ± 0.099 | 0.417 ± 0.044 | 0.826 |
| Gaussian NB | 0.907 ± 0.094 | 0.407 ± 0.042 | 0.821 |
| XGB Classifier | 0.926 ± 0.092 | 0.416 ± 0.041 | 0.791 |
| CatBoost Classifier | 0.938 ± 0.088 | 0.421 ± 0.040 | 0.797 |
| AdaBoost Classifier | 0.947 ± 0.084 | 0.425 ± 0.038 | 0.811 |
| LGBM Classifier | 0.954 ± 0.081 | 0.428 ± 0.036 | 0.805 |
| KNeighbors Classifier | 0.949 ± 0.077 | 0.426 ± 0.035 | 0.800 |
| Gradient Boosting Classifier | 0.954 ± 0.428 | 0.428 ± 0.034 | 0.807 |
Gradient Boosting and LGBM showed excellent results in cross-validation, while AdaBoost and Extra Trees had great results in the Cumulative Gain and Lift Curve plots. I will compare these models in the next step to decide which one is the best.
Hyperparameter fine-tuning with Bayesian Optimization, using Optuna, was applied to LGBM, Gradient Boosting, Extra Trees and AdaBoost. All models showed similar results after parameter optimization, so the Lift and Cumulative Gain curves were essential for selecting the best one, since knowing the percentage of the base that will be reached is fundamental in learning-to-rank problems. These curves help data teams evaluate performance, and they are very important for the management and marketing teams to know how many customers they will reach.
In the range from ~10% to ~60% of the base, the tuned Extra Trees model had the best results, so it is the chosen model.
The most relevant attributes for defining a customer's interest are:
- Age
- Region Code
- Policy Sales Channel
- Vehicle Damage
- Previously Insured
- Annual Premium
- Vintage
2. What percentage of interested customers will the sales team be able to contact by making 20,000 calls?
64.618% of the interested customers. With the proposed model, the result is approx. 2.45 times better than random choice.
If each insurance policy costs $1,000.00, this represents $19,817,720.00 in revenue with random choice and $48,782,080.00 with the proposed model, a difference of $28,964,360.00.
3. By increasing the capacity to 40,000 calls, what percentage of interested customers will the sales team be able to contact?
99.298% of the interested customers. With the proposed model, the result is approx. 1.9 times better than random choice.
If each insurance policy costs $1,000.00, this represents $39,635,440.00 in revenue with random choice and $75,459,780.00 with the proposed model, a difference of $35,824,340.00.
4. How many calls does the sales team need to make to reach 80% of the customers most likely to purchase the car insurance?
The sales team will reach 80% of the interested customers if they make 25,647 phone calls.
If each insurance policy costs $1,000.00, this represents $25,153,260.00 in revenue with random choice and $60,977,600.00 with the proposed model, a difference of $35,824,340.00.
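The revenue comparisons in the three answers above follow the same arithmetic: subtract the random-choice revenue from the model revenue, and divide the two to see how many times larger the model's result is. The sketch below only re-derives those two quantities from the figures already stated; the scenario labels and the `summarize` helper are illustrative names.

```python
# (random-choice revenue, proposed-model revenue) in dollars, as reported above.
scenarios = {
    "20,000 calls": (19_817_720, 48_782_080),
    "40,000 calls": (39_635_440, 75_459_780),
    "80% reach (25,647 calls)": (25_153_260, 60_977_600),
}


def summarize(random_revenue, model_revenue):
    """Return the revenue difference and the model-to-random revenue ratio."""
    difference = model_revenue - random_revenue
    ratio = model_revenue / random_revenue
    return difference, ratio


summary = {name: summarize(*values) for name, values in scenarios.items()}
for name, (difference, ratio) in summary.items():
    print(f"{name}: +${difference:,} ({ratio:.2f}x the random-choice revenue)")
```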
The API is hosted on Heroku and is available at this URL: https://cross-sell-bot.herokuapp.com
At the /predict endpoint, send a list of clients to get each one's probability of purchasing car insurance.
- Id
- Gender
- Age
- Driving License
- Region Code
- Previously Insured
- Vehicle Age
- Vehicle Damage
- Annual Premium
- Policy sales channel
- Vintage
- Id
- Gender
- Age
- Driving License
- Region Code
- Previously Insured
- Vehicle Age
- Vehicle Damage
- Annual Premium
- Policy sales channel
- Vintage
- Prediction Probability
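A request to the /predict endpoint might look like the sketch below. The request format is not documented here, so everything about the payload is an assumption: a POST with a JSON list of records, and snake_case field names borrowed from the dataset's data fields. The helper names and the sample customer are hypothetical.

```python
import json
import urllib.request

# Base URL from this README plus the /predict path.
API_URL = "https://cross-sell-bot.herokuapp.com/predict"


def build_request(customers):
    """Build a POST request carrying the customers as a JSON list (assumed format)."""
    body = json.dumps(customers).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_predictions(customers, timeout=30):
    """Send the request and decode the JSON response (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(customers), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Made-up customer; field names and value formats are assumptions.
sample_customers = [{
    "id": 1, "gender": "Male", "age": 44, "driving_license": 1,
    "region_code": 28, "previously_insured": 0, "vehicle_age": "1-2 years",
    "vehicle_damage": 1, "annual_premium": 40454.0,
    "policy_sales_channel": 26, "vintage": 217,
}]
request = build_request(sample_customers)
# predictions = fetch_predictions(sample_customers)  # performs the network call
```

The network call is left commented out so the snippet can be run without contacting the service.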
Any employee of Insurance All can use this spreadsheet to rank the customers most likely to purchase vehicle insurance, directly from production data.
For new CRISP-DS rounds, try moving the deployment to a better architecture and explore new ways to deliver this product.