Do you want to integrate machine learning into your organization and make sure it maximizes impact on employee morale, customers, and your bottom line? This repository is supplemental material to the Medium series *How to effectively adopt machine learning in an organization?*
Machine learning has become very popular in recent years, and entry costs to the field are low thanks to the rising popularity of massive open online courses. Most of them focus on model building and very few on model deployment; to my knowledge, none of them covers the complete ML integration lifecycle in an organization beyond CRISP-DM.
The purpose of the series is to provide a practical, step-by-step recipe for making the adoption of this amazing technology in an organization successful.
The purpose of this repository is to show fellow data scientists how to design and deploy a Natural Language Processing application following the CRISP-DM process.
After passing through all CRISP-DM phases, you should end up with a Disaster Response Workflow Tool deployed on Elastic Beanstalk (AWS). You can have fun and play with the tool here. Refer to this article of the series for more details.
During natural disasters, response teams are overwhelmed by thousands of messages arriving either directly or through social media. They need to filter the relevant requests, then analyze and prioritize them to make sure the proper organization responds in time to help impacted individuals.
Empowering response teams with a workflow tool based on NLP technology would free up resources and enable teams to react faster, saving more lives and reducing financial losses from potential damage to public and private property.
The dataset provided by Figure Eight contains 30,000 messages and news articles drawn from hundreds of different disasters. The messages have been classified into 36 different categories related to disaster response.
Messages and categories are stored in separate CSV files. They are fed to an ETL pipeline which cleans, merges, and saves the data to an SQLite database. To run the ETL pipeline, clone the repository and run the script in a terminal, providing the CSV file paths, the database filepath, and the table mode, respectively:
cd path/to/cloned/repository/data
python process_data.py disaster_messages.csv disaster_categories.csv disaster_response.db replace
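Under the hood, `process_data.py` follows a standard pandas workflow. Below is a minimal sketch of such an ETL step; the exact column handling and the table name `messages` are assumptions for illustration, not the verbatim implementation:

```python
import pandas as pd
from sqlalchemy import create_engine

def etl(messages_csv, categories_csv, db_path, if_exists="replace"):
    # Load and merge the two CSV files on their shared id column
    messages = pd.read_csv(messages_csv)
    categories = pd.read_csv(categories_csv)
    df = messages.merge(categories, on="id")

    # Split the raw category string (e.g. "related-1;request-0;...")
    # into 36 binary indicator columns
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [value.split("-")[0] for value in cats.iloc[0]]
    for col in cats.columns:
        cats[col] = cats[col].str[-1].astype(int)

    # Replace the raw column, drop duplicates, and persist to SQLite
    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    df = df.drop_duplicates()
    engine = create_engine(f"sqlite:///{db_path}")
    df.to_sql("messages", engine, index=False, if_exists=if_exists)
```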
The business challenge can be translated into a classic document classification task, where each document (message) can be labeled with one or more of 36 categories. It is a multi-label classification problem which can be approached from three perspectives:
1. Problem Transformation of the multi-label task into single-label tasks using Binary Relevance, Classifier Chains, or Label Powerset
2. Adapted Algorithms that directly perform multi-label classification rather than transforming the problem into single-label tasks
3. Ensemble Approaches that construct an ensemble of base multi-label classifiers
Referring to the deployment requirements, which prioritize speed, low memory usage, and fast retraining as new messages feed in over raw model performance, the binary relevance method has been chosen.

Classifier chains would be beneficial, as some labels are highly correlated, but they take a long time to train.

Label Powerset is not practical due to the high number of categories: it would greatly increase the imbalance ratio of the dataset, and some of the transformed classes would have only one observation.

Even though adapted algorithms and ensemble approaches would most probably improve model performance, they are not an option due to high memory usage and long training times.
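In scikit-learn, binary relevance amounts to fitting one independent binary classifier per label. A minimal sketch of such a setup (the estimator choice mirrors the model selection below; `X` would be the raw messages and `Y` the 36-column indicator matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Binary relevance: MultiOutputClassifier clones the base estimator
# and fits one independent LinearSVC per category
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LinearSVC(class_weight="balanced"))),
])

# pipeline.fit(X, Y)
# pipeline.predict(["We need water and medical supplies"])
```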
As it is equally important to correctly identify true danger to human life (recall) and to avoid flagging false danger (precision), in order to save the limited resources that are usually scarce during a disaster, the F1 score metric was selected to measure model performance.
The objective is to improve the model's discriminative power across the 36 categories. The same weight needs to be put on each category during learning, ignoring the highly skewed distribution of categories in the dataset. This can be done by using the `f1_macro` score and selecting ML algorithms that are more robust to imbalanced datasets.
The reference model is a `DummyClassifier` generating predictions uniformly at random (`strategy='uniform'`) and putting the same weight on each class.
Two simple classifiers comply with the requirements of the lightweight web app and can effectively deal with imbalanced datasets: `LinearSVC` with `class_weight='balanced'` and `ComplementNB`.
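One way such a comparison might be run (a sketch assuming the binary relevance pipeline above; the project's exact cross-validation setup may differ):

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

candidates = {
    "Reference": DummyClassifier(strategy="uniform"),
    "ComplementNB": ComplementNB(),
    "LinearSVC": LinearSVC(class_weight="balanced"),
}

for name, estimator in candidates.items():
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultiOutputClassifier(estimator)),
    ])
    # f1_macro weights every category equally, as required above;
    # X and Y are loaded from the SQLite database created by the ETL step
    scores = cross_validate(model, X, Y, scoring="f1_macro",
                            cv=3, return_train_score=True)
    print(name, scores["train_score"].mean(), scores["test_score"].mean())
```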
Model | mean_train_f1_macro | mean_test_f1_macro |
---|---|---|
LinearSVC | 0.898 | 0.434 |
ComplementNB | 0.420 | 0.292 |
Reference | 0.122 | 0.120 |
The Linear Support Vector Classifier performed best compared to Naive Bayes and the reference model. It was selected for further hyperparameter tuning to improve performance and reduce the severe overfitting.
Quantitative features such as `text_length`, `genre`, and `starting_verb`, as well as the `ngram_range` parameter, have been tested in a grid search to improve performance. Using bi-grams has proven to be beneficial for performance, but slightly increased overfitting. The quantitative features turned out not to be important at all.
Note: orange = train, blue = test
Feature selection based on L1 regularization was applied to reduce the number of features from ~26,000 to ~4,000. The penalty parameter `C` was tuned as well, together with `ngram_range`.
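A sketch of how per-label L1-based feature selection might be wired into the binary relevance pipeline; the `SelectFromModel` wrapper and the parameter values shown are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Each label gets its own selector + classifier: the L1 penalty drives most
# coefficients to zero, so only the surviving features reach the classifier
per_label = Pipeline([
    ("select", SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.1))),
    ("clf", LinearSVC(class_weight="balanced", C=0.1)),
])

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
    ("clf", MultiOutputClassifier(per_label)),
])
```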
There is an interaction between feature selection and `ngram_range`: with no feature selection (see *Improve performance* above), using bi-grams improved performance, but with feature selection, bi-grams deteriorated performance.
Reducing the penalty parameter `C` has a huge impact on reducing overfitting. For the current dataset, the best model uses feature selection with L1 regularization, uni-grams, and penalty parameter `C=0.1`.
If the training dataset is updated, it is possible to retrain the model using the ML pipeline, which runs a grid search over `ngram_range = [(1, 1), (1, 2)]` and `C = [0.01, 0.1, 1]`.
You need to provide the path to the database and the path and name of the model file. The script serializes the fitted `GridSearchCV` object to a pickle file:
cd path/to/cloned/repository/models
python train_classifier.py ../data/disaster_response.db model.pickle
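The pickled model can later be loaded for prediction; `GridSearchCV` delegates `predict()` to the refitted best estimator. A small sketch (the sample message is made up):

```python
import pickle

with open("model.pickle", "rb") as f:
    model = pickle.load(f)  # fitted GridSearchCV object

labels = model.predict(["We are trapped under rubble and need rescue"])
print(labels.shape)  # (1, 36) row of binary category indicators
```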
Thousands of messages need to be analyzed each day during a disaster, so the service needs to be fast and scalable. It is therefore favorable to use simple ML models that do not use a lot of memory, offer fast prediction times, and are not computationally expensive to train.
On the other hand, the performance of such models can be worse; this can be improved by using a more frequent retraining cycle. This would require functionality enabling users to correct wrong classification results to further improve model performance.
To engage users and let them see the model in action, a Flask web application was developed and deployed using Elastic Beanstalk (AWS). You can try it here. It is possible to submit a message to classify, and the app shows statistics of the training dataset, such as the label distribution and word frequencies as a word cloud.
The entered message is labeled with one or more of the 36 categories. The app also explains the reasons behind each predicted label: it shows the top 10 words acting as detractors and supporters for each label, so the user can see how the model recognizes each class. The app also parses the message and explains how each word contributes to a particular label selection.
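These explanations can be read directly off the linear model: for each label, the words with the most positive coefficients act as supporters and those with the most negative coefficients as detractors. A sketch, assuming the simple binary relevance pipeline from above (without per-label feature selection, so coefficients align with the TF-IDF vocabulary):

```python
import numpy as np

def top_words(pipeline, label_index, k=10):
    """Return the k strongest supporter and detractor words for one label."""
    vocab = np.array(pipeline.named_steps["tfidf"].get_feature_names_out())
    # Under binary relevance there is one fitted LinearSVC per category
    coef = pipeline.named_steps["clf"].estimators_[label_index].coef_.ravel()
    order = np.argsort(coef)
    supporters = list(zip(vocab[order[-k:][::-1]], coef[order[-k:][::-1]]))
    detractors = list(zip(vocab[order[:k]], coef[order[:k]]))
    return supporters, detractors
```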
To integrate the model into a production-ready workflow tool, three RESTful APIs should be developed in Flask:

- a `predict` function to return JSON including the 36 categories and their binary indicators
- an `explain_label` function to return JSON with the top k words and their model coefficients
- an `explain_message` function to parse the entered message and return the message words and their model coefficients
This is out of the scope of the project.
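For illustration only, a hypothetical sketch of what the first endpoint could look like (the route name, payload shape, and category loading are assumptions):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pickle", "rb") as f:
    model = pickle.load(f)

# Placeholder: in practice the 36 category names come from the database
CATEGORIES = ["related", "request", "offer"]  # ... and 33 more

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"message": "We need water and shelter"}
    message = request.get_json()["message"]
    labels = model.predict([message])[0]
    return jsonify(dict(zip(CATEGORIES, labels.astype(int).tolist())))
```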
- Run the following commands in the project's root directory to set up your database and model.

  - To run the ETL pipeline that cleans the data and stores it in the database:

    ```
    cd data
    python process_data.py disaster_messages.csv disaster_categories.csv disaster_response.db replace
    ```

  - To run the ML pipeline that trains the classifier and saves it:

    ```
    cd ../models
    python train_classifier.py disaster_response.db model.pickle
    ```

- Run the following command in the app's directory to run your web app.

  ```
  cd ../app
  python run.py
  ```

- Go to http://0.0.0.0:3001/
Credit must go to Figure Eight Technologies for the data, @udacity for the starter code, and Shubham Jain (@shubhamjn1) for a nice introduction to multi-label classification.