This repository introduces PySpark by example and provides solutions to several machine learning consulting projects. A Spark Streaming project is presented at the end.
Note: this repository uses Spark 3.0.0.
Introduction to PySpark RDD and DataFrame
How to set up PySpark on Amazon AWS EC2
Introduction to PySpark MLlib (machine learning library)
In this project, hyperparameters are tuned using CrossValidator, and categorical features are handled.
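To make the tuning step concrete, here is a minimal plain-Python sketch of what a cross-validated grid search iterates over: every parameter combination is evaluated on every fold. The parameter names (`maxDepth`, `numTrees`) and their values are hypothetical and not taken from the project, and the fold assignment is deterministic for readability, whereas CrossValidator splits the rows randomly.

```python
from itertools import product


def param_grid(options):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]


def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    for fold in range(k):
        # Row i goes to the validation set of fold (i mod k); illustrative only.
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation


# Hypothetical grid: 2 x 3 = 6 candidate settings.
grid = param_grid({"maxDepth": [3, 5], "numTrees": [20, 50, 100]})
folds = list(k_fold_indices(n_rows=10, k=3))

# Each candidate is trained once per fold, so the search cost grows as grid size times k.
n_fits = len(grid) * len(folds)
```

Because the number of fits is the product of the grid size and the number of folds, keeping the grid small matters on large datasets.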
In this project, the imbalanced data issue is addressed using weightCol in LogisticRegression. A datetime feature is also processed, and StandardScaler is used to standardize each feature to zero mean and unit standard deviation.
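The two ideas above can be sketched in plain Python. The weighting formula below (inverse class frequency, so that every class carries the same total weight) is one common choice and an assumption here, not necessarily the scheme used in the project; in Spark the resulting values would be stored in the column named by weightCol. The standardization uses the sample standard deviation (dividing by n - 1).

```python
from collections import Counter
from math import sqrt


def balancing_weights(labels):
    """Return one weight per row so that each class has the same total weight."""
    counts = Counter(labels)
    n_rows = len(labels)
    n_classes = len(counts)
    # weight = n_rows / (n_classes * class_count); rare classes get larger weights
    return [n_rows / (n_classes * counts[y]) for y in labels]


def standardize(values):
    """Shift to zero mean and scale to unit (sample) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # toy data: 8 negatives, 2 positives
weights = balancing_weights(labels)       # 0.625 for class 0, 2.5 for class 1
scaled = standardize([10.0, 12.0, 14.0, 16.0, 18.0])
```

Note that Spark's StandardScaler only centers the data when withMean is set to True, so zero-mean output requires enabling that option explicitly.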
This project focuses on computing feature importance. The imbalanced data issue is handled using boosting techniques, which are often a reasonable choice for class-imbalanced data.
For better results, one can combine synthetic sampling methods such as SMOTE and MSMOTE with advanced boosting methods such as gradient boosting and XGBoost.
This project builds recommendations on the MovieLens dataset using a collaborative filtering approach.
In this project, an SMS spam detector is built using Spark NLP tools.
An introduction to Spark NLP tools, along with some examples, is presented here.
The pipeline consists of RegexTokenizer, StopWordsRemover, TF-IDF based feature extraction, and a Naive Bayes classifier.
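The feature-extraction stages of this pipeline can be illustrated without Spark. The sketch below is a plain-Python approximation, not the project's code: the split pattern, the tiny stop-word list, and the sample messages are all assumptions, and the Naive Bayes step is omitted. The IDF uses the smoothed form log((n_docs + 1) / (doc_freq + 1)), which matches the formula documented for Spark MLlib.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "to", "you", "is", "at", "now"}  # tiny illustrative list


def tokenize(text):
    """Lowercase and split on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: tf * idf} dict per tokenized document.

    tf is the raw term count; idf = log((n_docs + 1) / (doc_freq + 1)).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


# Hypothetical messages, not taken from the dataset.
messages = [
    "WIN a FREE prize now! Call to claim the prize",
    "Are you free for lunch at noon?",
]
docs = [remove_stop_words(tokenize(m)) for m in messages]
vectors = tf_idf(docs)
```

A term that occurs in every message, such as "free" here, gets an IDF of zero and so contributes nothing to the feature vector, while a repeated, message-specific term such as "prize" is weighted up.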
This project creates an application that plots out the popularity of tags associated with incoming tweets streamed live from Twitter.
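The core of such an application is a running count of hashtags that is updated with each incoming batch of tweets. The following plain-Python sketch shows that aggregation step only; the batches are hypothetical stand-ins for the live stream, and the Twitter connection, Spark Streaming code, and plotting are left out.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def update_tag_counts(counts, tweets):
    """Add the hashtags found in one batch of tweet texts to the running counts."""
    for text in tweets:
        # Lowercase tags so that "#Spark" and "#spark" are counted together.
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Two hypothetical micro-batches standing in for a live stream.
batches = [
    ["Learning #Spark today", "#spark and #Python"],
    ["More #python tips", "#Spark streaming demo"],
]
counts = Counter()
for batch in batches:
    update_tag_counts(counts, batch)

top_tags = counts.most_common(2)  # the tags a popularity plot would show first
```

In a streaming job the same update would run once per micro-batch, with the top tags re-plotted after each update.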