ApacheSpark

This repository introduces PySpark by example and provides solutions to several machine learning consulting projects. A Spark Streaming project is presented at the end.

NB: This repository uses Spark version 3.0.0.

List of PySpark materials:

Introduction to PySpark RDD and DataFrame

Details of PySpark DataFrame

How to set up PySpark on Amazon AWS EC2

Introduction to PySpark MLlib (machine learning library)

Joining DataFrames in PySpark

Machine learning projects using the PySpark ML library:

Linear regression consulting project:

In this project, hyperparameters are tuned using CrossValidator, and categorical features are encoded before training.
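The sketch below shows one way categorical encoding and CrossValidator tuning could be combined in PySpark 3.0. The helper name build_cross_validator, the grid values, and the generated column suffixes are assumptions made for illustration; they are not taken from the project notebook.

```python
# Hypothetical grid values, chosen only for illustration.
REG_PARAMS = [0.01, 0.1, 1.0]
ELASTIC_NET_PARAMS = [0.0, 0.5, 1.0]


def build_cross_validator(categorical_cols, numeric_cols, label_col, num_folds=3):
    """Return an unfitted CrossValidator wrapping an encoding + regression pipeline."""
    # Imported here so this module can be loaded where Spark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Map each string category to a numeric index, then one-hot encode the indices.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
        for c in categorical_cols
    ]
    encoder = OneHotEncoder(
        inputCols=[c + "_idx" for c in categorical_cols],
        outputCols=[c + "_vec" for c in categorical_cols],
    )
    # Combine numeric columns and encoded categories into one feature vector.
    assembler = VectorAssembler(
        inputCols=list(numeric_cols) + [c + "_vec" for c in categorical_cols],
        outputCol="features",
    )
    lr = LinearRegression(featuresCol="features", labelCol=label_col)
    pipeline = Pipeline(stages=indexers + [encoder, assembler, lr])

    # Every combination of the two hyperparameters is evaluated on each fold.
    grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, REG_PARAMS)
        .addGrid(lr.elasticNetParam, ELASTIC_NET_PARAMS)
        .build()
    )
    evaluator = RegressionEvaluator(labelCol=label_col, metricName="rmse")
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
    )
```

Calling fit on the returned object with a training DataFrame yields a CrossValidatorModel whose bestModel attribute holds the pipeline refitted with the best-scoring parameter combination. Placing the encoders inside the pipeline means they are refitted on each training fold rather than on the full data.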

Logistic regression consulting project:

In this project, the imbalanced data issue is addressed using weightCol in LogisticRegression, and a datetime feature is processed. StandardScaler is used to normalize each feature to zero mean and unit standard deviation.
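As a rough illustration of the weightCol idea, the sketch below derives one weight per class and attaches it to each row. The "balanced" weighting formula, the helper names, and the classWeight column name are assumptions; the notebook may compute its weights differently.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def with_weight_column(df, label_col, weights, weight_col="classWeight"):
    """Add a per-row weight column to a Spark DataFrame from a label -> weight mapping."""
    # Imported here so balanced_class_weights stays usable without Spark installed.
    from pyspark.sql import functions as F

    expr = None
    for label, weight in weights.items():
        condition = F.col(label_col) == label
        # Chain one WHEN branch per class; rows with unlisted labels get NULL.
        expr = F.when(condition, weight) if expr is None else expr.when(condition, weight)
    return df.withColumn(weight_col, expr)


# Three majority-class rows and one minority-class row.
weights = balanced_class_weights([0, 0, 0, 1])
```

The weighted DataFrame can then be passed to LogisticRegression(featuresCol=..., labelCol=..., weightCol="classWeight"), so that errors on the rarer class cost more during training. On a large dataset the class counts would more likely come from df.groupBy(label_col).count() than from a Python list. Note that StandardScaler only centers features when withMean=True is set, since its default scales to unit standard deviation without centering.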

Tree methods consulting project (Decision tree, Random Forest, and GBT Classifiers):

This project focuses on computing feature importance. The imbalanced data issue is handled with boosting techniques, which often cope well with class-imbalanced data.

For better results, one can combine synthetic sampling methods such as SMOTE and MSMOTE with advanced boosting methods such as gradient boosting and XGBoost.
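For the feature importance computation mentioned above, fitted Spark tree models expose a featureImportances vector. The helper below is a small sketch of how those scores could be paired with feature names and ranked; the function name and the example feature names are hypothetical.

```python
def rank_importances(feature_names, importances):
    """Pair feature names with importance scores, sorted from most to least important."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    pairs = zip(feature_names, (float(value) for value in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# With a fitted Spark tree model this might be called as (names are hypothetical):
#   rank_importances(assembler.getInputCols(), model.featureImportances.toArray())
# which assumes every assembled input column occupies exactly one vector slot.
ranked = rank_importances(["age", "balance", "tenure"], [0.1, 0.7, 0.2])
```

If one-hot encoded columns are assembled into the feature vector, each input column spans several slots, so the names would need to be expanded before ranking.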

Recommender system project:

This project provides recommendations on the MovieLens dataset using a collaborative filtering approach.
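Collaborative filtering in Spark ML is available through the ALS estimator. The following is a minimal sketch, assuming a ratings DataFrame with userId, movieId, and rating columns; the parameter values, the 80/20 split, and the helper name are illustrative choices rather than the notebook's settings.

```python
# Column names are assumed to match the ratings file; adjust them to the actual data.
ALS_PARAMS = {
    "userCol": "userId",
    "itemCol": "movieId",
    "ratingCol": "rating",
    "maxIter": 5,
    "regParam": 0.01,
    "coldStartStrategy": "drop",  # drop NaN predictions for unseen users or items
}


def train_and_evaluate(ratings, seed=42):
    """Fit ALS on 80% of the ratings and return (model, RMSE on the held-out 20%)."""
    # Imported here so this module can be loaded where Spark is not installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    train, test = ratings.randomSplit([0.8, 0.2], seed=seed)
    model = ALS(**ALS_PARAMS).fit(train)
    evaluator = RegressionEvaluator(
        metricName="rmse",
        labelCol=ALS_PARAMS["ratingCol"],
        predictionCol="prediction",
    )
    return model, evaluator.evaluate(model.transform(test))
```

Setting coldStartStrategy to "drop" keeps NaN predictions for users or movies absent from the training split from turning the RMSE into NaN. Top-N recommendations per user can be obtained from the fitted model with recommendForAllUsers.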

Natural Language Processing (NLP) project:

In this project, an SMS spam detector is built using Spark's NLP tools.

An introduction to these tools, along with some examples, is presented in the NLP_Tools.ipynb notebook.

The pipeline consists of: RegexTokenizer, StopWordsRemover, TF-IDF based feature extraction, and a Naive Bayes classifier.
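The stages listed above could be chained roughly as follows. This is a sketch under assumptions: the column names are hypothetical, CountVectorizer followed by IDF is only one way to obtain TF-IDF features (HashingTF is another), and the label column is expected to be numeric already, for example after a StringIndexer step.

```python
def build_spam_pipeline(text_col="text", label_col="label"):
    """Return an unfitted Pipeline: tokenize, drop stop words, TF-IDF, Naive Bayes."""
    # Imported here so this module can be loaded where Spark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import IDF, CountVectorizer, RegexTokenizer, StopWordsRemover

    # Split on runs of non-word characters; RegexTokenizer lower-cases by default.
    tokenizer = RegexTokenizer(inputCol=text_col, outputCol="tokens", pattern="\\W+")
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    # Term frequencies first, then inverse-document-frequency reweighting.
    term_freq = CountVectorizer(inputCol="filtered", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    classifier = NaiveBayes(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[tokenizer, remover, term_freq, idf, classifier])
```

Fitting the pipeline on a training split and calling transform on a test split adds a prediction column that can be scored with MulticlassClassificationEvaluator.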

Spark Streaming:

COVID-19 Twitter Analysis using Spark Streaming:

This project creates an application that plots out the popularity of tags associated with incoming tweets streamed live from Twitter.
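The tag-counting step of such an application might look like the sketch below. It assumes the tweets arrive as a DStream of text lines; the helper names and the window and slide durations are hypothetical and not taken from the notebook.

```python
def extract_hashtags(line):
    """Return the lower-cased hashtags found in one line of tweet text."""
    return [word.lower() for word in line.split() if word.startswith("#") and len(word) > 1]


def hashtag_counts(lines, window_seconds=20, slide_seconds=10):
    """Count hashtags over a sliding window on a DStream of tweet text."""
    return (
        lines.flatMap(extract_hashtags)
        .map(lambda tag: (tag, 1))
        # No inverse function is passed, so each window is recomputed from scratch
        # and no checkpoint directory is required.
        .reduceByKeyAndWindow(lambda a, b: a + b, None, window_seconds, slide_seconds)
    )


# A bare "#" is ignored; tags are normalized to lower case before counting.
tags = extract_hashtags("Stay safe #COVID19 everyone # #StayHome")
```

The lines stream could come, for example, from StreamingContext.socketTextStream(host, port) fed by a separate script that forwards tweets over a socket. Both durations must be multiples of the streaming batch interval.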


