U.S. Air Pollution - Data Analysis in Apache Spark
The goal of the project is to analyze, cluster, and classify U.S. air pollution measurements recorded from 2000 to 2016 in a distributed, parallel environment.
The data analysis and all classification and regression tasks were performed with Apache Spark on a dataset of 1.7 million records.
Regression was applied to the data to extract and construct an engineered dataset, which was then used for classification with Random Forest and for clustering with K-Means.
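
As a rough illustration of what the K-Means step computes, here is a minimal sketch of Lloyd's algorithm in plain Python, without Spark. The function name `kmeans`, the toy two-dimensional points, and the starting centroids are all assumptions made up for this example. The project itself ran clustering in Apache Spark, and its actual features and parameters are not shown here.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, centroids, max_iter=100):
    """Illustrative Lloyd's algorithm: alternate assignment and update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(old)
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

On a dataset of this size, the same assignment and update steps would be distributed across the cluster by Spark rather than run in a single Python process.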