gaetanoantonicchio/Distributed-Data-Analysis-and-Mining

Project in Distributed Data Analysis & Mining

U.S. Air Pollution - Data Analysis in Apache Spark

The goal of the project is to analyze, cluster, and classify survey records of U.S. air pollution levels collected from 2000 to 2016, working in a distributed, parallel environment.

The data analysis and all classification and regression tasks were performed with Apache Spark on a dataset of about 1.7 million records.

Regression was first applied to the data to build an engineered dataset, which was then used for classification with Random Forest and for clustering with K-Means.