Project in Distributed Data Analysis & Mining

U.S. Air Pollution - Data Analysis in Apache Spark

The goal of the project is to analyze, cluster, and classify surveys of U.S. air pollution levels recorded from 2000 to 2016, working in a distributed, parallel environment.

The data analysis and all classification and regression tasks were performed with Apache Spark on a dataset of 1.7 million records.

Regression was applied to the data to construct an engineered dataset, which was then used for classification (Random Forest) and clustering (K-Means).

Repository contents:

- Report
- DDAM-Project_USAPollution_Antonicchio_Lusito_Palla_Sustrico.ipynb
- README.md

gaetanoantonicchio/Distributed-Data-Analysis-and-Mining

Folders and files

Latest commit

History

Repository files navigation

Project in Distributed Data Analysis & Mining

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages