Design and implementation of several simple MapReduce jobs used to analyze a COVID-19 dataset created by Our World In Data.
Design and implementation in MapReduce of:
- a job returning the ranking of continents in decreasing order of total_cases
- a job returning, for each location, the average number of new_tests per day
- a job returning the 5 days in which the total number of patients in intensive care units (ICUs) and hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total number.
All the MapReduce jobs are executed with inputs of different sizes and with different Hadoop configurations (local, pseudo-distributed without YARN and pseudo-distributed with YARN) in order to compare their execution times.
I'm a computer science Master's Degree student and this is one of my university projects. See my other projects here on GitHub!
The dataset is a collection of COVID-19 data maintained and updated daily by Our World In Data; it contains data on confirmed cases, deaths, hospitalizations, and other variables of potential interest.
The dataset, available in different formats, can be found here, while the data dictionary that explains the meaning of all the dataset's columns is available here.
Implementation of a MapReduce job that returns the ranking of continents in decreasing order of total_cases.
Before the Map phase there is a splitting phase in which the raw lines of the dataset are divided into single blocks of data, using the "," character to separate the different values within a row and the newline character to separate the different rows of the dataset.
Input: all the values contained inside each row of the dataset and separated by ","
Output: the continent of each nation and its number of total cases, taking only the rows for the last day of the period covered by the input (e.g. if the input contains data only for the March–April period, only the rows for 30 April are used), because total_cases in the dataset is a cumulative value that already includes the cases of all previous days
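A minimal sketch of how this map function could be written in Java is shown below. The column indices and the `targetDate` job parameter (used to select the last day of the period) are assumptions for illustration, not taken from the actual project code:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a possible mapper for job 1 (hypothetical column indices and "targetDate" parameter).
public class ContinentCasesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Assumed positions of the columns in the OWID CSV; adapt them to the real header.
    private static final int CONTINENT = 1;
    private static final int DATE = 3;
    private static final int TOTAL_CASES = 4;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1); // naive CSV split
        // Hypothetical job parameter holding the last day of the period (e.g. "2020-04-30").
        String targetDate = context.getConfiguration().get("targetDate");
        if (fields.length <= TOTAL_CASES || !fields[DATE].equals(targetDate)) {
            return; // skip the header, malformed rows and every day except the last one
        }
        if (fields[CONTINENT].isEmpty()) {
            return; // aggregate rows such as "World" may have an empty continent
        }
        try {
            long totalCases = (long) Double.parseDouble(fields[TOTAL_CASES]);
            context.write(new Text(fields[CONTINENT]), new LongWritable(totalCases));
        } catch (NumberFormatException e) {
            // row without a total_cases value: ignore it
        }
    }
}
```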
The MapReduce framework processes the output of the map function before sending it to the reduce function: the map outputs are grouped by key.
Input: output of the map function
Output: each continent appears with the list of all of its total_cases values
Input: output of the shuffling phase
Output: total number of cases for each continent
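A matching reducer sketch, which simply sums the per-nation total_cases values grouped under each continent, could look like this:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a possible reducer for job 1: sums the per-nation total_cases values
// that the shuffle phase grouped under each continent.
public class ContinentCasesReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text continent, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable cases : values) {
            sum += cases.get();
        }
        context.write(continent, new LongWritable(sum));
    }
}
```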
The final result is composed of key-value pairs: KEY = name of the continent, VALUE = number of total cases for that continent.
At the end these key-value pairs are sorted so that the results are shown in decreasing order of total cases.
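Since MapReduce sorts the reducer output by key (the continent name) and not by value, this final ordering needs an extra step. One possible way, shown here only as a sketch of what that step might look like (not necessarily how the project performs it), is to read back the small reducer output (one line per continent) and sort it in memory:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical post-processing step: reads the reducer output (one line per continent)
// and prints it in decreasing order of total_cases.
public class SortContinents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path(args[0]); // e.g. the part-r-00000 file of the job output
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(part)))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line.split("\t")); // default reducer output is <continent>\t<total_cases>
            }
        }
        rows.sort(Comparator.comparingLong((String[] r) -> Long.parseLong(r[1])).reversed());
        for (String[] r : rows) {
            System.out.println(r[0] + "\t" + r[1]);
        }
    }
}
```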
The execution of this job has been carried out in all three Hadoop configurations (local, pseudo-distributed without YARN and pseudo-distributed with YARN).
Three inputs of different sizes have been used; the results change depending on the input:
Implementation of a MapReduce job that returns, for each location, the average number of new_tests per day.
Input: all the values contained inside each row of the dataset and separated by ","
Output: location and number of new_tests per day
Input: output of the map function
Output: each location appears with the list of all of its new_tests values
Input: output of the shuffling phase
Output: the total number of new_tests for each location, divided by the number n of days
The final result is composed of key-value pairs: KEY = name of the location, VALUE = average number of new_tests for that location.
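A possible sketch of the map and reduce functions for this job is shown below, following the same structure as job 1. The column indices are assumptions to be adapted to the real CSV header, and the number n of days is taken here as the number of days for which the location actually reports a new_tests value:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of job 2: average new_tests per day for each location (hypothetical column indices).
public class AverageNewTests {

    public static class NewTestsMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final int LOCATION = 2;   // assumed index of "location"
        private static final int NEW_TESTS = 25; // assumed index of "new_tests"

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", -1); // naive CSV split
            if (fields.length <= NEW_TESTS) {
                return; // header or malformed row
            }
            try {
                long newTests = (long) Double.parseDouble(fields[NEW_TESTS]);
                // one (location, new_tests) pair per day that reports a value
                context.write(new Text(fields[LOCATION]), new LongWritable(newTests));
            } catch (NumberFormatException e) {
                // day without a reported new_tests value: skip it
            }
        }
    }

    public static class AverageReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text location, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long days = 0; // number n of days with a reported value
            for (LongWritable v : values) {
                sum += v.get();
                days++;
            }
            if (days > 0) {
                context.write(location, new DoubleWritable((double) sum / days));
            }
        }
    }
}
```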
The execution of this job has been carried out in all three Hadoop configurations (local, pseudo-distributed without YARN and pseudo-distributed with YARN).
Three inputs of different sizes have been used; the results change depending on the input:
Implementation of a MapReduce job that returns the ranking of the 5 days in which the total number of patients in intensive care units (ICUs) and hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total number.
Input: all the values contained inside each row of the dataset and separated by ","
Output: date and number of icu_patients + hosp_patients per day
Input: output of the map function
Output: each date appears with the list of all of its icu_patients + hosp_patients values
Input: output of the shuffling phase
Output: sum of icu_patients + hosp_patients for each date
The final result is composed of key-value pairs: KEY = date, VALUE = number of icu_patients + hosp_patients for that date, and only the 5 dates with the highest totals are returned, in decreasing order.
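A possible sketch of the map and reduce functions for this job follows. The column indices are assumptions, and the top-5 selection shown in cleanup() assumes the job is configured with a single reduce task (dates with identical totals are collapsed in this simplified version):

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of job 3: the 5 days with the highest icu_patients + hosp_patients totals.
public class TopHospitalizations {

    public static class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final int DATE = 3;           // assumed index of "date"
        private static final int ICU_PATIENTS = 9;   // assumed index of "icu_patients"
        private static final int HOSP_PATIENTS = 11; // assumed index of "hosp_patients"

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", -1); // naive CSV split
            if (fields.length <= Math.max(ICU_PATIENTS, HOSP_PATIENTS)) {
                return; // header or malformed row
            }
            try {
                long icu = (long) Double.parseDouble(fields[ICU_PATIENTS]);
                long hosp = (long) Double.parseDouble(fields[HOSP_PATIENTS]);
                context.write(new Text(fields[DATE]), new LongWritable(icu + hosp));
            } catch (NumberFormatException e) {
                // row without ICU/hospital figures: skip it
            }
        }
    }

    public static class Top5Reducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        // totals seen so far, keyed by total so they stay sorted; only the 5 largest are kept
        private final TreeMap<Long, String> top = new TreeMap<>();

        @Override
        protected void reduce(Text date, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get(); // icu_patients + hosp_patients over all locations for this date
            }
            top.put(sum, date.toString());
            if (top.size() > 5) {
                top.remove(top.firstKey()); // drop the smallest total
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // emit the 5 dates in decreasing order of their total
            for (Map.Entry<Long, String> e : top.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
            }
        }
    }
}
```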
The execution of this job has been carried out in all three Hadoop configurations (local, pseudo-distributed without YARN and pseudo-distributed with YARN).
Three inputs of different sizes have been used; the results change depending on the input:
For every job, tabular and graphical comparisons of the execution times in the local and pseudo-distributed configurations (with and without YARN) have been produced. All these execution times have also been measured for the different input sizes used.
- As expected, the fastest Hadoop configuration is the local (or standalone) mode, because in this configuration Hadoop runs in a single JVM and uses the local filesystem instead of the Hadoop Distributed File System (HDFS). Furthermore, in this configuration the job is run with one mapper and one reducer, which guarantees good speed (no longer true for very large inputs) and is why the local configuration is used mainly for debugging and testing the code.
- Pseudo-distributed mode, on the other hand, is used to simulate the behavior of a distributed computation, but it does so by running the Hadoop daemons in different JVM instances on a single machine. This configuration uses HDFS instead of the local filesystem, and the job can be run with multiple mappers and multiple reducers, which increases the execution times.
- These times increase further when the job is run in pseudo-distributed mode with YARN, because the map and reduce times add up with the overhead of resource management.
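As an illustration of why the same code can be measured in all three configurations, a hypothetical driver for the first job is sketched below (the class names follow the earlier sketches and are not taken from the project): the execution mode is decided by the Hadoop configuration files (fs.defaultFS, mapreduce.framework.name), not by the job code itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver for job 1: the same jar runs locally or on HDFS/YARN
// depending only on the Hadoop configuration in use.
public class ContinentCasesDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "continent total_cases");
        job.setJarByClass(ContinentCasesDriver.class);
        job.setMapperClass(ContinentCasesMapper.class);
        job.setReducerClass(ContinentCasesReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ContinentCasesDriver(), args));
    }
}
```

With ToolRunner, generic options such as `-conf` or `-D` can be passed on the command line (for example the hypothetical `-D targetDate=2020-04-30` parameter used by the mapper sketch above), so the same jar could be launched with something like `hadoop jar covid-jobs.jar ContinentCasesDriver -D targetDate=2020-04-30 <input> <output>` in any of the three configurations (the jar name here is also hypothetical).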
For any support, error corrections, etc. please email me at [email protected]