Final Result
https://github.com/sangje-lee/non-profit-org-employment
- Import the CSV file into one large dataset.
- Filter out some columns/attributes and remove any null values found.
- Divide into separate datasets based on the Indicators (there should be seven datasets).
- Divide into four datasets based on the year, each containing three years' worth of data (2010-2012, 2013-2015, 2016-2018, 2019-2021).
- Divide by group of characteristics into four datasets.
- Divide based on GEO (provinces).
- df - Whole dataset without any filtering or division
- df_sorted - Whole dataset with filtering applied, such as removing unimportant attributes.
- df_sorted_na - Whole dataset with the null values removed.
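
The three stages above (df, df_sorted, df_sorted_na) could look roughly like the sketch below. It uses a tiny in-memory sample instead of 36100651.csv, and the column names (REF_DATE, GEO, Indicators, Characteristics, VALUE, SYMBOL) and the choice of columns to keep are assumptions for illustration, not taken from the notebook.

```python
import io

import pandas as pd

# Tiny stand-in for 36100651.csv; the column names are assumed.
SAMPLE_CSV = """REF_DATE,GEO,Indicators,Characteristics,VALUE,SYMBOL
2010,Canada,Number of jobs,15 to 24 years,100,
2010,Ontario,Number of jobs,15 to 24 years,,
2011,Ontario,Hours Worked,Female employees,250,
"""

# df: the whole dataset, with no filtering or division.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# df_sorted: keep only the attributes of interest (hypothetical selection).
keep = ["REF_DATE", "GEO", "Indicators", "Characteristics", "VALUE"]
df_sorted = df[keep]

# df_sorted_na: drop every row that still contains a null value.
df_sorted_na = df_sorted.dropna().reset_index(drop=True)

print(len(df), "rows loaded,", len(df_sorted_na), "rows after removing nulls")
```

Dropping the unused columns first matters here: a column that is empty on most rows would otherwise cause dropna() to discard nearly the whole dataset.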
Division into new datasets based on Indicator
- df_AvgAnnHrsWrk - Average annual hours worked
- df_AvgAnnWages - Average annual wages and salaries
- df_AvgHrsWages - Average hourly wage
- df_AvgWeekHrsWrked - Average weekly hours worked
- df_Hrs_Wrked - Hours Worked
- df_NumOfJob - Number of jobs
- df_WagesAndSalaries - Wages and Salaries
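
A minimal sketch of how the per-indicator datasets might be produced, assuming the indicator name is stored in a column called Indicators (an assumption) and using toy rows in place of df_sorted_na:

```python
import pandas as pd

# Toy stand-in for df_sorted_na; column names are assumed.
df_sorted_na = pd.DataFrame({
    "REF_DATE": [2010, 2010, 2011],
    "Indicators": ["Number of jobs", "Hours Worked", "Number of jobs"],
    "VALUE": [100.0, 250.0, 110.0],
})

# One DataFrame per distinct indicator value.
by_indicator = {
    name: group.reset_index(drop=True)
    for name, group in df_sorted_na.groupby("Indicators")
}

df_NumOfJob = by_indicator["Number of jobs"]
df_Hrs_Wrked = by_indicator["Hours Worked"]

print(sorted(by_indicator))
```

With the full dataset the dictionary would hold seven entries, one for each indicator listed above.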
Division into new datasets based on the GEO/year
- df_AvgAnnHrsWrk_2010 - Average annual hours worked in 2010
- df_AvgAnnHrsWrk_2013 - Average annual hours worked in 2013
- df_AvgAnnHrsWrk_2016 - Average annual hours worked in 2016
- df_AvgAnnHrsWrk_2019 - Average annual hours worked in 2019
- training_df_AvgAnnHrsWrk - Average annual hours worked for training set (2013-2018)
- testing_df_AvgAnnHrsWrk - Average annual hours worked for testing set (2019-2021)
- df_AvgAnnHrsWrk_below_2016 - Average annual hours worked below 2016
- df_AvgAnnHrsWrk_above_2017 - Average annual hours worked above 2017
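
The year-based split can be sketched as below. This assumes the year is stored as an integer in a column named REF_DATE (an assumption), and select_years is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

# Toy stand-in for df_AvgAnnHrsWrk: one row per year, 2010 through 2021.
df_AvgAnnHrsWrk = pd.DataFrame({
    "REF_DATE": list(range(2010, 2022)),
    "VALUE": [1500.0 + i for i in range(12)],
})


def select_years(frame, first, last):
    """Return the rows whose REF_DATE falls in [first, last], inclusive."""
    mask = frame["REF_DATE"].between(first, last)
    return frame.loc[mask].reset_index(drop=True)


# Four three-year blocks, keyed by their first year.
blocks = {
    start: select_years(df_AvgAnnHrsWrk, start, start + 2)
    for start in (2010, 2013, 2016, 2019)
}
df_AvgAnnHrsWrk_2010 = blocks[2010]
df_AvgAnnHrsWrk_2019 = blocks[2019]

# Training and testing sets, using the year ranges given in the list above.
training_df_AvgAnnHrsWrk = select_years(df_AvgAnnHrsWrk, 2013, 2018)
testing_df_AvgAnnHrsWrk = select_years(df_AvgAnnHrsWrk, 2019, 2021)
```

Keeping the testing set as the most recent block means the model is always evaluated on years it has not seen.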
Division into new datasets based on the group of Characteristics
- testing_df_WagesAndSalaries_ByAge - Wages and Salaries By Age For Testing set
- testing_df_WagesAndSalaries_ByGender - Wages and Salaries By Gender Group For Testing set
- testing_df_WagesAndSalaries_ByEducation - Wages and Salaries By Education level For Testing set
- testing_df_WagesAndSalaries_ByImmigrant - Wages and Salaries By Immigrant status For Testing set
- testing_df_WagesAndSalaries_ByIndigenous - Wages and Salaries By Indigenous status For Testing set
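
One way to split a dataset by characteristic group is to map each group to the labels that belong to it and filter with isin. The sketch below assumes a Characteristics column, and every label in GROUPS is a placeholder; the real labels would have to be read from the dataset.

```python
import pandas as pd

# Toy stand-in for testing_df_WagesAndSalaries; labels are placeholders.
testing_df_WagesAndSalaries = pd.DataFrame({
    "Characteristics": [
        "15 to 24 years", "25 to 34 years", "Female employees",
        "University degree", "Immigrant employees", "Indigenous employees",
    ],
    "VALUE": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Hypothetical mapping from group suffix to the labels in that group.
GROUPS = {
    "ByAge": ["15 to 24 years", "25 to 34 years"],
    "ByGender": ["Female employees", "Male employees"],
    "ByEducation": ["University degree"],
    "ByImmigrant": ["Immigrant employees"],
    "ByIndigenous": ["Indigenous employees"],
}

# One filtered DataFrame per characteristic group.
split = {
    suffix: testing_df_WagesAndSalaries[
        testing_df_WagesAndSalaries["Characteristics"].isin(labels)
    ].reset_index(drop=True)
    for suffix, labels in GROUPS.items()
}

testing_df_WagesAndSalaries_ByAge = split["ByAge"]
testing_df_WagesAndSalaries_ByGender = split["ByGender"]
```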
Division into new datasets based on the provinces
- testing_df_AvgAnnHrsWrk_ByAge_Provinces - Average annual hours worked for testing set by age group grouped by provinces
- testing_df_AvgAnnHrsWrk_ByGender_Provinces - Average annual hours worked for testing set by gender grouped by provinces
- testing_df_AvgAnnHrsWrk_ByEducation_Provinces - Average annual hours worked for testing set by education level grouped by provinces
- testing_df_AvgAnnHrsWrk_ByImmigrant_Provinces - Average annual hours worked for testing set by immigrant status grouped by provinces
- testing_df_AvgAnnHrsWrk_ByIndigenous_Provinces - Average annual hours worked for testing set by indigenous status grouped by provinces
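
The per-province grouping might be done along these lines. The sketch assumes the province name lives in a GEO column and that provinces are ordered alphabetically, so that a province's position in the list can serve as its numeric code; both points are assumptions, and the toy rows stand in for the real data.

```python
import pandas as pd

# Toy stand-in for testing_df_AvgAnnHrsWrk_ByAge; column names are assumed.
testing_df_AvgAnnHrsWrk_ByAge = pd.DataFrame({
    "GEO": ["Ontario", "Alberta", "Canada", "Ontario"],
    "VALUE": [1600.0, 1700.0, 1650.0, 1610.0],
})

# Sorted GEO names; the index of a name in this list acts as its code.
provinces = sorted(testing_df_AvgAnnHrsWrk_ByAge["GEO"].unique())

# One DataFrame per GEO value, in the same order as `provinces`.
df_provinces = [
    testing_df_AvgAnnHrsWrk_ByAge[
        testing_df_AvgAnnHrsWrk_ByAge["GEO"] == name
    ].reset_index(drop=True)
    for name in provinces
]

ontario_code = provinces.index("Ontario")
print(ontario_code, len(df_provinces[ontario_code]))
```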
ProvinceAnalysis(df_AvgAnnHrsWrk_201x_ByAge, pd, np, pp) - Creates a new ProvinceAnalysis object from the dataset and the other required arguments.
Variables:
- self.df - the imported dataset
- self.provinces = array of provinces
- self.indicators = array of indicators
- self.characteristics = array of characteristics
- self.year - array of years being analyzed
- self.dfProvinces - array of analyses, one per province, derived from the self.df dataset
- outputAnalysis(province_id) - Outputs a detailed analysis, including sum, mean, and skewness.
- outputAnalysisSimple(province_id) - Outputs a summarized version of the analysis.
- outputList(province_id, num) - Outputs the first "num" rows of the dataset.
- outputPandaProfiling(province_id) - Runs pandas profiling for a specific province in a specific year.
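
As an illustration of the statistics that outputAnalysis reports (sum, mean, and skewness), here is a minimal standalone sketch. summarize_values is a hypothetical helper and the VALUE column name is an assumption; this is not the class's actual implementation.

```python
import pandas as pd


def summarize_values(frame, column="VALUE"):
    """Return count, sum, mean, and skewness of one numeric column."""
    values = frame[column]
    return {
        "count": int(values.count()),
        "sum": float(values.sum()),
        "mean": float(values.mean()),
        # Sample skewness; pandas returns NaN with fewer than three values.
        "skewness": float(values.skew()),
    }


# Toy data with one large value, so the distribution is right-skewed.
sample = pd.DataFrame({"VALUE": [1.0, 2.0, 3.0, 10.0]})
summary = summarize_values(sample)
print(summary)
```

A positive skewness indicates a long tail of high values, which is worth checking before comparing province means directly.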
Province Code [0-13]:
['Alberta', 'BC', 'GEO = Canada', 'Manitoba', 'New Brunswick', 'Newfoundland', 'Northwest Territories', 'Nova Scotia', 'Nunavut', 'Ontario', 'PEI', 'Quebec', 'Saskatchewan', 'Yukon']
OutputProvinceAnalysis(df_AvgAnnHrsWrk_201x_ByAge_Provinces, ProCode, "201x", pd, np, pp) - Creates a new OutputProvinceAnalysis object from the dataset and the other required arguments.
- ProCode is the code of the province, taken from the list above.
- "201x" here is the year of the analysis.
- self.df_output - dataset being analyzed
- self.ProCode - province to analyze (as a numeric code)
- self.YearOutput - year being analyzed (mainly used for pandas profiling)
- OutputResult(self) - Displays the analysis result.
- OutputPandaProfiling(self) - Runs pandas profiling for a specific province.
For the first input (variable categorized_province),
enter the province to analyze. The full province name is required; otherwise an error is raised.
For the second input (variable list_indicator),
enter one of the numeric codes from 0 to 6 below:
- "0. Average annual hours worked"
- "1. Average annual wages and salaries"
- "2. Average hourly wage"
- "3. Average weekly hours worked"
- "4. Hours Worked"
- "5. Number of jobs"
- "6. Wages and Salaries"
Enter the indicator as its numeric code; any other input raises an error.
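
The two input checks described above can be sketched as follows. parse_indicator, check_province, and KNOWN_PROVINCES are hypothetical names introduced for illustration; in practice the set of valid province names would come from the dataset's GEO values.

```python
# Indicator names in the order of their numeric codes (0 to 6).
LIST_INDICATOR = [
    "Average annual hours worked",
    "Average annual wages and salaries",
    "Average hourly wage",
    "Average weekly hours worked",
    "Hours Worked",
    "Number of jobs",
    "Wages and Salaries",
]

# Placeholder subset of full province names, for illustration only.
KNOWN_PROVINCES = {"Alberta", "Ontario", "Quebec"}


def parse_indicator(raw):
    """Turn input such as "3" into an indicator name, or raise ValueError."""
    text = raw.strip()
    if not text.isdecimal():
        raise ValueError(f"indicator must be a number from 0 to 6, got {raw!r}")
    code = int(text)
    if code >= len(LIST_INDICATOR):
        raise ValueError(f"indicator code out of range: {code}")
    return LIST_INDICATOR[code]


def check_province(raw):
    """Return the province name if it is a known full name, else raise."""
    name = raw.strip()
    if name not in KNOWN_PROVINCES:
        raise ValueError(f"unknown province (full name required): {raw!r}")
    return name


print(check_province("Ontario"), "/", parse_indicator("3"))
```

Validating both inputs before touching the data keeps a mistyped province or indicator from surfacing later as a confusing lookup error.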
- Data_Analysis_x - Contains the latest modified work. The most recent is Data_Analysis_v07.
- 36100651-eng.zip - Contains the original dataset on employment in non-profit organizations.
- 36100651.csv - Contains the original dataset on employment in non-profit organizations as a CSV file.
- EDA_Report_v00.pdf - Initial EDA report, produced before splitting the dataset.
- data_analysis_categorized_technical_report.ipynb - Contains the technical report as a Jupyter Notebook.
- data_analysis_categorized_technical_report.py - Contains the technical report as a Python file.
- data_analysis_categorized_technical_report.html - Contains the technical report as an HTML file.
- data_analysis_categorized_technical_report.pdf - Contains the technical report as a PDF file.