project commit
added project report
mohanramunc committed Dec 5, 2022
1 parent 94d6bab commit 4a23e47
Showing 9 changed files with 464 additions and 121 deletions.
1 change: 1 addition & 0 deletions .password
@@ -0,0 +1 @@
pwd
45 changes: 45 additions & 0 deletions Dockerfile
@@ -0,0 +1,45 @@
FROM rocker/verse
RUN R -e "install.packages(\"caret\")";
RUN R -e "install.packages(\"tidyverse\")";
RUN R -e "install.packages(\"scales\")";
RUN R -e "install.packages(\"ggthemes\")";
RUN R -e "install.packages(\"repr\")";
RUN R -e "install.packages(\"remotes\")";
RUN R -e "install.packages(\"reticulate\")";
RUN R -e "install.packages(\"GGally\")";
RUN R -e "install.packages(\"r2d3\")";
RUN R -e "install.packages(\"svglite\")";
RUN R -e "install.packages(\"stringr\")";
RUN R -e "install.packages(\"ggplot2\")";


RUN R -e "install.packages(\"Seurat\")";
RUN R -e "install.packages(\"BiocManager\")";
RUN R -e "install.packages(\"gbm\")";
RUN R -e "install.packages(\"here\")";
RUN R -e "install.packages(\"tinytex\")";
RUN R -e "install.packages(\"rmarkdown\")";
RUN R -e "install.packages(\"shiny\")";
RUN R -e "install.packages(\"plotly\")";
RUN R -e "install.packages(\"data.table\")";
RUN R -e "BiocManager::install(\"GEOquery\");"
RUN Rscript --no-restore --no-save -e "tinytex::tlmgr_install(c(\"wrapfig\",\"ec\",\"ulem\",\"amsmath\",\"capt-of\"))"
RUN Rscript --no-restore --no-save -e "tinytex::tlmgr_install(c(\"hyperref\",\"iftex\",\"pdftexcmds\",\"infwarerr\"))"
RUN Rscript --no-restore --no-save -e "tinytex::tlmgr_install(c(\"kvoptions\",\"epstopdf\",\"epstopdf-pkg\"))"
RUN Rscript --no-restore --no-save -e "tinytex::tlmgr_install(c(\"hanging\",\"grfext\"))"
RUN Rscript --no-restore --no-save -e "tinytex::tlmgr_install(c(\"etoolbox\",\"xcolor\",\"geometry\"))"
RUN Rscript --no-restore --no-save -e "install.packages(c(\"plumber\"))"
RUN Rscript --no-restore --no-save -e "install.packages(c(\"verification\"))"
RUN Rscript --no-restore --no-save -e "update.packages(ask = FALSE);"
RUN R -e "BiocManager::install(\"cowplot\");"
RUN R -e "BiocManager::install(\"patchwork\");"
RUN R -e "BiocManager::install(\"limma\");"
RUN R -e "BiocManager::install(\"openxlsx\");"
RUN R -e "BiocManager::install(\"ggplot2\");"
RUN R -e "BiocManager::install(\"knitr\");"
RUN R -e "BiocManager::install(\"MAST\");"
RUN R -e "BiocManager::install(\"mclust\");"
RUN R -e "BiocManager::install(\"matrixStats\");"
RUN R -e "BiocManager::install(\"ggpubr\");"
RUN R -e "BiocManager::install(\"SingleCellExperiment\");"
RUN R -e "BiocManager::install(\"pheatmap\");"
35 changes: 0 additions & 35 deletions Dockerfile.txt

This file was deleted.

33 changes: 0 additions & 33 deletions Makefile.txt

This file was deleted.

41 changes: 17 additions & 24 deletions Readme.md
@@ -1,27 +1,19 @@
Covid, A catalyst for new job roles ?
========================================================

https://www.kaggle.com/datasets/jackogozaly/data-science-and-stem-salaries
Data science job analysis, 2017-2021
========================================================================================
(The dataset is in the source_data folder.)

I want to study how job roles evolved over the four-year period covered by the dataset (2017-2021).

Over the past couple of years, it has been widely claimed that process automation and machine learning in industrial workflows would cannibalize existing job roles, creating demand for new roles while rendering traditional ones obsolete. User experience roles in design and research are currently hyped as the next big thing, as data science roles once were. There are also unforeseen factors such as the advent of covid and issues of ethical irresponsibility by corporate giants. In my opinion, studying this dataset can reveal insights into these topics.

One of the questions I have is how covid has affected the cities people take up jobs in, and how the distribution of job roles and their cities has changed since the advent of covid and the adoption of remote work. I hypothesize that more people have taken up roles in smaller cities since covid, owing to the lower living expenses. Do the social media trends that advocate this hold any truth? Do people still choose to work at the headquarters?

Have any new job roles been introduced in recent years? Have any past roles been rendered obsolete, or are any seeing a decline in new hires? Has automation had any impact on this?

What are the new fields that are trending in recent years? Have there been any new fields since the advent of covid? What roles are now witnessing an increasing demand since the advent of covid?
Over the past couple of years, it has been widely claimed that process automation and machine learning in industrial workflows would cannibalize existing job roles, creating demand for new roles while rendering traditional ones obsolete. There are also unforeseen factors such as the advent of covid and economic shifts at corporate giants. In my opinion, studying this dataset can reveal insights into these topics.

Are there any differences in role hiring patterns and demand before and since the advent of covid? How has the rating of the companies changed since the wider adoption of remote work due to covid?
One of the questions I have is how compensation varies with years of experience. I also want to see how the distribution of job roles and their companies has changed over the years, and to identify the most common job titles. What are the top companies for data scientists? What is the racial makeup of data scientist roles, and how does their compensation break down by years of experience and overall? What is the education makeup of data scientist roles, and how does compensation for data scientists vary with education?

Has there been any increase in the roles that oversee ethical responsibilities?
Which companies are best for data scientists? Which are best for software engineers?

How do salaries differ over fields? Has base salary increased over the 4 year period? What roles now offer lucrative salaries? Have data science roles reached saturation? How is the demand for UX roles in companies?

These are some of the questions I hope to get insight into. Answers to these questions will help me and my fellow students to make better-informed decisions in the future.

I hope to use the tools that are taught in the Bios 611 course to work on the dataset and generate the needed insights.
I hope to use the tools taught in the Bios 611 course to work on the dataset and generate the needed insights.


=====================================================================================================================
@@ -32,28 +24,29 @@ Getting Started

Build the docker image by typing:
```
docker build . -t project611
docker build . -t 611project
```
Create a .password file containing the password for the RStudio login.
The default is pwd.

And then start an RStudio by typing:

```
docker run -v $(pwd):/home/rstudio/project -p 8787:8787 -e PASSWORD=<some-password>
docker run -v $(pwd):/home/rstudio/work -p 8787:8787 -e PASSWORD="$(cat .password)" -it 611project
```
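
The command above fails to set a useful password if `.password` is missing or empty. As a minimal sketch (the helper names `read_password` and `rstudio_command` are hypothetical and not part of this repository), the password lookup can fall back to the default `pwd`, and the command can be printed for inspection instead of being executed:

```shell
#!/bin/sh
# Hypothetical helpers, not part of the repository.

# Read the RStudio password from a file (default: .password).
# Falls back to "pwd", the default stated in this README, when the
# file is missing or empty.
read_password() {
    file="${1:-.password}"
    if [ -s "$file" ]; then
        # Use only the first line and drop any carriage return.
        head -n 1 "$file" | tr -d '\r'
    else
        printf '%s\n' "pwd"
    fi
}

# Print (do not run) the docker run command from this README.
# The current directory is quoted so that paths with spaces still work.
rstudio_command() {
    password="$(read_password "${1:-.password}")"
    printf 'docker run -v "%s":/home/rstudio/work -p 8787:8787 -e PASSWORD="%s" -it 611project\n' \
        "$(pwd)" "$password"
}

rstudio_command
```

Printing the command first makes it easy to check the mounted path and password before starting the container; the printed line can then be copied into the terminal.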

Once RStudio is running, connect to it by visiting
http://localhost:8787 in your browser.

To build the final report, visit the terminal in RStudio and type
Change to the work directory and run the commands below in the RStudio terminal to build the report:

```
cd work
make clean
make report.pdf
Alternatively run
```
docker run -v $(pwd):/home/rstudio/project\
--user="rstudio" --workdir="/home/rstudio/project" -t project611\
make report.pdf

```
187 changes: 187 additions & 0 deletions datajob.R
@@ -0,0 +1,187 @@


library(tidyverse)
library(reticulate)
library(stringr)
library(dplyr)
library(scales)
library(ggthemes)
library(ggplot2)
library(here)


salary_data <- read_csv("~/work/source_data/data.csv")


proj_plot <- salary_data %>%
ggplot() + aes(x= totalyearlycompensation) + geom_histogram(bins= 100, fill = "lightblue", color="darkblue") + scale_y_continuous(labels=comma) +
theme(plot.title = element_text(hjust=.5), axis.title = element_text(size=15, face="bold"), axis.text = element_text(size=12)) + xlab("\nTotal Yearly Comp") + ylab("Count\n") +
ggtitle("Total Compensation Distribution") + scale_x_continuous(labels=scales::dollar_format())

ggsave("~/work/figures/Total_Compensation_Distribution.png", plot=proj_plot);

#Salary By Years of Experience
exppay_plot <- salary_data %>%
ggplot() + aes(x= yearsofexperience, y= totalyearlycompensation) + geom_point(size=.5) +
scale_y_continuous(labels=scales::dollar_format()) + theme_fivethirtyeight(base_size=12) +
theme(axis.title = element_text(size=15, face='bold'), axis.text= element_text(size=12), plot.title = element_text(hjust=.5)) + xlab("\n Years of Experience") + ylab("Total Yearly Compensation\n") +
ggtitle("Compensation By Years of Experience")


ggsave("~/work/figures/Compensation_By_Years_of_Experience.png", plot=exppay_plot);


#Record Per Company
compstat <- salary_data %>%
count(company, sort= T) %>%
head(n=15) %>%
ggplot() + aes(x= reorder(company, n), y=n) + geom_col(fill='#0E5E8E') + coord_flip() +
theme_fivethirtyeight(base_size=12) + theme(plot.title= element_text(hjust = .5),
axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12)) +
ylab("\n Count") + xlab("Company\n") + ggtitle("Number of Salaries Recorded Per Company") +
scale_y_continuous(labels=comma)

ggsave("~/work/figures/Number_of_Salaries_Recorded_Per_Company.png", plot=compstat);

#Record of Titles
titleplot <- salary_data %>%
count(title, sort= T) %>%
head(n=10) %>%
ggplot() + aes(x= reorder(title, n), y=n) + geom_col(fill='#0E5E8E') + coord_flip() +
theme_fivethirtyeight(base_size=12) + theme(plot.title= element_text(hjust = .5), axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12)) +
ylab("\n Count") + xlab("Title\n") + ggtitle("Most Common Job Titles") +
scale_y_continuous(labels=comma)

ggsave("~/work/figures/Most_Common_Job_Titles.png", plot=titleplot);



#Top Companies For Data Scientists
topcomp <- salary_data %>%
filter(title== "Data Scientist")%>%
count(company, sort= T) %>%
head(n=15) %>%
ggplot() + aes(x= reorder(company, n), y=n) + geom_col(fill='#0E5E8E') + coord_flip() +
theme_fivethirtyeight(base_size=12) + theme(plot.title= element_text(hjust = .5), axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12)) +
ylab("\n Count") + xlab("Company\n") + ggtitle("Top Companies For Data Scientists")

ggsave("~/work/figures/Top_Companies_For_Data_Scientists.png", plot=topcomp);

#Racial breakdown
byrace <- salary_data %>%
filter(title== "Data Scientist")%>%
drop_na(Race) %>%
count(Race, sort= T) %>%
mutate(percent = n/sum(n))%>%
ggplot() + aes(x= reorder(Race, percent), y=percent) + geom_col(fill='#0E5E8E') + coord_flip() +
theme_fivethirtyeight(base_size=12) + theme(plot.title= element_text(hjust = .5), axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12)) +
ylab("") + xlab("") + ggtitle("Racial Makeup For Data Scientist Roles") +
scale_y_continuous(labels = scales::percent, breaks = pretty_breaks(8))
ggsave("~/work/figures/Racial_Makeup_For_Data_Scientist_Roles.png", plot=byrace);

#Data Scientist compensation by years of experience and race
datapay <- salary_data %>%
filter(title== "Data Scientist")%>%
drop_na(Race) %>%
ggplot() + aes(x= yearsofexperience, y= totalyearlycompensation, group= Race, color= Race) + geom_point(size=2) +
scale_y_continuous(labels=scales::dollar_format()) + theme_fivethirtyeight(base_size=12) +
theme(axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12), plot.title = element_text(hjust=.5),legend.text = element_text(size=12),
legend.title = element_text(size=12)) + xlab("\n Years of Experience") + ylab("Total Yearly Compensation\n") +
ggtitle("Compensation By Years of Experience")+ guides(colour = guide_legend(override.aes = list(size=6)))
ggsave("~/work/figures/Compensation_By_Years_of_Exp.png", plot=datapay);

#Education breakdown for Data Scientists
dataedu <- salary_data %>%
filter(title== "Data Scientist")%>%
drop_na(Education) %>%
count(Education, sort= T) %>%
mutate(percent = n/sum(n))%>%
ggplot() + aes(x= reorder(Education, percent), y=percent) + geom_col(fill='#0E5E8E') + coord_flip() +
theme_fivethirtyeight(base_size=12) + theme(plot.title= element_text(hjust = .5), axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12)) +
ylab("\n Percent") + xlab("Education\n") + ggtitle("Education Makeup For Data Scientist Roles") +
scale_y_continuous(labels = scales::percent, breaks = pretty_breaks(8))

ggsave("~/work/figures/Education_Makeup_For_Data_Scientist_Roles.png", plot=dataedu);

#Compensation By Education
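# Label helper for stat_summary(): returns each group's observation count,
# positioned slightly above the group median (tune the 1.05 multiplier to move the label).
give.n <- function(x){
  return(c(y = median(x)*1.05, label = length(x)))
}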
datatop <- salary_data %>%
filter(title== "Data Scientist")%>%
drop_na(Education) %>%
ggplot() + aes(x= reorder(Education, -totalyearlycompensation), y= totalyearlycompensation, group= Education, color= Education) + geom_boxplot(size=1.3) +
scale_y_continuous(labels=scales::dollar_format()) + theme_fivethirtyeight(base_size=12) +
theme(axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12), plot.title = element_text(hjust=.5),legend.position="none", plot.subtitle = element_text(hjust = 0.5, size= 14)) +
xlab("\n Education") + ylab("Total Yearly Compensation\n") +
ggtitle("Compensation For Data Scientists By Education", subtitle = "Distribution for each education level with a count of observations per group")+ guides(colour = guide_legend(override.aes = list(size=1.5)))+
stat_summary(fun.data = give.n, geom = "text", fun = median,vjust = 0)

ggsave("~/work/figures/compensation_For_Data_Scientists_By_Education.png", plot=datatop);



#Top 15 companies by number of Data Scientist records
companies <- salary_data %>%
filter(title== "Data Scientist")%>%
count(company, sort= T) %>%
head(n=15)
companies <- companies$company

#Average Data Scientist compensation at those companies
datacomptop <- salary_data %>%
filter(title== "Data Scientist")%>%
filter(company %in% companies) %>%
group_by(company) %>%
summarise(average_comp= mean(totalyearlycompensation))%>%
ggplot() + aes(x= reorder(company,average_comp), y= average_comp) + geom_col(size=2, fill='#0E5E8E') + coord_flip() +
scale_y_continuous(labels=scales::dollar_format()) + theme_fivethirtyeight(base_size=12) +
theme(axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12), plot.title = element_text(hjust=.5),legend.text = element_text(size=10),
legend.title = element_text(size=12)) + xlab("") + ylab("\nTotal Yearly Compensation") +
ggtitle("Avg Data Scientist Total Compensation")

ggsave("~/work/figures/avg_Data_Scientist_Total_Compensation.png", plot=datacomptop);

#Data Scientist compensation distribution at those companies
datacompdist <- salary_data %>%
filter(title== "Data Scientist")%>%
filter(company %in% companies) %>%
ggplot() + aes(x= reorder(company,totalyearlycompensation), y= totalyearlycompensation) + geom_boxplot(size=1, fill='#0E5E8E', color='black') + coord_flip() +
scale_y_continuous(labels=scales::dollar_format()) + theme_fivethirtyeight(base_size=12) +
theme(axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12), plot.title = element_text(hjust=.5),legend.text = element_text(size=10),
legend.title = element_text(size=12)) + xlab("") + ylab("\nTotal Yearly Compensation") +
ggtitle("Data Scientist Total Compensation Distribution")


ggsave("~/work/figures/Data_Scientist_Total_Compensation_Distribution.png", plot=datacompdist);


#Top 15 companies by number of Software Engineer records
companies <- salary_data %>%
filter(title== "Software Engineer")%>%
count(company, sort= T) %>%
head(n=15)
companies <- companies$company

#Average Software Engineer compensation at those companies
sdecomp <- salary_data %>%
filter(title== "Software Engineer")%>%
filter(company %in% companies) %>%
group_by(company) %>%
summarise(average_comp= mean(totalyearlycompensation))%>%
ggplot() + aes(x= reorder(company,average_comp), y= average_comp) + geom_col(size=2, fill='#0E5E8E') + coord_flip() +
scale_y_continuous(labels=scales::dollar_format()) + theme_fivethirtyeight(base_size=12) +
theme(axis.title = element_text(size=15, face='bold'),
axis.text= element_text(size=12), plot.title = element_text(hjust=.5),legend.text = element_text(size=10),
legend.title = element_text(size=12)) + xlab("") + ylab("\nTotal Yearly Compensation") +
ggtitle("Avg Software Engineer Total Compensation")



ggsave("~/work/figures/Avg_Software_Engineer_Total_Compensation.png", plot=sdecomp);

