This repository holds scripts and notebooks for Steve's musings, investigations, case studies, animations, and slides.
Here's a high-level snapshot of each script.
| File | Language | Dataset | Package | Notes |
|---|---|---|---|---|
NB.R |
R | NaiveBayes.csv |
e1071 |
Simple example of NB. |
arules.Rmd |
R | arules::Groceries |
arules, arulesViz |
|
bigdata.Rmd |
R | N/A | tidyverse |
Just some charts for the big data slides. |
classifiers.R |
R | laheart.csv |
rpart, e1071, MLmetrics |
Compares NB and DT. |
intro.Rmd |
R | gapminder |
tidyr, dplyr, ggplot2 |
An intro to R and the tidyverse. |
recSys.R |
R | recommenderlab::MovieLense |
recommenderlab |
Recommendation system for Movie Lense data. Uses CF. |
slide_plots.Rmd |
R | chirps.csv, Prestige.txt, clusters.csv |
tidytext, tm, tidyverse |
Just a script to create some plots/charts I've used in slides. |
spark-sample.mdR |
R | nycflights13, Lahman |
sparklyr |
Simple of example of how to use sparklyr. |
sql.Rmd |
R | customer.csv, transaction.csv |
sqldf |
Shows how to use the sqldf package. Used for some of my slides on SQL. |
sqlChallenge.Rmd |
R | Lahman |
sqldf |
Used for creating the SQL challenge. |
titanic.Rmd |
R | titanic |
tidyverse, rpart, MLmetrics |
Titanic case study. Builds a DT to predict survival. |
| File | Language | Dataset | Package | Notes |
|---|---|---|---|---|
cluster_20.ipynb |
Python | sklearn.datasets::20newsgroups |
nltk, sklearn |
Clustering the 20 Newsgroup dataset. |
imdb.Rmd |
R | all.imdb.pipe.csv |
tidytext, cleanNLP, tm |
Classifying IMDB data. |
kiva.Rmd |
R | kiva.csv |
tidytext, topicmodels, rpart, MLmetrics |
Classifying KIVA loans. Used as a case study. |
nltk-cluster.py |
Python | sklearn.datasets::20newsgroups |
nltk, sklearn |
I'm not sure how this is different from cluster_20.ipynb |
sentiment-manning.Rmd |
R | manning.csv, brady.csv |
tidytext |
Sentiment analysis on tweets about Peyton Manning and Tom Brady. |
slides_sentiment.R |
R | N/A | tidytext |
Just a script to do some simple tidy-based sentiment analysis on some made-up data. |
slides_text_amazon.Rmd |
R | reviews_Grocery_and_Gourmet_Food_5_50000.csv |
tidytext, tm, wordcloud |
Descriptive stats on Amazon Reviews (Food category). |
slides_text_amazon_classify.R |
R | reviews_Grocery_and_Gourmet_Food_5_50000.csv |
tidytext, tm, caret |
Classifying Amazon reviews. |
slides_text_reuters.Rmd |
R | reutersCSV.csv |
tidytext, tm, wordcloud |
Descriptive stats on Reuters dataset. |
Note: the source isn't actually "Unknown" for most of the data files below. I just haven't done it yet.
| File | Source |
|---|---|
HR_comma_sep.csv |
Unknown |
Master.csv |
Unknown |
NaiveBayes.csv |
Unknown |
Prestige.txt |
Unknown |
Salaries.csv |
Unknown |
all.imdb.pipe.csv |
Unknown |
alltweets.csv |
Unknown |
beta.csv |
Unknown |
beta_12.csv |
Unknown |
chirps.csv |
Unknown |
clusters.csv |
Unknown |
customer.csv |
Unknown |
gamma.csv |
Unknown |
gamma_12.csv |
Unknown |
jackastors.csv |
Unknown |
kiva..csv |
Unknown |
laheart.csv |
Unknown |
laheart2.csv |
Unknown |
site.csv |
Unknown |
student.csv |
Unknown |
survey.csv |
Unknown |
topicnames_12.csv |
Unknown |
transaction.csv |
Unknown |
visited.csv |
Unknown |
groceries.csv |
Unknown |
loan_small.csv |
Unknown |
all.imdb.pipe.csv |
Unknown |
brady.csv |
Unknown |
manning.csv |
Unknown |
reutersCSV.csv |
Unknown |
reviews_Grocery_and_Gourmet_Food_5_50000.csv |
Unknown |