GitHub

This project used my assigned NLP project as a starting point to determine if I could improve the model. Resampling provided the greatest improvement in prediction while accuracy increased only 3%, the average F1 score increased 14%.

Scraped 1K GitHub repositories with Harry Potter in the search
Filtered out those without a readme file or program language
Filtered for only the top 4 represented program languages: JavaScript, Java, HTML, Python
Cleaned and prepared data using regex and toktoktokenizer
Removed common english stop words and additional words found to create noise in dataset
Vetorized data using TF-IDF
Initial attempts to improve model by adding data, reducing noise, and testing multiple algorithms did not significantly increase accuracy
Determined imbalanced dataset might be contributing factor and Resampled data with SMOTE
This only improved the average model accuracy by 3%, but improved the average F1 score by 14%

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
acquire.py		acquire.py
explore.py		explore.py
model.py		model.py
nlp_HP_readme.ipynb		nlp_HP_readme.ipynb
prepare.py		prepare.py
readme.md		readme.md
wrangle.py		wrangle.py