Final Project Overview

The goal of the final project is to answer a data-driven question by working through the data science lifecycle. By the end of the process, you will have an appreciation for what it is like to develop your own questions, match those questions with data, test whether the data can answer them via a model, develop a final solution, and articulate the strengths and weaknesses of your approach. The project pulls together all the skills we have worked on this semester, essentially living the life of a practicing data scientist! It is also a great way to start developing a data project portfolio on your repo or website.

Details:

Work with your lab groups to develop and answer a discrete question related to a dataset of your choosing.
Some dataset resources are listed below for you to potentially use, but you are also welcome to use a dataset you have used in the past or one that is not among the listed resources (not from class). Present a cleanly knitted final presentation that walks the reader through your project step by step. This means you need to reference the data science lifecycle and work through each stage deliberately.

  • kNN
  • Clustering - Kmeans
  • Decision Trees
  • Random Forest

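As a reminder of how the first method in the list works, here is a minimal sketch of kNN classification by majority vote. It is written in dependency-free Python purely for illustration; the function name `knn_predict` and the toy data are hypothetical, and a real project would rely on an established library and a proper train/test split.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, paired with that point's label
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep only the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in two dimensions
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

prediction_low = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
prediction_high = knn_predict(train_X, train_y, (5.1, 5.0), k=3)
```

The choice of `k` and the scaling of the features both affect the result, so they are worth discussing in your Methods section.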
Generate a publishable Rmarkdown document with the following sections:

  • Question and background information on the data and why you are asking your question(s). References to previous research or evidence would be nice to include. You must present your question to me during office hours, either next week on the 26th or the following week on the 3rd.

  • Exploratory Data Analysis – Initial summary statistics and graphs with an emphasis on variables you believe to be important for your analysis.

  • Methods – Techniques you are using to address your question and the results of those methods.

  • Evaluation of your model – Select appropriate metrics and explain the output as it relates to your question.

  • Fairness assessment – Required only if your data include any protected classes.

  • Conclusions – What can you say about the results of the methods section as they relate to your question, given the limitations of your model?

  • Future work – What additional analysis is needed, and what limited your analysis on this project?
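For the evaluation and fairness sections, the arithmetic behind a few common classification metrics can be sketched as follows. This is an illustrative, dependency-free Python example and not a required approach: the helper names and the labels, predictions, and group memberships are all hypothetical, and comparing positive-prediction rates across groups is only one of several possible fairness checks.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def positive_rate_by_group(y_pred, groups, positive=1):
    """Share of positive predictions within each group (a simple fairness check)."""
    counts = {}
    for p, g in zip(y_pred, groups):
        n_pos, n = counts.get(g, (0, 0))
        counts[g] = (n_pos + int(p == positive), n + 1)
    return {g: n_pos / n for g, (n_pos, n) in counts.items()}


# Hypothetical labels, predictions, and protected-class membership
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

metrics = classification_metrics(y_true, y_pred)
rates = positive_rate_by_group(y_pred, groups)
# Ratio of the lowest to the highest group rate; values far below 1 suggest unequal treatment
rate_ratio = min(rates.values()) / max(rates.values())
```

Which metric matters most depends on your question: for example, recall is usually the priority when missing a positive case is costly, while precision matters more when false alarms are expensive.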

Publish the final HTML to RPubs, or create a GitHub page (website) that sits “on top” of your repo using GitHub Pages.

Potential Data Sources:

Google Dataset Search: https://datasetsearch.research.google.com/

COVID-19 - https://github.com/XinerNing/CGDV.github.io/blob/master/dataSource/index.md

data.world - https://data.world/

UCI ML Repository - http://archive.ics.uci.edu/ml/index.php

Data is Plural - https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0