Intro to Data Analysis With Python
In the wild, data is dirty and often disorganized. To make a resource useful you may need to filter it, modify it, and combine it with other resources. Arriving at your final, useful dataset may require performing these operations thousands of times in sequence, and this is where Python comes in handy. You may also need to retrace your steps, either to teach somebody else to do it or to recover when your hard drive crashes (back up your hard drive).
Data tools in Python let you build organized sequences of operations called "data pipelines" that start from raw inputs and assemble, clean, and reshape data until you arrive at a dataset you can actually use.
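As a toy sketch of what such a pipeline can look like (the filename and column names here are hypothetical placeholders):

```python
import pandas as pd

# A toy pipeline: load, clean, and derive new columns in one readable sequence.
# 'measurements.csv' and its columns are hypothetical placeholders.
df = (
    pd.read_csv('measurements.csv')
      .dropna(subset=['temp_f'])                            # drop rows missing a reading
      .assign(temp_c=lambda d: (d['temp_f'] - 32) * 5 / 9)  # add a derived column
      .sort_values('temp_c')                                # order rows for easier inspection
)
```

Each step returns a new dataframe, so the whole sequence reads top to bottom and can be rerun from scratch whenever the source data changes.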
This tutorial will begin with a brief guide to loading datasets into Python, followed by an introduction to exploratory data analysis, and conclude with steps to clean and modify data. After this tutorial, you should be prepared to clean and manipulate raw datasets.
This tutorial assumes that you are familiar with Python package managers like conda and pip, have a basic understanding of working in Python (how to write if/else/while statements, assign variables, etc.), and have at least a passing familiarity with the most commonly used tools in the standard library.
Jupyter Notebooks are a useful interface for doing data analysis. Most practitioners of Pandas use Jupyter at least a little, since the two tools are well integrated; notebooks present results nicely and make live coding a much cleaner exercise.
- Jupyter Install Tutorial How to install and get started with Jupyter
- Jupyter Notebook Tutorial An overview of how to use the high level interface and keybindings for Jupyter notebooks.
Pandas is the workhorse of Python data analysis. Its dataframe data structure makes a huge variety of tools available. In addition, Pandas is supported by a wide range of Python packages for specialized data analysis and machine learning, which makes it a valuable core competency. A minimal example follows the tutorials below.
- Official Pandas Tutorial Up to date and well maintained tutorial focused on getting you up to speed and running quickly
- Daniel Chen Pandas Tutorial Good in-depth video walkthrough showing a full data analysis with explanations
- Brandon Rhodes Pandas Tutorial Considered by many people the definitive intro to pandas. Be aware that some small changes have happened to the way pandas works since this was filmed, so you may need to google if the code examples don't work exactly as shown.
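As a first taste of the dataframe, here is a minimal sketch (the data is made up for illustration):

```python
import pandas as pd

# A dataframe is a table of labeled rows and columns.
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cat'],
    'score': [91, 85, 78],
})

# Columns are selected by name, and rows can be filtered with boolean conditions.
passing = df[df['score'] >= 80]
print(passing)
```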
Exploratory data analysis (EDA) is the process of investigating a new dataset and cataloguing its features. Broadly, it's the process of getting to know your data, getting it into the right format, and identifying any inconsistencies it might have. EDA should always be your first step when you get a new dataset, even if it's brief; otherwise your conclusions may not mean what you think they do.
EDA is very personal: it's about learning to think deeply about a new dataset and cover your bases in a methodical way while keeping an eye out for interesting trends. The resources below are provided as examples, but none is an authoritative workflow.
- A General Intro To EDA A conceptual introduction to the thought process of EDA.
- YouTube EDA Example A quick investigation of a dataset.
- Another YouTube EDA Example
- Kaggle EDA Example One example of an EDA process with executable code. There are many notebooks on Kaggle that involve an EDA; it's a good idea to google around and see how other people have approached the thought process.
Loading a dataset is usually the first concrete step. Pandas reads a CSV file straight into a dataframe:

```python
import pandas as pd

# Replace the path with the location of your own CSV file
df = pd.read_csv('path/to/your_file.csv')
```
- *Attribute:* A value associated with an object or class, referenced by name using dot notation (no parentheses).
- *Method:* A function that belongs to a class and typically performs an action or operation; called with parentheses.
- `df.head()` returns the first rows of the dataframe (five by default).
- `df.info()` prints a summary of the dataframe (column dtypes, non-null counts, memory usage).
- `df.describe()` returns descriptive statistics for the numeric columns (count, mean, standard deviation, min, quartiles, max).
- `df.shape` returns a tuple with the shape of the dataframe (e.g. `(2, 3)` for a dataframe with 2 rows and 3 columns).
- `df.size` returns the number of cells in the dataframe.
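Put together, a first pass over a new dataframe might look like this (using a tiny made-up dataframe so the snippet runs on its own):

```python
import pandas as pd

# Tiny made-up dataframe standing in for freshly loaded data
df = pd.DataFrame({
    'city': ['Austin', 'Boston', 'Chicago'],
    'population': [961855, 675647, 2746388],
})

print(df.head())      # method call: first rows of the dataframe
df.info()             # method call: dtypes and non-null counts
print(df.describe())  # method call: descriptive statistics for numeric columns
print(df.shape)       # attribute (no parentheses): (3, 2)
print(df.size)        # attribute (no parentheses): 6 cells
```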
NumPy is the library that underlies most Python data tools. It is more granular and provides many optimized mathematical operations for working with large arrays. It is especially useful for linear algebra operations like matrix multiplication, which are ubiquitous in machine learning and deep learning. Pandas is built on NumPy, and many of its data structures and operations behave the way they do because they are built on top of NumPy's code and philosophy. For a deeper understanding of how to manipulate data, a working knowledge of NumPy is very powerful; a minimal example follows the tutorials below.
- Official NumPy Tutorial How to get up and running with NumPy
- NumPy Illustrated Graphical guide to NumPy with some good visualized explanations of how things work.
- NumPy For Your Grandma From scratch tutorial covering the fundamental NumPy operations and data structures.
- FreeCodeCamp Tutorial Video covering high level operations in NumPy and using Numpy data structures.
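As a minimal sketch of the array operations described above:

```python
import numpy as np

# Arrays support fast element-wise math across every cell at once
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a * 2)   # every element doubled

# Matrix multiplication with the @ operator
b = np.array([[5.0, 6.0], [7.0, 8.0]])
print(a @ b)   # standard linear-algebra matrix product
```

Pandas columns are typically backed by NumPy arrays under the hood, which is why whole-column operations are fast and why NumPy idioms carry over to dataframes.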
Once you have your data organized there are a number of options for data processing, drawing statistical conclusions, or building machine learning models. Explaining the inner workings and theory of these packages is beyond the scope of this tutorial, but they are very powerful and useful tools if you want to investigate further. They can also help with basic chores like finding outliers using statistics-guided approaches; a small sketch follows the list below.
- scikit-learn The standard for performing general machine learning and testing tasks in Python.
- statsmodels Includes a variety of specialized and basic statistical techniques, with more comprehensive, human-readable output than scikit-learn. Useful for frequentist statistics tasks.
- SciPy Useful for performing optimized numeric and scientific operations.
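For example, here is a minimal sketch of statistics-guided outlier detection using SciPy's `zscore` (the data and the threshold of 2 are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up numeric data with one obvious outlier
df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 98]})

# Flag rows more than 2 standard deviations from the mean
z = np.abs(stats.zscore(df['value']))
print(df[z > 2])   # only the row with 98 is flagged
```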