
Intro to Data Analysis With Python

zaklang123 edited this page Aug 14, 2023 · 45 revisions

Overview

In the wild, data is dirty and often disorganized. To make a resource useful, you may need to filter it, modify it, and combine it with other resources. Arriving at your final dataset may require performing these operations thousands of times in sequence, which is where Python comes in handy. You may also need to retrace your steps, whether to teach somebody else the process or to recover after your hard drive crashes (back up your hard drive!).

Data tools in Python let you create organized sequences of operations called "data pipelines" that assemble and clean raw data until you arrive at a dataset you can actually use.

This tutorial will begin with a brief guide to loading datasets into Python, followed by an introduction to exploratory data analysis, and conclude with steps to clean and modify data. After this tutorial, you should be prepared to clean and manipulate raw datasets.

Prerequisites

This tutorial assumes that you are familiar with Python package managers such as conda and pip, have a basic understanding of working in Python (writing if/else and while statements, assigning variables, etc.), and have at least a passing familiarity with the most commonly used tools in the standard library.

Jupyter Notebooks (optional integrated development environment (IDE))

Jupyter Notebooks are a useful interface for doing data analysis. Most pandas practitioners use Jupyter at least occasionally: the two tools are well integrated, notebooks present results cleanly, and they make live coding a much tidier exercise.

1: Loading Data into Python

Resources

Pandas

Pandas is the workhorse of Python data analysis. Its DataFrame data structure provides a huge variety of tools. In addition, pandas is supported by a wide variety of Python packages for specialized data analysis and machine learning, which makes it a valuable core competency.
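As a minimal sketch of what working with a DataFrame looks like, you can build one directly from a Python dictionary (the city names and numbers below are invented for illustration):

```python
import pandas as pd

# Build a small DataFrame from a dictionary of columns
# (values here are made up for demonstration purposes)
df = pd.DataFrame({
    "city": ["Boston", "Austin", "Denver"],
    "population": [675000, 961000, 715000],
})

# Columns are accessed by name; arithmetic applies element-wise
df["population_millions"] = df["population"] / 1_000_000
print(df)
```

Operating on whole columns at once, rather than looping row by row, is the core pattern that makes pandas pipelines both concise and fast.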

  • Official Pandas Tutorial Up to date and well maintained tutorial focused on getting you up to speed and running quickly
  • Daniel Chen Pandas Tutorial Good in-depth video walkthrough showing a full data analysis with explanations
  • Brandon Rhodes Pandas Tutorial Considered by many people the definitive intro to pandas. Be aware that some small changes have happened to the way pandas works since this was filmed, so you may need to google if the code examples don't work exactly as shown.

2: Exploratory Data Analysis

Exploratory Data Analysis (EDA)

EDA is the process of investigating a new dataset and cataloguing its features. Broadly it's the process of getting to know your data, getting it in the right format, and identifying any inconsistencies it might have. EDA should always be your first step when you get a new dataset, even if it's brief. Otherwise your conclusions may not mean what you think they do.

EDA is very personalized and is really all about learning to think deeply about a new dataset and cover your bases in a methodical way while keeping an eye out for any interesting trends. The below are provided as examples, but none are an authoritative workflow.

Loading Data into a Pandas DataFrame:

import pandas as pd

# replace the placeholder path below with the path to your CSV file
df = pd.read_csv('path/to/your_file.csv')

Useful EDA Methods and Attributes of Pandas DataFrame (df) Type:

*Attribute: A value associated with an object or class which is referenced by name using dot notation.

*Method: A function that belongs to a class and typically performs an action or operation. 

df.head() returns the first rows of the dataframe (five by default).

df.info() summarizes the dataframe (column names, dtypes, non-null counts).

df.describe() returns descriptive statistics for numeric columns (count, mean, std, min, quartiles, max).

df.shape returns a tuple with the shape of the dataframe (ex: (2, 3) for a dataframe with 2 rows and 3 columns).

df.size returns the number of cells in the dataframe (rows × columns).
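A quick sketch of these calls on a small invented DataFrame:

```python
import pandas as pd

# A tiny made-up DataFrame to demonstrate the EDA calls above
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10.0, 20.0, 30.0, 40.0],
})

print(df.head())      # first rows (five by default)
df.info()             # column dtypes and non-null counts (prints directly)
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
print(df.shape)       # (4, 2): 4 rows, 2 columns
print(df.size)        # 8: total number of cells
```

Note that df.info() prints its summary directly rather than returning a value, while the others return objects you can inspect or store.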

3: Cleaning/Modifying Data
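Common cleaning operations include normalizing column names, filling missing values, and dropping unusable rows. A minimal sketch of these steps in pandas (the column names and values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Invented messy data: a column name with trailing whitespace,
# a missing name, and a missing score
df = pd.DataFrame({
    "Name ": ["alice", "bob", None],
    "score": [90.0, np.nan, 75.0],
})

# Normalize column names: strip whitespace, lowercase
df.columns = df.columns.str.strip().str.lower()

# Fill missing numeric values with the column mean
df["score"] = df["score"].fillna(df["score"].mean())

# Drop rows still missing a required field
df = df.dropna(subset=["name"])
print(df)
```

Whether to fill, drop, or flag missing values depends on the dataset and the question you are asking; the choices above are only one reasonable option.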

NumPy

NumPy is the library that underlies most Python data tools. It operates at a lower level and provides many optimized mathematical operations for working with large arrays. It is especially useful for linear algebra operations like matrix multiplication, which are ubiquitous in machine learning and deep learning. Pandas is built on top of NumPy, and many of its data structures and operations behave the way they do because of NumPy's code and philosophy. For a deeper understanding of how to manipulate data, a working knowledge of NumPy is very powerful.
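As a small sketch of the kind of operation NumPy optimizes, here is a matrix multiplication and an element-wise operation on invented 2×2 arrays:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Matrix multiplication (the @ operator calls np.matmul)
product = a @ b

# Element-wise operations apply to every entry without a Python loop
scaled = a * 10
print(product)
print(scaled)
```

Both operations run in optimized compiled code, which is why NumPy-backed tools like pandas can handle large datasets efficiently.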

Specialized Statistics Libraries

Once you have your data organized there are a number of options for doing data processing, drawing statistical conclusions, or building machine learning models. Explaining the inner workings and theory of these packages is beyond the scope of this tutorial, but if you want to investigate they are very powerful and useful tools. In some cases they can be useful for basic tasks like finding outliers or performing similar tasks using statistics-guided approaches.

  • scikit-learn The standard for performing general machine learning and testing tasks in Python.
  • statsmodels Includes specialized and basic statistical techniques with more comprehensive, human-readable output than scikit-learn. Useful for frequentist statistics tasks.
  • SciPy Useful for performing optimized numerical operations.
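As one example of the statistics-guided approaches mentioned above, a simple z-score filter flags points far from the mean. The data and cutoff below are invented for illustration; a threshold of 2–3 standard deviations is a common rule of thumb, not a universal rule:

```python
import numpy as np
from scipy import stats

# Invented data with one obviously extreme value
data = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 55.0])

# z-score: how many standard deviations each point is from the mean
z = np.abs(stats.zscore(data))

# Keep only points more than 2 standard deviations out
outliers = data[z > 2]
print(outliers)
```

More robust approaches (e.g., methods based on the median or interquartile range) are often preferable when the data is heavily skewed, since extreme values inflate the mean and standard deviation themselves.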

Issues used in the creation of this page

#143

Contributors

Ryan Swan
