
Intro to Data Analysis With Python


Overview

In the wild, data is dirty and often disorganized. To make a resource useful, you may need to filter it, modify it, and combine it with other resources. You may need to perform these operations thousands or millions of times in sequence to arrive at your final, useful dataset. And you may need to retrace your steps and do it all again, whether to teach somebody else how or because your hard drive crashed. (Back up your hard drive.)

Data tools in Python let you build organized sequences of operations called "data pipelines" that start from raw resources and assemble, clean, and reshape them until you arrive at a dataset you can actually use.

1: Loading Data into Python

2: Exploratory Data Analysis

3: Cleaning/Modifying Data

Prerequisites

This tutorial assumes you are familiar with Python package managers like conda and pip, have a basic understanding of working in Python (how to write if/else/while statements, assign variables, etc.), and have at least a passing familiarity with the most commonly used tools in the standard library.
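
If you need to install the core libraries used below, pandas and NumPy can be installed with pip install pandas numpy or conda install pandas numpy.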

Jupyter Notebooks (an optional integrated development environment, or IDE)

Jupyter Notebooks are a useful interface for doing data analysis. Most Pandas practitioners use Jupyter at least a little, since the two tools integrate very well; Jupyter also presents results nicely and makes live coding a much cleaner exercise.
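
If you want to try it, Jupyter can be installed with pip install notebook and launched from a terminal with jupyter notebook.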

Resources

Pandas

Pandas is the workhorse of Python data analysis. Its DataFrame data structure makes a huge variety of tools available. In addition, Pandas is supported by a wide variety of specialized Python packages for data analysis and machine learning, which makes it a valuable core competency.

  • Official Pandas Tutorial: An up-to-date, well-maintained tutorial focused on getting you up and running quickly.
  • Daniel Chen Pandas Tutorial: A good in-depth video walkthrough of a full data analysis, with explanations.
  • Brandon Rhodes Pandas Tutorial: Considered by many the definitive intro to pandas. Be aware that pandas has changed in small ways since this was filmed, so you may need to google if the code examples don't work exactly as shown.

Exploratory Data Analysis (EDA)

EDA is the process of investigating a new dataset and cataloguing its features: getting to know your data, getting it into the right format, and identifying any inconsistencies it might have. EDA should always be your first step with a new dataset, even if it's brief; otherwise your conclusions may not mean what you think they do.

EDA is very personalized. It's really about learning to think deeply about a new dataset and cover your bases in a methodical way while keeping an eye out for interesting trends. The tools below are provided as examples, not as an authoritative workflow.

Loading Data into a Pandas DataFrame:

import pandas as pd

# replace the placeholder with the path to your CSV file
df = pd.read_csv('path/to/your_data.csv')
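
Pandas provides similar loaders for other common formats as well. A minimal sketch (the file names here are hypothetical placeholders):

# reading .xlsx files may require an extra dependency such as openpyxl
df_excel = pd.read_excel('path/to/your_data.xlsx')

df_json = pd.read_json('path/to/your_data.json')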

Useful EDA Methods and Attributes of Pandas DataFrame (df) Type:

*Attribute: A value associated with an object or class which is referenced by name using dot notation.

*Method: A function that belongs to a class and typically performs an action or operation. 

df.head() returns the first rows of the dataframe (5 by default).

df.info() prints a concise summary of the dataframe (column dtypes, non-null counts, memory usage).

df.describe() returns descriptive statistics of the dataframe's numeric columns (count, mean, std, min, quartiles, max).

df.shape returns a tuple with the shape of the dataframe (ex: (2, 3) for a dataframe with 2 rows and 3 columns).

df.size returns the number of cells in the dataframe (rows × columns).
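
As a quick illustration, here is a minimal sketch that runs these methods and attributes on a small hand-made dataframe (the column names and values are invented for the example):

import pandas as pd

# a tiny example dataframe
df = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cal'],
    'age': [34, 28, 45],
})

print(df.head())      # first rows (up to 5 by default)
df.info()             # prints dtypes and non-null counts directly
print(df.describe())  # summary statistics for the numeric 'age' column
print(df.shape)       # (3, 2); attributes are accessed without parentheses
print(df.size)        # 6 cells (3 rows x 2 columns)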

NumPy

NumPy is the library that underlies most Python data tools. It is lower-level and provides many optimized mathematical operations for working with large arrays. It is especially useful for linear algebra operations like matrix multiplication, which are ubiquitous in machine learning and deep learning. Pandas is built on NumPy, and many of its data structures and operations act the way they do because they inherit NumPy's code and philosophy. For a deeper understanding of how to manipulate data, a working knowledge of NumPy can be very powerful.
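
As a brief illustration, here is a minimal sketch of the kind of array operations NumPy optimizes (the values are invented for the example):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

print(a * b)           # elementwise multiplication
print(a @ b)           # matrix multiplication (the @ operator calls np.matmul)
print(a.mean(axis=0))  # column means, computed in optimized compiled code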

Specialized Statistics Libraries

Once you have your data organized, there are a number of options for processing it, drawing statistical conclusions, or building machine learning models. Explaining the inner workings and theory of these packages is beyond the scope of this tutorial, but they are very powerful and useful tools if you want to investigate. In some cases they can help with basic tasks like finding outliers using statistics-guided approaches (see the sketch after the list below).

  • scikit-learn: The standard for general machine learning and model-testing tasks in Python.
  • statsmodels: Includes various specialized and basic statistical techniques, with more comprehensive, human-readable output than scikit-learn. Useful for frequentist statistics tasks.
  • SciPy: Useful for performing optimized numeric operations.
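
As one small example of a statistics-guided approach, here is a minimal sketch that flags outliers using z-scores via SciPy (the values and the cutoff of 2 are invented for illustration):

import numpy as np
from scipy import stats

values = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 55.0])

z = np.abs(stats.zscore(values))  # distance from the mean in standard deviations
print(values[z > 2])              # flags 55.0 with this (arbitrary) cutoff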

Issues used in the creation of this page

#143

Contributors

Ryan Swan
