In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics) and visualizing data. In this cumulative lab, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains information about home sales in Ames, Iowa between 2006 and 2010.
You will be able to:
- Practice loading data with pandas
- Practice calculating measures of centrality and dispersion with pandas
- Practice creating subsets of data with pandas
- Practice using data visualizations to explore data, and interpreting those visualizations
- Perform a full exploratory data analysis process to gain insight about a dataset
Photo by Matt Donders on Unsplash
Each record (row) in this dataset represents a home that was sold in Ames, IA.
Each feature (column) in this dataset is some attribute of that home sale. You can view the file data/data_description.txt
in this repository for a full explanation of all variables in this dataset — 80 columns in total.
We are going to focus on the following features:
SalePrice: Sale price of the house in dollars
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YrSold: Year Sold (YYYY)
YearBuilt: Original construction date
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
In this lab you will use your data munging and visualization skills to conduct an exploratory analysis of the dataset.
Import pandas with the standard alias pd
and load the data into a dataframe with the standard name df
.
Produce summary statistics, visualizations, and interpretive text describing the distributions of SalePrice
, TotRmsAbvGrd
, and OverallCond
.
Separate the data into subsets based on OverallCond
, then demonstrate how this split impacts the distribution of SalePrice
.
Find the features that have the strongest positive and negative correlations with SalePrice
, and produce plots representing these relationships.
Create a new feature Age
, which represents the difference between the year sold and the year built, and plot the relationship between the age and sale price.
In the cell below, import:
pandas
with the standard aliaspd
matplotlib.pyplot
with the standard aliasplt
And set %matplotlib inline
so the graphs will display immediately below the cell that creates them.
# Your code here
Now, use pandas to open the file located at data/ames.csv
(documentation here). Specify the argument index_col=0
in order to avoid creating an extra Id
column. Name the resulting dataframe df
.
# Your code here
The following code checks that you loaded the data correctly:
# Run this cell without changes
# Check that df is a dataframe
assert type(df) == pd.DataFrame
# Check that there are the correct number of rows
assert df.shape[0] == 1460
# Check that there are the correct number of columns
# (if this crashes, make sure you specified `index_col=0`)
assert df.shape[1] == 80
Inspect the contents of the dataframe:
# Run this cell without changes
df
# Run this cell without changes
df.info()
Write code to produce histograms showing the distributions of SalePrice
, TotRmsAbvGrd
, and OverallCond
.
Each histogram should have appropriate title and axes labels, as well as a black vertical line indicating the mean of the dataset. See the documentation for plotting histograms, customizing axes, and plotting vertical lines as needed.
In the cell below, produce a histogram for SalePrice
.
# Your code here
Now, print out the mean, median, and standard deviation:
# Your code here
In the cell below, interpret the above information.
# Replace None with appropriate text
"""
None
"""
In the cell below, produce a histogram for TotRmsAbvGrd
.
# Your code here
Now, print out the mean, median, and standard deviation:
# Your code here
In the cell below, interpret the above information.
# Replace None with appropriate text
"""
None
"""
In the cell below, produce a histogram for OverallCond
.
# Your code here
Now, print out the mean, median, and standard deviation:
# Your code here
In the cell below, interpret the above information.
# Replace None with appropriate text
"""
None
"""
As you might have noted in the previous step, the overall condition of the house seems like we should treat it as more of a categorical variable, rather than a numeric variable.
One useful way to explore a categorical variable is to create subsets of the full dataset based on that categorical variable, then plot their distributions based on some other variable. Since this dataset is traditionally used for predicting the sale price of a house, let's use SalePrice
as that other variable.
In the cell below, create three variables, each of which represents a record-wise subset of df
(meaning, it has the same columns as df
, but only some of the rows).
below_average_condition
: home sales where the overall condition was less than 5average_condition
: home sales where the overall condition was exactly 5above_average_condition
: home sales where the overall condition was greater than 5
# Replace None with appropriate code
below_average_condition = None
average_condition = None
above_average_condition = None
The following code checks that you created the subsets correctly:
# Run this cell without changes
# Check that all of them still have 80 columns
assert below_average_condition.shape[1] == 80
assert average_condition.shape[1] == 80
assert above_average_condition.shape[1] == 80
# Check the numbers of rows of each subset
assert below_average_condition.shape[0] == 88
assert average_condition.shape[0] == 821
assert above_average_condition.shape[0] == 551
The following code will produce a plot of the distributions of sale price for each of these subsets:
# Run this cell without changes
# Set up plot
fig, ax = plt.subplots(figsize=(15,5))
# Create custom bins so all are on the same scale
bins = range(df["SalePrice"].min(), df["SalePrice"].max(), int(df["SalePrice"].median()) // 20)
# Plot three histograms, with reduced opacity (alpha) so we
# can see them overlapping
ax.hist(
x=above_average_condition["SalePrice"],
label="above average condition",
bins=bins,
color="cyan",
alpha=0.5
)
ax.hist(
x=average_condition["SalePrice"],
label="average condition",
bins=bins,
color="gray",
alpha=0.3
)
ax.hist(
x=below_average_condition["SalePrice"],
label="below average condition",
bins=bins,
color="yellow",
alpha=0.5
)
# Customize labels
ax.set_title("Distributions of Sale Price Grouped by Condition")
ax.set_xlabel("Sale Price")
ax.set_ylabel("Number of Houses")
ax.legend();
Interpret the plot above. What does it tell us about these overall condition categories, and the relationship between overall condition and sale price? Is there anything surprising?
# Replace None with appropriate text
"""
None
"""
To understand more about what features of these homes lead to higher sale prices, let's look at some correlations. We'll return to using the full df
, rather than the subsets.
In the cell below, print out both the name of the column and the Pearson correlation for the column that is most positively correlated with SalePrice
(other than SalePrice
, which is perfectly correlated with itself).
We'll only check the correlations with some kind of numeric data type.
You can import additional libraries, although it is possible to do this just using pandas.
# Your code here
Now, find the most negatively correlated column:
# Your code here
Once you have your answer, edit the code below so that it produces a box plot of the relevant columns.
# Replace None with appropriate code
import seaborn as sns
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))
# Plot distribution of column with highest correlation
sns.boxplot(
x=None,
y=df["SalePrice"],
ax=ax1
)
# Plot distribution of column with most negative correlation
sns.boxplot(
x=None,
y=df["SalePrice"],
ax=ax2
)
# Customize labels
ax1.set_title(None)
ax1.set_xlabel(None)
ax1.set_ylabel("Sale Price")
ax2.set_title(None)
ax2.set_xlabel(None)
ax2.set_ylabel("Sale Price");
Interpret the results below. Consult data/data_description.txt
as needed.
# Replace None with appropriate text
"""
None
"""
Here the code is written for you, all you need to do is interpret it.
We note that the data spans across several years of sales:
# Run this cell without changes
df["YrSold"].value_counts().sort_index()
Maybe we can learn something interesting from the age of the home when it was sold. This uses information from the YrBuilt
and YrSold
columns, but represents a truly distinct feature.
# Run this cell without changes
# Make a new column, Age
df["Age"] = df["YrSold"] - df["YearBuilt"]
# Set up plot
fig, ax = plt.subplots(figsize=(15,5))
# Plot Age vs. SalePrice
ax.scatter(df["Age"], df["SalePrice"], alpha=0.3, color="green")
ax.set_title("Home Age vs. Sale Price")
ax.set_xlabel("Age of Home at Time of Sale")
ax.set_ylabel("Sale Price");
Interpret this plot below:
# Replace None with appropriate text
"""
None
"""
Congratulations, you've completed an exploratory data analysis of a popular dataset. You saw how to inspect the distributions of individual columns, subsets of columns, correlations, and new engineered features.