Airbnb Price Prediction Project (Chicago)

Foreword

This project is part of the James Rocco Research Scholarship provided by Lake Forest College and was carried out under the supervision of Prof. Arthur Bousquet. The main idea is based on an article by Graciela Carrillo posted on Towards Data Science.

Installation

Using Anaconda, create a new environment from environment.yml:

conda env create --file environment.yml

File Descriptions (Follow in order)

To read all the notebooks conveniently, follow this link.

Notebooks/:

EDA.ipynb (Exploratory Data Analysis) - A brief overview and analysis of raw data
kepler_map.ipynb - Visualization of the whole dataset using Kepler.gl
data_preprocessing.ipynb - Preprocessing the data for future uses (outlier detection, feature selection, handling missing data, etc.)
regressions.ipynb - Development of initial price prediction models
cta_mapping.ipynb - Visualization of geo_loc.py output using Folium maps (map of routes to CTA stations within the radius and shortest path detection)
model.ipynb - Final model for price prediction that compares the results of datasets with and without the newly produced variables

Scripts:

geo_loc.py - A Python script for geospatial analysis that creates 5 new variables using libraries such as OSMnx and OpenRouteService:
- Restaurants - Number of restaurants within a 1000-meter radius
- Cafes - Number of cafes in the radius
- Bars - Number of bars in the radius
- CTA - Number of CTA (Chicago subway) stations in the radius
- time_to_cta_minutes - Time in minutes to the nearest CTA station (can be out of the radius)
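
The radius counts above can be illustrated with a minimal, dependency-free sketch. It is not the code in geo_loc.py, which relies on OSMnx and OpenRouteService; the function names, the straight-line (haversine) distance, and the coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within_radius(listing, places, radius_m=1000):
    """Count places whose straight-line distance to the listing is at most radius_m."""
    lat, lon = listing
    return sum(haversine_m(lat, lon, p_lat, p_lon) <= radius_m for p_lat, p_lon in places)


# Hypothetical coordinates, for illustration only.
listing = (41.8781, -87.6298)
restaurants = [
    (41.8790, -87.6300),  # about 100 m away
    (41.8850, -87.6298),  # about 770 m away
    (41.9000, -87.6298),  # about 2.4 km away, outside the radius
]
restaurant_count = count_within_radius(listing, restaurants)
print(restaurant_count)
```

The same function could produce the Cafes, Bars, and CTA counts by passing a different list of places.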

Project Description and Results

The main goal of this project is to build a model that predicts the price of a listing given its independent variables (features). The data for both the dependent variable (price) and the independent variables is available through Insideairbnb.com. To get a general understanding of the data used for this project, take a look at the map below, where the data is projected onto a map of Chicago. Listings (i.e. rows in the dataset) are grouped within hexagons whose height represents the listing count and whose color represents the price range.

The accuracy, i.e. how well the model performs, is measured by R², a metric commonly used for regression models that represents the proportion of the variance in the dependent variable that is explained by the independent variables. To further improve the accuracy and add originality to the project, 5 new variables are created by analyzing surrounding areas and fetching distances to chosen types of locations, as well as calculating the walking time to the nearest subway station.
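
To make the metric concrete, here is a small sketch that computes R² directly from its definition, 1 - SS_res / SS_tot. The function name and the prices are hypothetical and are not taken from the notebooks.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical nightly prices and model predictions, for illustration only.
actual_prices = [100, 150, 200, 250]
predicted_prices = [110, 140, 210, 240]
score = r_squared(actual_prices, predicted_prices)
print(round(score, 3))
```

A model that always predicts the mean price scores 0, and a perfect model scores 1, so the values in the results table can be read as the share of price variance each model explains.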

Since it is possible to visualize locations and routes, below you can see a map with routes to all subway stations within a range of 1000 m (the circle), with the shortest route colored in green. Red dots represent subway stations that lie outside the chosen radius.
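
The logic behind the map can be sketched as follows: count the stations inside the radius and pick the route with the shortest walking time among all stations. This is only an illustration under stated assumptions; the `StationRoute` class, the `summarize_routes` function, the station names, and all numbers are hypothetical, and the real script obtains routes from OpenRouteService.

```python
from dataclasses import dataclass


@dataclass
class StationRoute:
    name: str
    distance_m: float    # straight-line distance from the listing to the station
    walk_minutes: float  # walking time along the computed route


def summarize_routes(routes, radius_m=1000):
    """Return (number of stations within radius_m, route with the shortest walking time)."""
    in_radius = [r for r in routes if r.distance_m <= radius_m]
    # The nearest station is chosen over all stations, so it may lie outside the radius.
    nearest = min(routes, key=lambda r: r.walk_minutes)
    return len(in_radius), nearest


# Hypothetical stations, for illustration only.
routes = [
    StationRoute("Station A", distance_m=450, walk_minutes=7.5),
    StationRoute("Station B", distance_m=900, walk_minutes=13.0),
    StationRoute("Station C", distance_m=1600, walk_minutes=22.0),
]
cta_count, nearest = summarize_routes(routes)
print(cta_count, nearest.name, nearest.walk_minutes)
```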

Here is the list of variables used to predict the price of a listing:

Numerical variables:

Accommodates - Number of people a listing can accommodate
Bathrooms - Number of bathrooms
Minimum_nights - Minimum number of nights a listing must be booked for
Maximum_nights - Maximum number of nights a listing can be booked for
Availability_30 - Number of days a listing is available in the next 30 days
Number_of_reviews - Total number of reviews
Number_of_reviews_ltm - Number of reviews within the last twelve months
Restaurants, Bars, Cafes, Universities - Number of places of specified type within 1000 meters from the listing (4 different variables)
Time_to_cta_minutes - Time it takes to walk to the nearest subway station (CTA in Chicago); the station may lie outside the 1000-meter radius

Categorical variables:

Neighbourhood_cleansed - name of the neighborhood a listing is located in
Property_type - type of property a listing is located in (e.g. Apartment, Condominium, House, etc.)
Bed_type - type of bed present in a listing
Cancellation_policy - type of cancellation policy chosen by the host
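
Regression models need numeric inputs, so categorical variables like these are commonly one-hot encoded before fitting. The sketch below shows the idea without dependencies; whether the notebooks encode the data exactly this way is an assumption, the `one_hot` helper is hypothetical, and in practice a library function such as `pandas.get_dummies` would typically do this work.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being expanded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical listings, for illustration only.
listings = [
    {"accommodates": 2, "property_type": "Apartment"},
    {"accommodates": 4, "property_type": "House"},
    {"accommodates": 3, "property_type": "Apartment"},
]
encoded_listings = one_hot(listings, "property_type")
print(encoded_listings[0])
```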

As the file descriptions suggest, a step-by-step approach was taken to build the model. To understand the model and the thought process, you can read through the notebooks.

To achieve the best possible result, I tried various models. These are the results:

Dataset, Metric	Linear Regression	Lasso Regression	Ridge Regression	Lasso Regression with Polynomial Features	Ridge Regression with Polynomial Features	XGBoost
No new variables, Train R²	0.4298	0.4294	0.4296	0.4918	0.5143	0.6428
No new variables, Test R²	0.4607	0.462	0.4615	0.4867	0.4925	0.5391
With new variables, Train R²	0.4387	0.4375	0.4384	0.5036	0.5224	0.6742
With new variables, Test R²	0.411	0.4163	0.4129	0.4503	0.453	0.5445
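
One way to read the table is to compute, for each model, the change in test R² after adding the new variables. The short sketch below does this arithmetic using the test scores reported above; the variable names are illustrative.

```python
models = [
    "Linear Regression",
    "Lasso Regression",
    "Ridge Regression",
    "Lasso Regression with Polynomial Features",
    "Ridge Regression with Polynomial Features",
    "XGBoost",
]
# Test R² values copied from the results table.
test_r2_without = [0.4607, 0.462, 0.4615, 0.4867, 0.4925, 0.5391]
test_r2_with = [0.411, 0.4163, 0.4129, 0.4503, 0.453, 0.5445]

# Change in test R² per model (positive means the new variables helped).
deltas = {
    model: round(with_new - without_new, 4)
    for model, without_new, with_new in zip(models, test_r2_without, test_r2_with)
}
improved = [model for model, delta in deltas.items() if delta > 0]

for model, delta in deltas.items():
    print(f"{model}: {delta:+.4f}")
print("Improved on the test set:", improved)
```

By this measure, only XGBoost scores higher on the test set with the new variables; the linear and polynomial models score lower on the test set even though their train R² rises.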

A good way to see the difference in modeling between the data without and with the new variables is to look at the feature importances computed by XGBoost. Categorical variables are omitted to avoid overcrowding the plot.

As we can see in the bottom plot, the new features have high importance (higher than some of the initial features).

Resources
