This is an exploratory learning project focusing on data analysis, visualization, and presentation.
It uses the King County Housing dataset, which contains information about home sales in King County (USA). This is a popular public dataset; you can find more information about it here: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction/code. You can find descriptions of the column names here [link to column_names.md].
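As a first orientation, the data can be loaded and inspected with pandas. The sketch below is self-contained: a tiny inline sample stands in for the real CSV, and the file name in the comment is an assumption, not a guaranteed path.

```python
import io
import pandas as pd

# In the project you would point read_csv at the downloaded dataset,
# e.g. pd.read_csv("data/kc_house_data.csv") -- the path is an assumption.
# A tiny inline sample stands in here so the snippet runs on its own.
sample = io.StringIO(
    "id,date,price,bedrooms,sqft_living,waterfront\n"
    "7129300520,20141013T000000,221900,3,1180,0\n"
    "6414100192,20141209T000000,538000,3,2570,0\n"
)
df = pd.read_csv(sample)

print(df.shape)            # (2, 6)
print(df["price"].mean())  # 379950.0
```

With the real file you would follow the same pattern, then use `df.info()` and `df.describe()` to get a feel for the columns before any analysis.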
You are a real estate agent in King County, and your client is looking for a property to buy with specific needs. They are relying on you to provide insights and recommendations that will help them decide on a property to purchase, taking into account location, timing, pricing, etc. The presentation to the client can be found here. A detailed notebook showing how each calculation was performed can be found here.
Larry Sanders, 45 years old
- Family: married, 3 children
- Occupation:
- Property requirements: waterfront with a view; isolated and wooded with minimal neighbors (or older neighbors)
- Neighborhood: nice, central
- Schools: not a requirement
- Budget: limited (needs a range)
- Additional details: the kids are homeschooled or attend school virtually to avoid germs. The family is close-knit and spends time together, so lot size is important. The family is fine with a home that may need renovation.
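These requirements translate naturally into a pandas filter. The sketch below uses a toy DataFrame so it runs on its own; the column names (`waterfront`, `view`, `sqft_lot`, `price`) follow the King County schema, while the budget and lot-size thresholds are purely illustrative assumptions, not the client's actual numbers.

```python
import pandas as pd

# Toy stand-in for the King County data; in the project you would load
# the real CSV instead.
homes = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [450_000, 820_000, 390_000, 1_200_000],
    "waterfront": [1, 1, 0, 1],
    "view": [3, 4, 0, 4],
    "sqft_lot": [12_000, 25_000, 5_000, 40_000],
})

# Illustrative thresholds for Larry's profile: waterfront with a view,
# a large lot, a limited budget. No condition floor, since the family
# is fine with a renovation project.
BUDGET_MAX = 900_000   # assumption -- the actual range comes from the client
MIN_LOT_SQFT = 10_000  # assumption -- "lot size is important"

candidates = homes[
    (homes["waterfront"] == 1)
    & (homes["view"] >= 3)
    & (homes["sqft_lot"] >= MIN_LOT_SQFT)
    & (homes["price"] <= BUDGET_MAX)
]
print(candidates["id"].tolist())  # [1, 2]
```

The same boolean-mask pattern extends to location and timing criteria once those are quantified (e.g. by zipcode or sale month).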
The following sections walk you through the requirements needed to run the project and get you set up step by step.
The following packages are required for this project. Included are descriptions of each one to help you understand what they do, how you'll use them, and why they are helpful.
These packages will be automatically installed when you run `pip install -r requirements.txt` as described in the Setup section below.
| Package | Description |
|---|---|
| Altair (5.3.0) | A declarative, beginner-friendly library for creating clean and interactive charts. Great for quick visualizations directly from pandas DataFrames. |
| Pandas (2.2.2) | The essential library for working with structured data in Python. Makes it easy to clean, filter, and analyze data stored in tables or CSV files. |
| NumPy (1.26.4) | Adds fast, efficient mathematical tools for handling large numerical arrays. It’s the backbone for most data and machine-learning libraries. |
| Matplotlib (3.9.1) | The classic Python plotting library for creating static charts such as line, bar, or scatter plots. Highly customizable for data presentation. |
| Seaborn (0.13.2) | Builds on Matplotlib to make beautiful, easy-to-read statistical graphics (like heatmaps, violin plots, and distributions) with minimal code. |
| Plotly (5.24.1) | Used for creating dynamic, interactive, and zoomable visualizations that work in notebooks or dashboards. Great for exploratory data analysis. |
| Scikit-Learn (1.5.1) | A robust library for machine learning. Includes ready-made algorithms for prediction, classification, and clustering, plus tools for data preparation. |
| GeoPandas (1.0.1) | Extends pandas to handle geographic data — like coordinates, shapes, and maps — making spatial analysis simple and visual. |
| SQLAlchemy (2.0.15) | A Python toolkit that simplifies connecting to and querying SQL databases, allowing you to use Pythonic commands instead of raw SQL. |
| psycopg2-binary (2.9.7) | A PostgreSQL database adapter that lets Python applications (like SQLAlchemy or pandas) talk directly to a PostgreSQL database. |
| python-dotenv (1.0.0) | Loads environment variables (like passwords or API keys) from a .env file into your project safely, so sensitive info isn’t hard-coded. |
| pytest (8.3.3) | A simple but powerful testing framework for writing and running unit tests. Helps ensure your code works as expected and stays reliable over time. |
The first step is to clone this repository, which can be done from the green Code button above. For more information on git, check out the step-by-step cheat sheets here [shiny-octo]. One of the first steps when starting any data science project is to create a virtual environment. For this project you have to create the environment from scratch yourself; however, you should already be familiar with the commands you will need. The general workflow consists of:
- setting the python version locally to 3.11.3
- creating a virtual environment using the `venv` module
- activating your newly created environment
- upgrading `pip` (this step is not absolutely necessary, but will save you trouble when installing some packages)
- installing the required packages via `pip`
This repo contains a `requirements.txt` file with a list of all the packages and dependencies you will need.
Before you can use plotly in Jupyter Lab, you have to install Node.js (if you haven't already).
- Check your Node version by running the following command:

  ```
  node -v
  ```

  If you haven't installed it yet, begin at step 1. Otherwise, proceed to step 2.
**macOS**

- Step 1: Update Homebrew and install Node with the following commands:

  ```
  brew update
  brew install node
  ```

- Step 2: Create the virtual environment and install the required packages with the following commands:

  ```
  pyenv local 3.11.3
  python -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
**Windows**

- Step 1: Update Chocolatey and install Node with the following commands:

  ```
  choco upgrade chocolatey
  choco install nodejs
  ```

- Step 2: Create the virtual environment and install the required packages.

  For the PowerShell CLI:

  ```
  pyenv local 3.11.3
  python -m venv .venv
  .venv\Scripts\Activate.ps1
  python -m pip install --upgrade pip
  pip install -r requirements.txt
  ```

  For the Git Bash CLI:

  ```
  pyenv local 3.11.3
  python -m venv .venv
  source .venv/Scripts/activate
  python -m pip install --upgrade pip
  pip install -r requirements.txt
  ```