The goal of this project is to build a mini data pipeline for real estate price prediction in Oman. The pipeline scrapes property listings from two local websites, cleans and integrates the data, engineers useful features, and frames a predictive modeling problem around property prices.
Two real estate platforms were used:
- Dubizzle Oman: scraped using BeautifulSoup
- Tibiaan: scraped using Selenium
This project was a hands-on learning experience in web scraping, using two different techniques to deal with static and dynamic content.
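As an illustration of the static-content case, here is a minimal sketch of parsing listing cards with BeautifulSoup. The HTML snippet, the CSS class names (`listing`, `title`, `price`), and the `parse_listings` helper are all hypothetical; the real sites use their own markup, so the selectors would need to be adapted after inspecting the pages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with made-up values; not taken from either website.
SAMPLE_HTML = """
<div class="listing">
  <h2 class="title"> 2BR Apartment in Muscat </h2>
  <span class="price">45,000 OMR</span>
</div>
<div class="listing">
  <h2 class="title">Villa in Salalah</h2>
  <span class="price">120,000 OMR</span>
</div>
"""


def parse_listings(html):
    """Extract title and numeric price from each listing card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.listing"):
        title = card.select_one(".title").get_text(strip=True)
        price_text = card.select_one(".price").get_text(strip=True)
        # Drop the currency label and thousands separators before converting.
        price = float(price_text.replace("OMR", "").replace(",", "").strip())
        rows.append({"title": title, "price_omr": price})
    return rows


listings = parse_listings(SAMPLE_HTML)
```

For a dynamic page, the same parsing logic could in principle be applied to the HTML that Selenium returns once the browser has rendered the content.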
- Explore: Analyzed both websites to understand their structure and potential fields to extract.
- Plan: Created an Excel sheet listing possible data fields and identified the overlap between both platforms.
- Scrape: Fetched data using Python scripts, then cleaned:
  - Removed duplicates
  - Filled missing values using mean, median, or mode depending on context
  - Trimmed extra spaces to improve matching and consistency
- Merge: Cleaned the datasets individually, then integrated them for modeling.
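The cleaning and merging steps above can be sketched with pandas roughly as follows. The column names, the sample rows, and the `clean` helper are assumptions made for illustration, not the project's actual schema. The sketch trims whitespace before dropping duplicates so that rows differing only by stray spaces are treated as the same listing.

```python
import pandas as pd

# Made-up rows standing in for the two scraped datasets.
dubizzle = pd.DataFrame({
    "title": [" Villa in Muscat ", "Villa in Muscat", "Flat in Sohar"],
    "bedrooms": [4, 4, None],
    "city": ["Muscat", "Muscat", None],
    "price_omr": [120000.0, 120000.0, 35000.0],
})
tibiaan = pd.DataFrame({
    "title": ["Apartment in Muscat ", "House in Nizwa"],
    "bedrooms": [2, 3],
    "city": ["Muscat", "Nizwa"],
    "price_omr": [45000.0, 60000.0],
})


def clean(df):
    """Trim text, drop duplicate rows, and fill missing values."""
    df = df.copy()
    # Trim extra spaces in the text columns so equal values actually match.
    for col in ["title", "city"]:
        df[col] = df[col].str.strip()
    df = df.drop_duplicates()
    # Numeric column: fill with the median; categorical column: fill with the mode.
    df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())
    df["city"] = df["city"].fillna(df["city"].mode().iloc[0])
    return df


# Clean each source separately, tag its origin, then stack into one table.
combined = pd.concat(
    [clean(dubizzle).assign(source="dubizzle"),
     clean(tibiaan).assign(source="tibiaan")],
    ignore_index=True,
)
```

Whether mean, median, or mode is the better fill depends on the column: the median is less sensitive to outliers such as luxury villas, while the mode is the natural choice for categorical fields.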
- Understanding the data: Identified important columns and assessed their value.
- New features: Created new columns and converted types where needed.
- Scaling: Applied a Box-Cox transformation to numerical features to reduce skew and bring their distributions closer to normal.
- Encoding: Applied OneHotEncoder to handle categorical features for modeling.
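A minimal sketch of the feature engineering, Box-Cox, and one-hot encoding steps using scikit-learn is shown below. The columns, the sample values, and the derived `area_per_bedroom` feature are hypothetical. Note that Box-Cox only accepts strictly positive values, so columns containing zeros or negatives would need a different transform (for example Yeo-Johnson).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# Made-up listings; column names are assumptions for this example.
df = pd.DataFrame({
    "area_sqm": [80.0, 120.0, 250.0, 400.0, 95.0],
    "bedrooms": [1, 2, 4, 5, 2],
    "city": ["Muscat", "Muscat", "Salalah", "Nizwa", "Sohar"],
})

# Example of a new feature derived from existing columns.
df["area_per_bedroom"] = df["area_sqm"] / df["bedrooms"]

numeric_cols = ["area_sqm", "area_per_bedroom"]
categorical_cols = ["city"]

preprocess = ColumnTransformer(
    transformers=[
        # Box-Cox power transform; by default the output is also standardized.
        ("num", PowerTransformer(method="box-cox"), numeric_cols),
        # One binary column per category; unseen categories encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
```

Wrapping both steps in a `ColumnTransformer` keeps the fitted parameters together, so the same transformation can later be applied to new listings with `preprocess.transform`.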
- Python
- Pandas & NumPy
- BeautifulSoup
- Selenium
- Scikit-learn
This project highlights my ability to go from raw web data to a clean, structured dataset ready for modeling, combining web scraping, data preprocessing, and feature engineering in one pipeline.