Data_Science_Project

A small end‑to‑end data science project demonstrating:

  1. Data Collection via web scraping (BeautifulSoup, Scrapy, Selenium).
  2. Data Storage of raw and processed files in data/.
  3. Exploratory Data Analysis and visualization in a Jupyter notebook.

Table of Contents

  • Project Structure
  • Prerequisites
  • Installation
  • Usage
  • Requirements
  • Contributing
Project Structure

Data_Science_Project/
├── data/                   # Raw and processed datasets (CSV, JSON, etc.)
├── scraping/               # Standalone Python scripts (BeautifulSoup, Selenium)
├── spider/                 # Scrapy project and spider definitions
├── data_analysis.ipynb     # Jupyter notebook for EDA & visualization
└── README.md               # Project overview and instructions

Prerequisites

  • Python 3.8 or higher
  • Git (to clone this repository)

Installation

  1. Clone the repository

    git clone https://github.com/soupond/Data_Science_Project.git
    cd Data_Science_Project
  2. Create and activate a virtual environment (recommended)

    python3 -m venv venv
    source venv/bin/activate    # Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt

    If a requirements.txt is not present, install manually:

    pip install pandas numpy matplotlib jupyter scrapy beautifulsoup4 requests selenium
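Note that some pip package names differ from their import names (for example, beautifulsoup4 installs as bs4). As an optional sanity check, the snippet below reports which of the dependencies listed above are importable in the current environment:

```python
# Optional check: which of the project's dependencies are importable?
# Note the pip name "beautifulsoup4" imports as "bs4".
import importlib.util

packages = ["pandas", "numpy", "matplotlib", "jupyter",
            "scrapy", "bs4", "requests", "selenium"]
missing = [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]
print("missing:", missing or "none")
```

If anything is listed as missing, install it with pip before continuing.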

Usage

Ad‑hoc Python Scrapers

Standalone scripts using BeautifulSoup or Selenium are located in scraping/. To run one:

python scraping/bs4_scraper.py

The script will save output files under data/ (e.g., data/raw_listings.csv).
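The scripts follow a common fetch–parse–save pattern. A minimal sketch of that pattern is below; the URL, the h2.listing-title selector, and the title field are placeholders for illustration, not the real targets used by the scripts in scraping/:

```python
# Sketch of the fetch -> parse -> save pattern used by the scraping/ scripts.
# The selector "h2.listing-title" and the output filename are illustrative.
import csv
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def parse_listings(html):
    """Extract listing titles from an HTML page into a list of row dicts."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"title": tag.get_text(strip=True)}
            for tag in soup.select("h2.listing-title")]


def scrape_listings(url, out_path="data/raw_listings.csv"):
    """Fetch a page, parse it, and write the rows to a CSV under data/."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    rows = parse_listings(resp.text)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Separating parsing from fetching, as above, makes the extraction logic easy to test against saved HTML without hitting the network.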

Scrapy Spiders

The Scrapy project lives in the spider/ directory. To crawl and export data, list the available spiders and run one, replacing <spider_name> with a name reported by scrapy list:

    cd spider
    scrapy list
    scrapy crawl <spider_name> -o ../data/output.json

Exploratory Data Analysis

Launch Jupyter Notebook and open the analysis notebook:

jupyter notebook data_analysis.ipynb

Inside, you’ll find:

  • Data loading and cleaning steps
  • Descriptive statistics and data summaries
  • Visualizations (histograms, scatter plots, heatmaps)
  • Key insights and recommendations
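The notebook's load → clean → summarize → plot workflow can be sketched in a few lines of pandas and matplotlib. The DataFrame below uses made-up sample values and column names purely for illustration; the notebook itself reads the real files under data/:

```python
# Condensed sketch of the notebook workflow with illustrative sample data.
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Load (here: inline sample instead of pd.read_csv("data/...")) and clean.
df = pd.DataFrame({"price": [120, 95, 180, 150], "rooms": [2, 1, 3, 3]})
df = df.dropna()

# Descriptive statistics.
print(df.describe())

# A simple visualization, saved to disk.
fig, ax = plt.subplots()
ax.hist(df["price"], bins=4)
ax.set_xlabel("price")
fig.savefig("price_hist.png")
```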

Requirements

If present, the requirements.txt file lists all Python package dependencies for this project. Install them with:

pip install -r requirements.txt

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature/YourFeature).
  3. Make your changes and commit with a clear message.
  4. Push to your fork and open a Pull Request.

Please ensure:

  • Code follows PEP8 style guidelines.
  • Dependencies are updated in requirements.txt.
  • README is kept up to date with any new scripts or features.
