Adityavasudev2006/DisPred

DisPred : Real-Time Disaster Prediction System

Python · Apache Spark · Apache Kafka · PyTorch · Flask

📖 Executive Summary

This project implements a real-time, distributed disaster prediction pipeline modeled on production-style streaming data architectures.

Leveraging Apache Kafka, Spark Structured Streaming, and Deep Learning (UNet & BERT), the system processes multi-modal data streams (satellite imagery, live weather data, and social media feeds) to generate real-time risk assessments and flood segmentation masks. The results are visualized on a live, low-latency dashboard.

🏗 System Architecture

The system is designed with scalability, modularity, and fault tolerance in mind. It uses a decoupled microservices approach where analysis.py acts as the primary orchestrator.

End-to-End Data Flow

  1. Ingestion:
    • Satellite images are streamed via TCP.
    • Twitter and weather data are fetched through simulated APIs.
  2. Processing (Spark Structured Streaming):
    • Vision: A UNet Deep Learning model performs semantic segmentation on flood images.
    • NLP: A BERT-based model analyzes sentiment and urgency in social media text.
  3. Messaging Backbone:
    • Results are serialized and published to Apache Kafka topics (flood_topic, weather_topic, twitter_topic).
  4. Consumption & Visualization:
    • A Flask backend consumes Kafka messages.
    • Data is pushed to the frontend via WebSockets (SocketIO) for real-time updates.
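
The publishing step (step 3 above) can be sketched as follows. This is a minimal illustration, not the code in analysis.py: the JSON encoding, the `serialize_result` and `publish` helpers, and the payload fields are assumptions. Only the three topic names come from this README.

```python
import json
from typing import Any, Dict

# Topic names as listed in this README.
TOPICS = ("flood_topic", "weather_topic", "twitter_topic")


def serialize_result(payload: Dict[str, Any]) -> bytes:
    """Encode a result dict as UTF-8 JSON bytes (assumed wire format)."""
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def publish(producer: Any, topic: str, payload: Dict[str, Any]) -> None:
    """Send one result to a known topic via any object exposing send(topic, value)."""
    if topic not in TOPICS:
        raise ValueError(f"unknown topic: {topic}")
    producer.send(topic, serialize_result(payload))
```

With kafka-python, a `KafkaProducer(bootstrap_servers="localhost:9092")` instance can be passed as `producer`, since its `send(topic, value)` method accepts bytes. The broker address here is the common Kafka default and is an assumption about this setup.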

🛠 Tech Stack

| Component | Technology | Description |
| --- | --- | --- |
| Orchestration | Python | System bootstrapping and logic control. |
| Streaming | Apache Kafka | Distributed event streaming and message decoupling. |
| Processing | Spark Streaming | Real-time distributed data processing. |
| ML/AI | PyTorch | UNet for Image Segmentation; Transformers (BERT) for NLP. |
| Backend | Flask | REST API and WebSocket server. |
| Frontend | HTML/JS | Real-time dashboard using SocketIO. |
| Coordination | ZooKeeper | Kafka state management. |

🚀 Installation & Setup

1. Environment Setup

Create a clean Conda environment to manage dependencies.

conda create -n disaster_detect python=3.10
conda activate disaster_detect

Install the required Python packages:

pip install pandas torch transformers opencv-python numpy pyspark flask flask-socketio kafka-python requests

2. Directory Structure & Data

Note: Large dataset files are excluded from the repository. You must create the directory structure manually.

Run the following commands from the project root:

mkdir -p data/flood_dataset/images
mkdir -p data/flood_dataset/masks
mkdir -p data/simulated_stream
mkdir -p data/twitter_data

Action Required:

  1. Place your flood .jpg images in data/flood_dataset/images/.
  2. Place your corresponding .png masks in data/flood_dataset/masks/.
  3. Ensure filenames match (e.g., 1.jpg corresponds to 1.png).
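
To check the filename pairing before training, a small script along these lines may help. It is a sketch and not part of the repository: `check_pairs` is a hypothetical helper, and it assumes lowercase `.jpg` and `.png` extensions in the two directories named above.

```python
from pathlib import Path
from typing import List, Tuple


def check_pairs(images_dir: str, masks_dir: str) -> Tuple[List[str], List[str]]:
    """Return (images without a mask, masks without an image), compared by file stem."""
    image_stems = {p.stem for p in Path(images_dir).glob("*.jpg")}
    mask_stems = {p.stem for p in Path(masks_dir).glob("*.png")}
    return sorted(image_stems - mask_stems), sorted(mask_stems - image_stems)


if __name__ == "__main__":
    missing_masks, missing_images = check_pairs(
        "data/flood_dataset/images", "data/flood_dataset/masks"
    )
    print("images without masks:", missing_masks)
    print("masks without images:", missing_images)
```

Run it from the project root. Two empty lists mean every image has a matching mask and vice versa.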

🧠 Model Training

Before running the pipeline, you must train the Deep Learning models.

A. Train Flood Detection (UNet)

This script trains the vision model for 15 epochs and saves the weights.

cd ml_models/flood_detection
python train.py

Output: ml_models/flood_detection/flood_unet_cpu.pth

B. Train Sentiment Analysis (BERT)

Fine-tune the NLP model using the provided Jupyter Notebook.

cd twitter
jupyter notebook flood_twitter_data_train.ipynb

Run all cells to generate model.safetensors and config files in ml_models/twitter_model/.

⚡ How to Run the Pipeline

The system requires 3 separate terminal windows running concurrently to simulate the distributed environment.

TERMINAL 1: Image Receiver (TCP Server)

Starts the TCP server on port 5001, which listens for incoming satellite images.

cd satellite
python get_stream.py

Status: You should see "Waiting for sender..."

TERMINAL 2: Main Orchestrator

This is the core script. It automatically bootstraps ZooKeeper, Kafka, Spark Streams, and the Flask Dashboard.

cd spark_jobs
python analysis.py

Status: Kafka and Spark logs will appear. The dashboard will go live at http://localhost:5000.
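
Because the orchestrator starts several services, the dashboard may take a while to come up. A readiness check like the one below can confirm that port 5000 is accepting connections before you move on. `wait_for_port` is a hypothetical helper, not part of the repository, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and try again.
            time.sleep(0.5)
    return False
```

For example, `wait_for_port("localhost", 5000)` returns `True` once the Flask server is listening. Avoid probing port 5001 this way: the image receiver is waiting for a sender, and a probe connection could be mistaken for one.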

TERMINAL 3: Image Sender (TCP Client)

Once the server (Terminal 1) and Pipeline (Terminal 2) are running, start streaming the data.

cd satellite
python send_stream.py

Action: This reads images from your dataset and streams them to the processing engine, one every 3 seconds.
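
TCP is a byte stream with no message boundaries, so any sender and receiver exchanging images must agree on where each image ends. One common approach is a length prefix, sketched below. This illustrates the general technique only: the actual protocol used by send_stream.py and get_stream.py is not described in this README, and `send_frame`, `recv_exact`, and `recv_frame` are hypothetical names.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Send one message as a length header followed by the payload bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, raising if the peer closes the connection early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full frame arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

A sender would call `send_frame(sock, image_bytes)` once per image, and the receiver would call `recv_frame(sock)` in a loop. The 4-byte header limits a single frame to just under 4 GiB, which is ample for individual images.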

📊 Viewing Results

Open your web browser and navigate to:

http://localhost:5000

The dashboard displays:

  • Live Flood Segmentation: Real-time overlay of flood risk on satellite imagery.
  • Disaster Analytics: Aggregated metrics from Twitter sentiment and weather alerts.
  • System Health: Latency and processing status.
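
The path from Kafka to the dashboard can be pictured as a small routing step: decode each message and emit it to the browser under an event name. The sketch below assumes JSON-encoded message values. The event names and the `forward` helper are hypothetical and do not come from the repository; only the topic names appear in this README.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical mapping from the README's Kafka topics to dashboard event names.
TOPIC_TO_EVENT: Dict[str, str] = {
    "flood_topic": "flood_update",
    "weather_topic": "weather_update",
    "twitter_topic": "twitter_update",
}


def forward(topic: str, raw_value: bytes, emit: Callable[[str, Any], None]) -> bool:
    """Decode one Kafka message value and emit it under the matching event name.

    Returns False (and emits nothing) for unknown topics or undecodable payloads.
    """
    event = TOPIC_TO_EVENT.get(topic)
    if event is None:
        return False
    try:
        data = json.loads(raw_value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    emit(event, data)
    return True
```

With Flask-SocketIO, `socketio.emit` can serve as the `emit` callback, and each record yielded by a kafka-python `KafkaConsumer` provides the `topic` and `value` attributes to pass in. Keeping the routing separate from both libraries makes it easy to test without a running broker.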

📂 Project Structure

Disaster_Prediction/
├── data/                        # (Created manually)
│   ├── flood_dataset/           # Training Data
│   └── simulated_stream/        # Live incoming buffer
├── kafka/                       # Kafka binaries & config
├── ml_models/                   # Model artifacts
│   ├── flood_detection/         # UNet training scripts & weights
│   └── twitter_model/           # BERT config & weights
├── satellite/                   # TCP Stream simulation (Client/Server)
├── spark_jobs/                  # Core processing logic
│   ├── analysis.py              # SYSTEM ENTRY POINT
│   ├── streaming.py             # Spark Image Stream
│   ├── twitter_stream.py        # Spark NLP Stream
│   ├── weather_stream.py        # Spark Weather Stream
│   └── frontend/                # Dashboard UI (HTML/CSS/JS)
├── twitter/                     # NLP Training Notebooks
└── README.md

⚖️ Engineering Impact

This project demonstrates capabilities in:

  • Distributed Systems Design: Handling asynchronous data streams via Kafka.
  • Real-Time Inference: Deploying PyTorch models within a Spark Streaming context.
  • Fault Tolerance: Utilizing Kafka buffering and Spark checkpointing.
  • Full-Stack Data Engineering: Managing the flow from raw binary TCP streams to WebSocket-based visualization.

About

Real-time disaster prediction system using Kafka and Spark Structured Streaming with ML-based flood detection (UNet), weather risk analysis, and Twitter sentiment classification. Event-driven, scalable architecture with live dashboard.
