This project implements a real-time distributed disaster prediction pipeline designed to mimic production-grade data engineering architectures used in top-tier technology companies.
Leveraging Apache Kafka, Spark Structured Streaming, and Deep Learning (UNet & BERT), the system processes multi-modal data streamsβsatellite imagery, live weather data, and social media feedsβto generate real-time risk assessments and flood segmentation masks. The results are visualized on a live, low-latency dashboard.
The system is designed with scalability, modularity, and fault tolerance in mind. It uses a decoupled microservices approach where analysis.py acts as the primary orchestrator.
- Ingestion:
- Satellite images are streamed via TCP.
- Twitter and Weather data are fetched via API simulation.
- Processing (Spark Structured Streaming):
- Vision: A UNet Deep Learning model performs semantic segmentation on flood images.
- NLP: A BERT-based model analyzes sentiment and urgency in social media text.
- Messaging Backbone:
- Results are serialized and published to Apache Kafka topics (
flood_topic,weather_topic,twitter_topic).
- Results are serialized and published to Apache Kafka topics (
- Consumption & Visualization:
- A Flask backend consumes Kafka messages.
- Data is pushed to the frontend via WebSockets (SocketIO) for real-time updates.
| Component | Technology | Description |
|---|---|---|
| Orchestration | Python | System bootstrapping and logic control. |
| Streaming | Apache Kafka | Distributed event streaming and message decoupling. |
| Processing | Spark Streaming | Real-time distributed data processing. |
| ML/AI | PyTorch | UNet for Image Segmentation; Transformers (BERT) for NLP. |
| Backend | Flask | REST API and WebSocket server. |
| Frontend | HTML/JS | Real-time dashboard using SocketIO. |
| Coordination | ZooKeeper | Kafka state management. |
Create a clean Conda environment to manage dependencies.
conda create -n disaster_detect python=3.10
conda activate disaster_detectInstall the required Python packages:
pip install pandas torch transformers opencv-python numpy pyspark flask flask-socketio kafka-python requestsNote: Large dataset files are excluded from the repository. You must create the directory structure manually.
Run the following command from the project root:
mkdir -p data/flood_dataset/images
mkdir -p data/flood_dataset/masks
mkdir -p data/simulated_stream
mkdir -p data/twitter_dataAction Required:
- Place your flood
.jpgimages indata/flood_dataset/images/. - Place your corresponding
.pngmasks indata/flood_dataset/masks/. - Ensure filenames match (e.g.,
1.jpgcorresponds to1.png).
Before running the pipeline, you must train the Deep Learning models.
This script trains the vision model for 15 epochs and saves the weights.
cd ml_models/flood_detection
python train.pyOutput: ml_models/flood_detection/flood_unet_cpu.pth
Fine-tune the NLP model using the provided Jupyter Notebook.
cd twitter
jupyter notebook flood_twitter_data_train.ipynbRun all cells to generate model.safetensors and config files in ml_models/twitter_model/.
The system requires 3 separate terminal windows running concurrently to simulate the distributed environment.
Starts the TCP server on port 5001. This listens for incoming satellite images.
cd satellite
python get_stream.pyStatus: You should see "Waiting for sender..."
This is the core script. It automatically bootstraps ZooKeeper, Kafka, Spark Streams, and the Flask Dashboard.
cd spark_jobs
python analysis.pyStatus: Kafka and Spark logs will appear. The dashboard will go live at http://localhost:5000.
Once the server (Terminal 1) and Pipeline (Terminal 2) are running, start streaming the data.
cd satellite
python send_stream.pyAction: This reads images from your dataset and streams them every 3 seconds to the processing engine
Open your web browser and navigate to:
The dashboard displays:
- Live Flood Segmentation: Real-time overlay of flood risk on satellite imagery.
- Disaster Analytics: Aggregated metrics from Twitter sentiment and weather alerts.
- System Health: Latency and processing status.
Disaster_Prediction/
βββ data/ # (Created manually)
β βββ flood_dataset/ # Training Data
β βββ simulated_stream/ # Live incoming buffer
βββ kafka/ # Kafka binaries & config
βββ ml_models/ # Model artifacts
β βββ flood_detection/ # UNet training scripts & weights
β βββ twitter_model/ # BERT config & weights
βββ satellite/ # TCP Stream simulation (Client/Server)
βββ spark_jobs/ # Core processing logic
β βββ analysis.py # SYSTEM ENTRY POINT
β βββ streaming.py # Spark Image Stream
β βββ twitter_stream.py # Spark NLP Stream
β βββ weather_stream.py # Spark Weather Stream
β βββ frontend/ # Dashboard UI (HTML/CSS/JS)
βββ twitter/ # NLP Training Notebooks
βββ README.mdThis project demonstrates capabilities in:
- Distributed Systems Design: Handling asynchronous data streams via Kafka.
- Real-Time Inference: Deploying PyTorch models within a Spark Streaming context.
- Fault Tolerance: Utilizing Kafka buffering and Spark checkpointing.
- Full-Stack Data Engineering: Managing the flow from raw binary TCP streams to web-socket based visualization.
