chmj/PyroCluster

PyroCluster: Real-Time Big Data Analytics Platform

PyroCluster is a scalable real-time analytics platform built entirely on an open-source data stack. It is designed to ingest, process, and visualize high-volume data streams with latency on the order of seconds, and to surface the results on a live dashboard.

This project serves as a reference architecture for integrating best-in-class technologies to solve complex big data challenges.


🚀 Core Components

The cluster is built on a "best-of-breed" philosophy, where each component is chosen for its specific strengths in the data lifecycle.

  • Foundation:
    • Apache Hadoop (HDFS): Provides a robust, distributed data lake for long-term storage and batch processing.
    • Apache YARN: Acts as the cluster's brain, efficiently managing resources for all processing jobs.
  • Central Message Bus:
    • Apache Kafka: Serves as the high-throughput, fault-tolerant central nervous system, decoupling data ingestion from data processing.
  • Data Ingestion: (Choose the right tool for the source)
    • Apache NiFi: For visual, flow-based programming to manage complex data routing and transformation from diverse sources.
    • Apache Flume: For high-volume, reliable log data collection and aggregation into HDFS or Kafka.
    • Logstash: For advanced parsing and enrichment of logs and machine-generated data.
  • Real-Time Processing:
    • Apache Spark: The core processing engine. Using Spark Streaming, it performs real-time analytics, aggregations, and transformations on data from Kafka.
  • Visualization & Dashboarding:
    • Apache Superset: A modern BI tool used to create interactive, real-time dashboards by directly querying the processed data from Spark.
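To make the Kafka-to-Spark step more concrete, here is a minimal PySpark sketch that counts events per type in 10-second windows. It is not taken from this repository: it assumes the Structured Streaming API, and the broker address, topic name, JSON schema, window size, and function names are all hypothetical placeholders.

```python
"""Sketch: windowed event counts from a Kafka topic (all names are assumptions)."""


def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's Kafka source; the values passed in are placeholders."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def build_windowed_counts(spark, options: dict):
    """Return a streaming DataFrame of per-window, per-type event counts."""
    # Imported lazily so the helpers above can be used without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Hypothetical message schema: {"event_type": "...", "event_time": "..."}
    schema = StructType([
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .options(**options)
        .load()
        # Kafka delivers the payload as bytes in the `value` column.
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    return (
        events
        .withWatermark("event_time", "1 minute")  # bounds state kept for late data
        .groupBy(F.window("event_time", "10 seconds"), "event_type")
        .count()
    )


def run(bootstrap_servers: str = "localhost:9092", topic: str = "events") -> None:
    """Start the query and print updated counts to the console until stopped."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyrocluster-sketch").getOrCreate()
    counts = build_windowed_counts(spark, kafka_source_options(bootstrap_servers, topic))
    query = counts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()
```

Calling `run()` from a script submitted with `spark-submit --master yarn` would start the job on the cluster, provided the `spark-sql-kafka-0-10` connector package matching the installed Spark version is on the classpath. The console sink is only for inspection; a real deployment would write to a store that Superset can query.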

🏝️ System Architecture

The architecture is designed around a central Kafka bus, which enables a clean separation of concerns and massive scalability.

Data Flow: Data Sources ➡️ Ingestion (NiFi / Flume / Logstash) ➡️ Apache Kafka (Message Bus) ➡️ Apache Spark (Real-Time Processing on YARN) ➡️ Apache Superset (Live Dashboard)

                                                  ┌─────────────────┐
                                                  │ Apache Superset │
                                                  │   (Dashboard)   │
                                                  └────────┬────────┘
                                                           │ (JDBC/ODBC)
┌─────────┐   ┌───────────────────────────┐   ┌────────────┴─────────────┐
│         │   │      Ingestion Layer      │   │     Processing Layer     │
│  Data   ├──▶│ (NiFi / Flume / Logstash) ├──▶│    (on YARN Cluster)     │
│ Sources │   │                           │   │                          │
└─────────┘   └─────────────┬─────────────┘   └────────────┬─────────────┘
                            │                              │
                            ▼                              ▼
                    ┌───────────────┐              ┌───────────────┐
                    │ Apache Kafka  │              │ Apache Spark  │
                    │  (Data Bus)   ├─────────────▶│  (Streaming)  │
                    └───────┬───────┘              └───────────────┘
                            │
                            │ (Cold Storage for Batch Analytics)
                            ▼
                    ┌───────────────┐
                    │  Hadoop HDFS  │
                    └───────────────┘
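
The separation of concerns in the diagram comes from the bus keeping an append-only log that each consumer reads at its own offset. The toy model below illustrates that idea in plain Python. It is not the Kafka client API, and the class and consumer names (`ToyBus`, `spark-streaming`, `hdfs-archiver`) are made up for the example.

```python
from collections import Counter


class ToyBus:
    """Append-only log standing in for a single Kafka topic partition."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, record: dict) -> None:
        self._log.append(record)

    def poll(self, consumer: str) -> list:
        """Return records this consumer has not seen yet and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return batch


bus = ToyBus()

# Ingestion layer: any collector can publish without knowing who reads.
for source, event_type in [("nifi", "click"), ("flume", "error"), ("logstash", "click")]:
    bus.publish({"source": source, "event_type": event_type})

# Hot path: the streaming job aggregates records for the dashboard.
hot_counts = Counter(r["event_type"] for r in bus.poll("spark-streaming"))

# Cold path: an archiver reads the same records independently, at its own pace.
cold_archive = bus.poll("hdfs-archiver")

print(hot_counts)         # Counter({'click': 2, 'error': 1})
print(len(cold_archive))  # 3
```

Because each consumer tracks its own position, adding or restarting the archiver does not affect the streaming job, which is the property the Kafka-centric design relies on.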

✨ Features

  • Real-Time Analytics: Process and analyze data streams with latency in seconds.
  • Scalable: Built on technologies designed for horizontal scaling, capable of handling petabytes of data.
  • Decoupled: The Kafka-centric design allows components to be independently managed, updated, and scaled.
  • Flexible Ingestion: Use a variety of tools to collect data from any source, including logs, APIs, IoT devices, and databases.
  • Unified Batch & Real-Time: The architecture supports both real-time streaming (hot path) and deep batch analytics (cold path) on HDFS.
  • Interactive Visualization: Go beyond simple reports with drill-down, interactive dashboards in Apache Superset.
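
As a worked example of what "latency in seconds" can mean for an aggregation, the sketch below assigns events to fixed 10-second tumbling windows and counts them per window. The window size and the sample events are assumptions chosen for illustration; a streaming engine applies the same bucketing idea continuously rather than on a finished list.

```python
from collections import Counter

WINDOW_SECONDS = 10  # assumed window size, not taken from the project


def window_start(epoch_seconds: int, size: int = WINDOW_SECONDS) -> int:
    """Map an event timestamp to the start of its tumbling window."""
    return epoch_seconds - (epoch_seconds % size)


def count_per_window(events):
    """Count (window_start, event_type) pairs for an iterable of (ts, type)."""
    counts = Counter()
    for ts, event_type in events:
        counts[(window_start(ts), event_type)] += 1
    return counts


events = [(100, "click"), (104, "click"), (109, "error"), (110, "click")]
result = count_per_window(events)
print(result[(100, "click")])  # 2
print(result[(110, "click")])  # 1
```

A window's final count is only known once the window has ended, so the window size is one of the contributors to how fresh the dashboard can be.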

🚀 Getting Started

This repository contains the configuration and reference architecture for building the PyroCluster.

(This section would be populated with specific setup instructions.)

Prerequisites

  • A multi-node server environment (Cloud or On-Premise)
  • Docker and Docker Compose (for containerized deployment)
  • Ansible or similar configuration management tool (recommended for production)

Installation

  1. Clone the repository:
    git clone https://github.com/chmj/pyrocluster.git
    cd pyrocluster
  2. Configure Environment Variables:
    • Set the necessary hostnames, IP addresses, and service configurations in the provided .env file.
  3. Deploy the Cluster:
    • Follow the detailed guides in the /docs directory to deploy Hadoop, Kafka, Spark, and other services.
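
Before deploying, it can help to check that the .env file defines everything the services need. The following is a small sketch of such a check. The key names (`KAFKA_BOOTSTRAP_SERVERS`, `HDFS_NAMENODE_HOST`, `SPARK_MASTER`) are hypothetical, since the actual variables depend on the .env file shipped with the repository.

```python
# Hypothetical required keys; replace with the names used in the real .env file.
REQUIRED_KEYS = ("KAFKA_BOOTSTRAP_SERVERS", "HDFS_NAMENODE_HOST", "SPARK_MASTER")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    return [k for k in required if not values.get(k)]


sample = """
# cluster endpoints (example values)
KAFKA_BOOTSTRAP_SERVERS=kafka-1:9092,kafka-2:9092
HDFS_NAMENODE_HOST="namenode.internal"
"""
config = parse_env(sample)
print(missing_keys(config))  # ['SPARK_MASTER']
```

In practice the text would come from reading the .env file in the repository root, and a non-empty result would be a reason to stop before starting any containers.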

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue for bugs, feature requests, or improvements to the architecture.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📜 License

Distributed under the MIT License. See LICENSE for more information.


Author

  • Charles Majola
  • GitHub: @chmj
