PyroCluster is a powerful, scalable, and flexible real-time analytics platform built entirely on a modern open-source data stack. It's designed to ingest, process, and visualize massive streams of data with very low latency, providing immediate insights on a live dashboard.
This project serves as a reference architecture for integrating best-in-class technologies to solve complex big data challenges.
The cluster is built on a "best-of-breed" philosophy, where each component is chosen for its specific strengths in the data lifecycle.
- Foundation:
- Apache Hadoop (HDFS): Provides a robust, distributed data lake for long-term storage and batch processing.
- Apache YARN: Acts as the cluster's brain, efficiently managing resources for all processing jobs.
- Central Message Bus:
- Apache Kafka: Serves as the high-throughput, fault-tolerant central nervous system, decoupling data ingestion from data processing.
- Data Ingestion: (Choose the right tool for the source)
- Apache NiFi: For visual, flow-based programming to manage complex data routing and transformation from diverse sources.
- Apache Flume: For high-volume, reliable log data collection and aggregation into HDFS or Kafka.
- Logstash: For advanced parsing and enrichment of logs and machine-generated data.
- Real-Time Processing:
- Apache Spark: The core processing engine. Using Spark Streaming, it performs real-time analytics, aggregations, and transformations on data from Kafka.
- Visualization & Dashboarding:
- Apache Superset: A modern BI tool used to create interactive, real-time dashboards by directly querying the processed data from Spark.
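To make the processing step concrete, here is a minimal plain-Python sketch of the kind of tumbling-window aggregation a Spark Streaming job would run at scale over a Kafka topic. The event shapes and the 10-second window are illustrative assumptions, not values from this repository:

```python
from collections import Counter, defaultdict

def tumbling_window_counts(events, window_seconds=10):
    """Group (timestamp, key) events into fixed windows and count keys per window.

    This mimics, in miniature, the windowed aggregation that the Spark
    Streaming job performs continuously on data arriving from Kafka.
    """
    windows = defaultdict(Counter)
    for ts, key in events:
        # Assign each event to the window containing its timestamp.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return dict(windows)

# Illustrative click-stream events: (unix_timestamp, page)
events = [(100, "home"), (103, "cart"), (109, "home"), (112, "home")]
print(tumbling_window_counts(events))
# window 100 holds counts for 'home' and 'cart'; window 110 holds the last event
```

In the real cluster the same logic is expressed with Spark's windowing API and the results are what Superset queries for the live dashboard.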
The architecture is designed around a central Kafka bus, which enables a clean separation of concerns and massive scalability.
Data Flow:
Data Sources → Ingestion (NiFi / Flume / Logstash) → Apache Kafka (Message Bus) → Apache Spark (Real-Time Processing on YARN) → Apache Superset (Live Dashboard)
```
                                               ┌──────────────────┐
                                               │ Apache Superset  │
                                               │   (Dashboard)    │
                                               └────────▲─────────┘
                                                        │ (JDBC/ODBC)
┌──────────┐   ┌───────────────────────────┐   ┌────────┴─────────┐
│   Data   │──▶│      Ingestion Layer      │──▶│ Processing Layer │
│ Sources  │   │ (NiFi / Flume / Logstash) │   │ (on YARN Cluster)│
└──────────┘   └─────────────┬─────────────┘   └────────┬─────────┘
                             │                          │
                             ▼                          ▼
                     ┌──────────────┐           ┌──────────────┐
                     │ Apache Kafka │──────────▶│ Apache Spark │
                     │  (Data Bus)  │           │ (Streaming)  │
                     └──────┬───────┘           └──────────────┘
                            │
                            │ (Cold Storage for Batch Analytics)
                            ▼
                     ┌──────────────┐
                     │ Hadoop HDFS  │
                     └──────────────┘
```
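As a toy illustration of the decoupling the Kafka bus provides, the sketch below uses an in-process `queue.Queue` to stand in for a Kafka topic: the producer and consumer never call each other directly, so either side can be replaced or scaled independently. All names here are illustrative; the real system uses Kafka topics and consumer groups.

```python
import queue
import threading

bus = queue.Queue()   # stands in for a Kafka topic
SENTINEL = None       # end-of-stream marker, for this demo only

def producer():
    # Ingestion side: publishes events without knowing who consumes them.
    for event in ["login", "click", "purchase"]:
        bus.put(event)
    bus.put(SENTINEL)

consumed = []

def consumer():
    # Processing side: reads from the bus without knowing the producer.
    while True:
        event = bus.get()
        if event is SENTINEL:
            break
        consumed.append(event)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # → ['login', 'click', 'purchase']
```

Unlike this in-memory queue, Kafka also persists the stream, so multiple independent consumers (Spark, an HDFS sink) can each read it at their own pace.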
- Real-Time Analytics: Process and analyze data streams with end-to-end latency measured in seconds.
- Scalable: Built on technologies designed for horizontal scaling, capable of handling petabytes of data.
- Decoupled: The Kafka-centric design allows components to be independently managed, updated, and scaled.
- Flexible Ingestion: Use a variety of tools to collect data from any source, including logs, APIs, IoT devices, and databases.
- Unified Batch & Real-Time: The architecture supports both real-time streaming (hot path) and deep batch analytics (cold path) on HDFS.
- Interactive Visualization: Go beyond simple reports with drill-down, interactive dashboards in Apache Superset.
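The hot/cold split can be sketched as a simple fan-out: every event feeds both an incrementally updated real-time aggregate (the hot path, what Spark Streaming maintains) and an append-only archive of raw events (the cold path, what lands in HDFS for batch jobs). The event shapes below are illustrative only:

```python
from collections import Counter

hot_counts = Counter()   # real-time view, updated per event
cold_store = []          # append-only raw archive for batch analytics

def handle(event):
    """Fan one event out to both the hot and the cold path."""
    hot_counts[event["type"]] += 1   # hot path: cheap incremental aggregate
    cold_store.append(event)         # cold path: keep the full raw event

for e in [{"type": "click"}, {"type": "view"}, {"type": "click"}]:
    handle(e)

print(hot_counts["click"], len(cold_store))  # → 2 3
```

In the cluster, Kafka performs this fan-out naturally: Spark and the HDFS sink are just two independent consumers of the same topic.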
This repository contains the configuration and reference architecture for building the PyroCluster.
(This section would be populated with specific setup instructions.)
- A multi-node server environment (Cloud or On-Premise)
- Docker and Docker Compose (for containerized deployment)
- Ansible or similar configuration management tool (recommended for production)
- Clone the repository:

  ```shell
  git clone https://github.com/chmj/pyrocluster.git
  cd pyrocluster
  ```
- Configure Environment Variables:
  - Set the necessary hostnames, IP addresses, and service configurations in the provided `.env` file.
- Deploy the Cluster:
  - Follow the detailed guides in the `/docs` directory to deploy Hadoop, Kafka, Spark, and other services.
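For orientation, a `.env` for this kind of stack typically contains entries like the following. All hostnames, ports, and variable names below are illustrative assumptions, not values shipped with this repository:

```shell
# Illustrative example only -- adjust names and addresses to your environment
KAFKA_BROKERS=kafka1.internal:9092,kafka2.internal:9092
HDFS_NAMENODE=hdfs://namenode.internal:8020
YARN_RESOURCEMANAGER=rm.internal:8032
SUPERSET_PORT=8088
```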
Contributions are welcome! Please feel free to submit a pull request or open an issue for bugs, feature requests, or improvements to the architecture.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
- Charles Majola
- GitHub: @chmj