Datahive is a configuration-driven, end-to-end data pipeline solution that simplifies the management of data workflows. Built on Kafka, Hadoop, Apache Spark, Elasticsearch, and Kibana, and paired with an intuitive UI, Datahive lets users manage and monitor their data stacks with ease.
- 🚀 Streamlined data pipeline setup
- ☕ Automated data processing while you enjoy your coffee
- 📊 Utilizes Kafka, Hadoop, Apache Spark, Elasticsearch, and Kibana
- 🛠️ Easy configuration through YAML files
- 🔄 Supports both stream and batch processing
Define your data pipeline effortlessly using a simple YAML configuration file. Specify input and output schemas for each service, and let Datahive handle the rest. Below is a sample configuration for stream processing:
```yaml
type: stream
kafka:
  - inTopic: <your-topic-name>
    outTopic: <your-topic-name>
    hdfs: false
    transform: |
      def transform(record) {
        def jsonObject = record
        // do your transformation logic in a Groovy script
        return jsonObject
      }
  - inTopic: <your-topic-name>
    hdfsFileName: <your-hdfs-filename>
    hdfs: true
spark:
  - app-resource: <path-for-your-spark-build-file>
    driver-memory: 1g
    executor-memory: 2g
  - app-resource: <path-for-your-second-spark-build-file>
    driver-memory: 1g
    executor-memory: 2g
    res-location: <path-for-the-spark-job-code>
    main-class: <main-class-of-your-spark-job>
    job-name: <name-of-your-job>
elasticsearch:
  -
kibana:
  dashboard-config:
```
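Batch processing uses the same structure. The fragment below is a minimal sketch, assuming that switching the `type` field to `batch` is the main change; every bracketed value is a placeholder to be filled in for your environment:

```yaml
type: batch
kafka:
  - inTopic: <your-topic-name>
    hdfsFileName: <your-hdfs-filename>
    hdfs: true
spark:
  - app-resource: <path-for-your-spark-build-file>
    driver-memory: 1g
    executor-memory: 2g
    main-class: <main-class-of-your-spark-job>
    job-name: <name-of-your-job>
```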
- Clone the repository.
- Install the required dependencies.
- Configure Datahive using the provided YAML files.
- Run the application.
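The steps above can be sketched as shell commands. This is a hypothetical walkthrough; the repository URL, build tool, and entry point are placeholders, so adjust them to your setup:

```shell
# Placeholder walkthrough; substitute your actual repository URL
git clone <repository-url>
cd datahive

# Install the required dependencies with your project's build tool
# (e.g. Maven or Gradle for the JVM components)

# Edit the provided YAML files to describe your pipeline,
# then start the application per the project's run instructions.
```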
Screenshots: Home Page, Features, Highlights, Login Page, Dashboard, WorkerStats, Datahive Stack Stats, Alerts.