This project implements a log analytics pipeline built on a combination of Big Data technologies: Kafka, Spark Streaming, Parquet, and HDFS. The pipeline processes and analyzes log files through a flow that starts with a Kafka producer, continues with a Kafka consumer that performs transformations in Spark, and ends with the transformed data being written in Parquet format to HDFS.
- `producer.py` - script to read a log file and inject its data into a Kafka topic line by line (a sketch follows this list).

  Arguments:

  - `--bootstrap-servers`: Kafka bootstrap servers. Default is `localhost:9092`.
  - `--topic`: Kafka topic to publish the data to. Default is `kafka_test`.
  - `--file`: Path to the log file to be injected into Kafka. Default is `40MBFile.log`.
  - `--limit`: Limit the number of log records to be injected. Default is `-1`, which injects all lines.
  - `--reset`: Clean up the Kafka topic before producing new messages. Default is `False`.
- `consumer.py` - script to read & process Kafka streaming data (a sketch follows this list).

  Arguments:

  - `--bootstrap-servers`: Kafka bootstrap servers. Default is `localhost:9092`.
  - `--topic`: Kafka topic to listen to. Default is `kafka_test`.
  - `--reset`: Clean up the previous Spark checkpoint and output before consuming new data. Default is `False`.
- `transformation.py` - contains helper functions for wrangling Spark streaming DataFrames (a sketch follows this list).
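
The producer is essentially a line-by-line publish loop. The sketch below is a rough approximation using the `kafka-python` client; the actual `producer.py` may use a different Kafka client, and its `--reset` handling is omitted here.

```python
# Approximate shape of producer.py, assuming the kafka-python client.
# The real script may differ; --reset handling is omitted.
import argparse
from kafka import KafkaProducer

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--bootstrap-servers", default="localhost:9092")
    parser.add_argument("--topic", default="kafka_test")
    parser.add_argument("--file", default="40MBFile.log")
    parser.add_argument("--limit", type=int, default=-1)
    args = parser.parse_args()

    producer = KafkaProducer(bootstrap_servers=args.bootstrap_servers)
    with open(args.file, "r", errors="replace") as f:
        for i, line in enumerate(f):
            if args.limit != -1 and i >= args.limit:
                break
            # Publish each log line as a UTF-8 encoded message value.
            producer.send(args.topic, value=line.rstrip("\n").encode("utf-8"))
    producer.flush()

if __name__ == "__main__":
    main()
```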
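
The consumer boils down to a Spark Structured Streaming job that reads from Kafka, transforms the records, and writes Parquet to HDFS. The sketch below is not the actual `consumer.py`: the HDFS paths are assumptions, and the real script also applies the helpers from `transformation.py` and handles `--reset` and argument parsing.

```python
# Minimal sketch of the consumer, not the actual consumer.py.
# The HDFS paths below are assumptions; adjust them to your cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("log_analytics_consumer").getOrCreate()

# Subscribe to the Kafka topic as a streaming source.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "log_analytics")
       .option("startingOffsets", "earliest")
       .load())

# Kafka delivers the payload as bytes; cast it to a string column.
logs = raw.select(col("value").cast("string").alias("log_line"))

# In the real pipeline, the transformation.py helpers would be applied here.

# Persist the stream as Parquet files on HDFS, with a checkpoint for recovery.
query = (logs.writeStream
         .format("parquet")
         .option("path", "hdfs://localhost:9000/log_analytics/output")
         .option("checkpointLocation", "hdfs://localhost:9000/log_analytics/checkpoint")
         .start())

query.awaitTermination()
```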
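
The exact wrangling done in `transformation.py` depends on the log format, but a typical helper takes a streaming DataFrame of raw lines and returns structured columns. The function name, field names, and regexes below are illustrative assumptions, not the project's actual parsing rules.

```python
# Illustrative helper only; field names and regexes are assumptions.
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, regexp_extract

def parse_log_lines(df: DataFrame) -> DataFrame:
    """Split a raw `log_line` string column into structured fields."""
    return df.select(
        regexp_extract(col("log_line"), r"^(\S+)", 1).alias("host"),
        regexp_extract(col("log_line"), r"\[(.*?)\]", 1).alias("timestamp"),
        col("log_line"),
    )
```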
- Spin up HDFS:

  ```bash
  hdfs namenode -format
  $HADOOP_HOME/sbin/start-all.sh
  ```

- Inject all lines from the input file into Kafka:

  ```bash
  python producer.py --topic log_analytics --file 40MBFile.log --limit -1
  ```

- Consume the streaming data from Kafka via Spark Streaming:

  ```bash
  spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 consumer.py --topic log_analytics
  ```
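
Once the streaming query has committed a few batches, the Parquet output can be sanity-checked directly from HDFS. The path below matches the assumed output location used in the consumer sketch above; substitute whatever path `consumer.py` actually writes to.

```python
# Quick sanity check of the pipeline output; the HDFS path is an assumption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect_output").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/log_analytics/output")
df.printSchema()
df.show(10, truncate=False)
```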