Sample project that processes clickstream data using Kafka and Apache Spark.
Install `scala`, `kafka`, and `apache-spark` using Homebrew.
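For reference, the Homebrew install can be done in one line (formula names as of writing; check `brew search` if any have been renamed):

```shell
brew install scala kafka apache-spark
```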
```shell
export JAVA_HOME="$(/usr/libexec/java_home)"
export PATH=$JAVA_HOME/bin:$PATH
export SCALA_HOME="/usr/local/Cellar/scala/2.12.4" # find it using `brew info scala`
export PATH=$SCALA_HOME/bin:$PATH
```
Make sure that `/usr/local/bin` is also added to your `$PATH`.
Use `pyenv` or similar to manage your Python versions and virtual environments. After creating a virtual environment, install the dependencies with `pip install -r requirements.txt`.
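The `pyenv` flow might look like the sketch below (this assumes the `pyenv-virtualenv` plugin is installed; the Python version shown is an arbitrary example, not one pinned by this project):

```shell
# create and activate a project-local virtual environment
pyenv install 3.10.13
pyenv virtualenv 3.10.13 clickstream
pyenv local clickstream          # auto-activates the env in this directory
pip install -r requirements.txt
```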
To use production data, copy the CSV file into `data/production.csv`.
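Before streaming the production file, it can help to sanity-check it. A minimal sketch using only the standard library (the `peek_csv` helper is hypothetical, and the project doesn't document the CSV schema, so no column names are assumed):

```python
import csv
import itertools

def peek_csv(path, n=5):
    """Return the header row and the first n data rows of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(itertools.islice(reader, n))
    return header, rows

# Example usage once the file is in place:
# header, rows = peek_csv("data/production.csv")
# print(header)
```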
See the `make` commands in the `Makefile` for running the services locally.
- Start Zookeeper: `make zookeeper`
- Start Kafka: `make kafka`
- In a new tab, create the `clickstream` topic with `make create_topic` (unless it already exists).
- Start the simple Spark stream that monitors the `clickstream` topic and prints the messages to the command line: `make spark_read`
- In a new tab, stream some sample data to Kafka: `make sample_data`
- The sample data should appear in the simple stream in the previous tab.
- Make sure that your production data (a really big CSV) is found under `data/production.csv`.
- Start importing the production data with `make production_data`
- Start the categories stream with `make spark_categories`
- The categories should appear counted, with a sliding interval of 10 seconds.
- The output of the previous stream should also appear written to the file system in the `output` directory.
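The windowed counting that the categories stream performs can be sketched in plain Python, outside Spark. This is an illustration of the idea only: the event shape, window length, and horizon below are assumptions, while the real job does its windowing in Spark over the `clickstream` topic:

```python
from collections import Counter

def windowed_counts(events, window=10, slide=10, horizon=30):
    """Count categories per sliding window.

    events: list of (timestamp_seconds, category) pairs.
    Returns {window_start: Counter} for windows starting at
    0, slide, 2*slide, ... below horizon.
    """
    results = {}
    start = 0
    while start < horizon:
        end = start + window
        results[start] = Counter(
            cat for ts, cat in events if start <= ts < end
        )
        start += slide
    return results

events = [(1, "sports"), (4, "news"), (12, "sports"), (15, "sports")]
counts = windowed_counts(events, window=10, slide=10, horizon=20)
# window [0, 10): sports=1, news=1; window [10, 20): sports=2
```

With `slide` equal to `window` this degenerates into tumbling windows; a smaller `slide` gives overlapping windows, which is what "sliding interval" refers to above.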