Sample project that processes clickstream data using Kafka and Apache Spark.
Install `scala`, `kafka`, and `apache-spark` using Homebrew.
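For reference, the Homebrew install can be done in one line (formula names as of writing; check `brew search` if any have been renamed):

```shell
brew install scala kafka apache-spark
```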
```shell
export JAVA_HOME="$(/usr/libexec/java_home)"
export PATH=$JAVA_HOME/bin:$PATH
export SCALA_HOME="/usr/local/Cellar/scala/2.12.4" # find it using `brew info scala`
export PATH=$SCALA_HOME/bin:$PATH
```
Make sure that `/usr/local/bin` is also added to your `$PATH`.
Use `pyenv` or similar to manage your Python versions and virtual environments. After creating a virtual environment, install the dependencies with `pip install -r requirements.txt`.
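The `pyenv` flow might look like the sketch below (this assumes the `pyenv-virtualenv` plugin is installed; the Python version shown is an arbitrary example, not one pinned by this project):

```shell
# create and activate a project-local virtual environment
pyenv install 3.10.13
pyenv virtualenv 3.10.13 clickstream
pyenv local clickstream          # auto-activates the env in this directory
pip install -r requirements.txt
```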
To use production data, copy the CSV file into `data/production.csv`.
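Before streaming the production file, it can help to sanity-check it. A minimal sketch using only the standard library (the `peek_csv` helper is hypothetical, and the project doesn't document the CSV schema, so no column names are assumed):

```python
import csv
import itertools

def peek_csv(path, n=5):
    """Return the header row and the first n data rows of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(itertools.islice(reader, n))
    return header, rows

# Example usage once the file is in place:
# header, rows = peek_csv("data/production.csv")
# print(header)
```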
See the `make` commands in the `Makefile` for running the services locally.
- Start Zookeeper: `make zookeeper`
- Start Kafka: `make kafka`
- In a new tab, create the `clickstream` topic with `make create_topic` (unless it already exists).
- Start the simple Spark stream that monitors the `clickstream` topic and prints the messages to the command line: `make spark_read`
- In a new tab, stream some sample data to Kafka: `make sample_data`
- The sample data should appear in the simple stream in the previous tab.
- Make sure that your production data (a really big CSV) is found under `data/production.csv`.
- Start importing the production data with `make production_data`
- Start the categories stream with `make spark_categories`
- The categories should appear counted, with a sliding interval of 10 seconds.
- The output of the previous stream should also appear written to the file system in the `output` directory.
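The windowed counting that the categories stream performs can be sketched in plain Python, outside Spark. This is an illustration of the idea only: the event shape, window length, and horizon below are assumptions, while the real job does its windowing in Spark over the `clickstream` topic:

```python
from collections import Counter

def windowed_counts(events, window=10, slide=10, horizon=30):
    """Count categories per sliding window.

    events: list of (timestamp_seconds, category) pairs.
    Returns {window_start: Counter} for windows starting at
    0, slide, 2*slide, ... below horizon.
    """
    results = {}
    start = 0
    while start < horizon:
        end = start + window
        results[start] = Counter(
            cat for ts, cat in events if start <= ts < end
        )
        start += slide
    return results

events = [(1, "sports"), (4, "news"), (12, "sports"), (15, "sports")]
counts = windowed_counts(events, window=10, slide=10, horizon=20)
# window [0, 10): sports=1, news=1; window [10, 20): sports=2
```

With `slide` equal to `window` this degenerates into tumbling windows; a smaller `slide` gives overlapping windows, which is what "sliding interval" refers to above.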