The Wikipedia Events data stream reports events as they occur within the MediaWiki ecosystem, at a rate of approximately 30-50 events per second.
Use this repository to:
- Run a custom Kafka cluster, along with a Redpanda console and a Redis database instance.
- Create a Kafka producer that listens to the SSE stream and writes each event as a message to a Kafka topic.
- Create a Kafka consumer that reads out messages from the topic and identifies the event "type" for each.
- Store a running count of each event type in a Redis key-value store.
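The consumer and Redis steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the function name `count_event_type`, the `event_type:` key prefix, and the in-memory `FakeStore` stand-in are all hypothetical. The only store method it relies on is `incr`, which the redis-py client also provides.

```python
import json


class FakeStore:
    """In-memory stand-in for a Redis client (hypothetical, for illustration only)."""

    def __init__(self):
        self.data = {}

    def incr(self, name, amount=1):
        # Mirrors Redis INCR: a missing key starts at 0, then is incremented.
        self.data[name] = self.data.get(name, 0) + amount
        return self.data[name]


def count_event_type(store, raw_message, prefix="event_type:"):
    """Extract the "type" field from one JSON message and bump its counter.

    Returns the event type, or None if the message is not valid JSON
    or has no "type" field. The key prefix is an assumption.
    """
    try:
        event = json.loads(raw_message)
    except (TypeError, ValueError):
        return None
    if not isinstance(event, dict):
        return None
    event_type = event.get("type")
    if event_type is None:
        return None
    store.incr(f"{prefix}{event_type}")
    return event_type


store = FakeStore()
for msg in (b'{"type": "edit"}', b'{"type": "new"}', b'{"type": "edit"}', b"not json"):
    count_event_type(store, msg)
print(store.data)
```

With a real Redis instance, `store` would be a redis-py client object instead of `FakeStore`; the host and port to connect to depend on the Docker Compose configuration in this repository.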
What is SSE? Server-Sent Events (SSE) is a standardized, lightweight, one-way communication protocol that allows a server to push real-time data updates to a web client over a single persistent HTTP connection. Unlike WebSockets, SSE is not bidirectional: data flows only from the server to the client.
- Learn more about the Wikipedia EventStream.
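To make the SSE format concrete: an SSE response body is plain text in which each event is a group of `field: value` lines terminated by a blank line. The sketch below is a simplified parser for that wire format. The function name `parse_sse` and the sample payload are made up for illustration, and the parser ignores parts of the standard such as the `retry` field and the persistence of the last event ID across events.

```python
import json


def parse_sse(lines):
    """Yield one dict per SSE event from an iterable of decoded text lines."""
    event = {}
    data_lines = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_lines:
                event["data"] = "\n".join(data_lines)
                yield event
            event, data_lines = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # exactly one leading space is stripped
        if field == "data":
            data_lines.append(value)  # multiple data lines are joined with "\n"
        elif field in ("event", "id"):
            event[field] = value


# Made-up sample: one event whose JSON payload spans two data lines.
sample = [
    ": keep-alive\n",
    "event: message\n",
    "id: 42\n",
    'data: {"type": "edit",\n',
    'data:  "wiki": "enwiki"}\n',
    "\n",
]
events = list(parse_sse(sample))
payload = json.loads(events[0]["data"])
print(events[0]["event"], events[0]["id"], payload["type"])
```

In practice a producer would feed such a parser (or an SSE client library) from a streaming HTTP response rather than from a list of strings.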
- Fork and clone this repository to your local machine.
- Create a Python virtual environment, activate it, and install all packages in requirements.txt.
- Using Docker, run this command from within the directory of your clone:
    docker compose up
- This will create a new Docker network and start 5 containers running on that network, along with Docker storage volumes for each Kafka node.
- The Kafka UI is available at http://127.0.0.1:8080/
- Once up and running, start listening to the event stream and producing messages into Kafka:
    python wikipedia-stream.py
- Visit the Kafka UI and check the list of topics. Click into the wikipedia-edits topic and watch messages as they are written.
- Next, use the consumer to read out messages, extract the "type" data field from the message JSON, and store it in Redis:
    python wikipedia-consumer.py
- Your consumer will process messages as quickly as it is able. Given enough time it will catch up with the total number of messages published by the producer.
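The catch-up behavior described in the last step comes down to simple arithmetic: the backlog shrinks only if the consumer reads messages faster than the producer writes them. The sketch below estimates the catch-up time under the assumption that both rates are constant. The function name and the example numbers are hypothetical, apart from the producer rate, which is chosen from the 30-50 events per second range mentioned at the top.

```python
def seconds_to_catch_up(backlog, produce_rate, consume_rate):
    """Estimate seconds until the consumer has read every published message.

    backlog: messages already in the topic but not yet consumed
    produce_rate / consume_rate: messages per second (assumed constant)
    Returns None when the consumer is not faster than the producer,
    because the backlog then never shrinks.
    """
    if backlog <= 0:
        return 0.0
    if consume_rate <= produce_rate:
        return None
    return backlog / (consume_rate - produce_rate)


# Hypothetical numbers: 12,000 unread messages, 40 events/sec arriving,
# and a consumer that handles 200 messages/sec.
print(seconds_to_catch_up(12_000, 40, 200))  # 75.0
```

If the function returns `None`, the consumer would fall further behind over time; in this project's setting that would only happen if each message took longer than roughly 20-30 ms to process.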
