Flume Integration
There are two approaches to integrating Flume with Spark Streaming:
- Push-based approach
- Pull-based approach
Push-based approach
In this approach, Spark Streaming sets up a receiver that acts as an Avro agent for Flume. You need to:
- run a Spark worker on a specific machine (its address is used in the Flume configuration)
- create an Avro sink in your Flume configuration that pushes data to a port on that machine
# Avro sink that pushes events from the Flume channel to the Spark Streaming receiver
agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = localhost
agent.sinks.avroSink.port = 33333
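On the Spark side, the receiver is created with FlumeUtils.createStream, using the same host and port as the Avro sink. The following is a minimal sketch of the receiving side, not one of the repository's examples: the object name and batch interval are illustrative, and it assumes the spark-streaming-flume artifact is on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePushSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receiver acting as an Avro agent; host and port must match the Avro sink above
    val stream = FlumeUtils.createStream(ssc, "localhost", 33333)

    // Each SparkFlumeEvent wraps an Avro event; print its body as a UTF-8 string
    stream.map(e => new String(e.event.getBody.array(), "UTF-8")).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The hostname must belong to a machine where a Spark worker runs, so that the receiver can be scheduled there and the Avro sink can reach it.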
Pull-based approach
Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows:
- Flume to push data into the sink, where it stays buffered
- Spark Streaming to use a reliable Flume receiver and transactions to pull data from the sink

A transaction succeeds only after data is received and replicated by Spark Streaming, so this approach guarantees stronger reliability and fault-tolerance and should be preferred when these requirements are mandatory. The difference with respect to the push-based approach is that you are required to configure Flume to run a custom sink.
To set up this configuration you need to:
- select a machine that will run the custom sink in a Flume agent; this is where the Flume pipeline is configured to send data
- configure a Flume sink using the custom sink implementation contained in the Spark Streaming - Flume integration jar, like this:
# Custom SparkSink that buffers events until Spark Streaming pulls them
agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = localhost
agent.sinks.spark.port = 33333
agent.sinks.spark.channel = memoryChannel
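On the Spark side, the reliable receiver that pulls from the SparkSink is created with FlumeUtils.createPollingStream. Again, this is a minimal sketch under the same assumptions as above, not the repository's code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePullSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Reliable receiver that polls the SparkSink in transactions;
    // host and port must match the SparkSink configuration above
    val stream = FlumeUtils.createPollingStream(ssc, "localhost", 33333)

    stream.map(e => new String(e.event.getBody.array(), "UTF-8")).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```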
Examples for these approaches are:
- FlumeMultiPullBased
- FlumeSinglePullBased
- FlumeMultiPushBased
- FlumeSinglePushBased
Example Flume configurations are provided in the resources folder, where you can also find a start.sh script that can be used to start a Flume agent.
To execute the push-based examples you need to:
- start the Spark Streaming example first: it creates the receiver to which the Flume Avro sink will connect
- start the Flume pipeline: the provided configurations use a Flume source that monitors a file for new input lines
To execute the pull-based examples you need to:
- start the Flume agent first: it creates the pipeline with the configured custom sink
- start the Spark Streaming example: it connects to the custom sink to retrieve data