Flume Integration

There are two approaches to integrating Flume with Spark Streaming:

  1. Push-based approach
  2. Pull-based approach

Push-based approach

In this approach, Spark Streaming sets up a receiver that acts as an Avro agent for Flume. You need to:

  1. run a Spark worker on a specific machine (its hostname is used in the Flume configuration)
  2. create an Avro sink in your Flume configuration that pushes data to a port on that machine, for example:
     agent.sinks = avroSink
     agent.sinks.avroSink.type = avro
     agent.sinks.avroSink.channel = memoryChannel
     agent.sinks.avroSink.hostname = localhost
     agent.sinks.avroSink.port = 33333
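
On the Spark Streaming side, the receiver for this sink is created with FlumeUtils.createStream. The following is a minimal Scala sketch, not one of the examples listed below; the application name and batch interval are arbitrary choices, and the host and port must match the Avro sink configuration above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object PushBasedSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PushBasedSketch")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Bind an Avro receiver on the host/port targeted by the Flume Avro sink
        val stream = FlumeUtils.createStream(ssc, "localhost", 33333)

        // Each SparkFlumeEvent wraps an Avro event; decode its body as text
        stream.map(event => new String(event.event.getBody.array())).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }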

Pull-based approach

Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows:

  • Flume to push data into the sink, where it stays buffered
  • Spark Streaming to pull data from the sink, using a reliable Flume receiver and transactions; a transaction succeeds only after data is received and replicated by Spark Streaming

This approach therefore provides stronger reliability and fault-tolerance guarantees and should be preferred when these requirements are mandatory. The difference with respect to the push-based approach is that you are required to configure Flume to run a custom sink.

To set up this configuration you need to:

  • select a machine that will run the custom sink in a Flume agent; this is where the Flume pipeline is configured to send data
  • place the Spark Streaming - Flume integration jar, which contains the custom sink implementation, in the Flume agent's classpath and configure a Flume sink like:
  agent.sinks = spark
  agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
  agent.sinks.spark.hostname = localhost
  agent.sinks.spark.port = 33333
  agent.sinks.spark.channel = memoryChannel
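
On the Spark Streaming side, data is pulled from this sink with FlumeUtils.createPollingStream. The following is a minimal Scala sketch, not one of the examples listed below; the application name and batch interval are arbitrary choices, and the host and port must match the SparkSink configuration above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object PullBasedSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PullBasedSketch")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Connect to the custom SparkSink and pull events transactionally
        val stream = FlumeUtils.createPollingStream(ssc, "localhost", 33333)

        stream.map(event => new String(event.event.getBody.array())).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }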

Examples

Examples for these approaches are:

  • FlumeMultiPullBased
  • FlumeSinglePullBased
  • FlumeMultiPushBased
  • FlumeSinglePushBased

Example Flume configurations are provided in the resources folder; you can also find a start.sh script that can be used to start a Flume agent.

Push-based

To execute the push-based examples you need to:

  1. start the Spark Streaming example; it creates a receiver to which Flume will connect
  2. start the Flume pipeline; the provided configurations use a Flume source that monitors a file for new input lines
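
A source of this kind can be set up with Flume's exec source. The snippet below is a minimal sketch, not one of the provided configurations, and the monitored file path is hypothetical:

  agent.sources = fileSource
  agent.sources.fileSource.type = exec
  agent.sources.fileSource.command = tail -F /tmp/input.log
  agent.sources.fileSource.channels = memoryChannel
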
Pull-based

To execute the pull-based examples you need to:

  1. start the Flume agent; it creates the pipeline with the configured custom sink
  2. start the Spark Streaming example; it connects to the custom sink to retrieve data
