Flume Integration

There are two approaches to integrating Flume with Spark Streaming:

  1. Push-based approach
  2. Pull-based approach

Push-based approach

In this approach, Spark Streaming sets up a receiver that acts as an Avro agent for Flume. You need to:

  1. run a Spark worker on a specific machine (its hostname is used in the Flume configuration)
  2. create an Avro sink in your Flume configuration that pushes data to a port on that machine, for example:
     agent.sinks = avroSink
     agent.sinks.avroSink.type = avro
     agent.sinks.avroSink.channel = memoryChannel
     agent.sinks.avroSink.hostname = localhost
     agent.sinks.avroSink.port = 33333
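
On the Spark Streaming side, the receiver for this sink is created with FlumeUtils.createStream. The following is a minimal Scala sketch, not one of the examples listed below; the application name and batch interval are arbitrary choices, and the host and port must match the Avro sink configuration above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object PushBasedSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PushBasedSketch")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Bind an Avro receiver on the host/port targeted by the Flume Avro sink
        val stream = FlumeUtils.createStream(ssc, "localhost", 33333)

        // Each SparkFlumeEvent wraps an Avro event; decode its body as text
        stream.map(event => new String(event.event.getBody.array())).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }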

Pull-based approach

Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows:

  • Flume to push data into the sink, where it stays buffered
  • Spark Streaming to pull data from the sink, using a reliable Flume receiver and transactions; a transaction succeeds only after data is received and replicated by Spark Streaming

This approach therefore provides stronger reliability and fault-tolerance guarantees and should be preferred when these requirements are mandatory. The difference with respect to the push-based approach is that you are required to configure Flume to run a custom sink.

To set up this configuration you need to:

  • select a machine that will run the custom sink in a Flume agent; this is where the Flume pipeline is configured to send data
  • place the Spark Streaming - Flume integration jar, which contains the custom sink implementation, in the Flume agent's classpath and configure a Flume sink like:
  agent.sinks = spark
  agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
  agent.sinks.spark.hostname = localhost
  agent.sinks.spark.port = 33333
  agent.sinks.spark.channel = memoryChannel
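
On the Spark Streaming side, data is pulled from this sink with FlumeUtils.createPollingStream. The following is a minimal Scala sketch, not one of the examples listed below; the application name and batch interval are arbitrary choices, and the host and port must match the SparkSink configuration above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object PullBasedSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PullBasedSketch")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Connect to the custom SparkSink and pull events transactionally
        val stream = FlumeUtils.createPollingStream(ssc, "localhost", 33333)

        stream.map(event => new String(event.event.getBody.array())).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }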

Examples

Examples for these approaches are:

  • FlumeMultiPullBased
  • FlumeSinglePullBased
  • FlumeMultiPushBased
  • FlumeSinglePushBased

Example Flume configurations are provided in the resources folder; you can also find a start.sh script that can be used to start a Flume agent.

Push-based

To execute the push-based examples you need to:

  1. start the Spark Streaming example; it creates a receiver to which Flume will connect
  2. start the Flume pipeline; the provided configurations use a Flume source that monitors a file for new input lines
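
A source of this kind can be set up with Flume's exec source. The snippet below is a minimal sketch, not one of the provided configurations, and the monitored file path is hypothetical:

  agent.sources = fileSource
  agent.sources.fileSource.type = exec
  agent.sources.fileSource.command = tail -F /tmp/input.log
  agent.sources.fileSource.channels = memoryChannel
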
Pull-based

To execute the pull-based examples you need to:

  1. start the Flume agent; it creates the pipeline with the configured custom sink
  2. start the Spark Streaming example; it connects to the custom sink to retrieve data
