
Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions such as map, reduce, join, and window.

Spark Streaming receives live input data streams and divides the data into micro-batches, which are then processed by the Spark engine to generate the final stream of results in batches.
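A minimal sketch of this model in Scala, counting words read from a TCP socket in 5-second micro-batches (the host, port, and batch interval are illustrative choices, not part of the API):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Each batch interval (here 5 seconds) becomes one micro-batch
    // processed by the Spark engine. local[2] reserves one core for
    // the receiver and one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Ingest a live text stream from a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // Express the computation with high-level functions: flatMap, map, reduceByKey
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()            // start the computation
    ssc.awaitTermination() // wait for it to terminate
  }
}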

Spark Streaming is available through the Maven Central Repository.

Maven

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.3.0</version>
</dependency>

SBT

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.3.0"

For ingesting data from sources such as Kafka or Flume, which are not part of the Spark Streaming core API, you need to add the corresponding artifact spark-streaming-xyz_2.10, which includes all the classes required to integrate Spark Streaming with the selected source.

Kafka

Maven

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.3.0</version>
</dependency>

SBT

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.3.0"
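With the artifact on the classpath, a receiver-based Kafka stream can be created with KafkaUtils. The sketch below assumes an existing StreamingContext ssc as in the earlier example; the ZooKeeper quorum, consumer group, and topic name are placeholder values:

import org.apache.spark.streaming.kafka.KafkaUtils

// Connect to Kafka through ZooKeeper; the map pairs each topic with
// the number of receiver threads to use for it
val kafkaStream = KafkaUtils.createStream(
  ssc,                    // an existing StreamingContext
  "zk-host:2181",         // ZooKeeper quorum (placeholder)
  "my-consumer-group",    // Kafka consumer group id (placeholder)
  Map("my-topic" -> 1))   // topic -> number of receiver threads

// Each record is a (key, message) pair; keep only the message text
val messages = kafkaStream.map(_._2)
messages.print()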

Flume

Maven

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>1.3.0</version>
</dependency>

SBT

libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.10" % "1.3.0"
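Similarly, a sketch of a Flume-based stream with FlumeUtils, where the receiver listens for Avro events pushed by a Flume sink (hostname and port are placeholders; ssc is an existing StreamingContext as in the first example):

import org.apache.spark.streaming.flume.FlumeUtils

// Start a receiver listening on localhost:4141 for events
// pushed by a Flume Avro sink
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 4141)

// Each element is a SparkFlumeEvent; extract the event body as text
flumeStream.map(event => new String(event.event.getBody.array())).print()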
