Spark Streaming is a library in Apache Spark for scalable, high-throughput, fault-tolerant processing of live data streams. It can ingest data from sources such as Kafka, Flume, Kinesis, or TCP sockets, and process that data with complex algorithms expressed using the Spark API, including the MLlib and GraphX libraries. Processed data can be pushed to live dashboards, file systems, and databases.
The Spark Streaming component receives live data streams from input sources such as Kafka, Flume, and Kinesis and divides them into batches. The Spark engine then processes these input batches and produces the final stream of results, also in batches.
DStreams, or discretized streams, are the high-level abstraction provided by Spark Streaming to represent a continuous stream of data. DStreams can be created either from input sources such as Kafka, Flume, or Kinesis, or by applying high-level operations on existing DStreams.
Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.
A StreamingContext object is the entry point to a Spark Streaming application; it holds the SparkConf object and the batch interval for the streaming program. Both are supplied when a new StreamingContext instance is created.
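A minimal sketch in Scala of creating a StreamingContext (the application name, master URL, and batch interval below are placeholder choices, not requirements):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// SparkConf carries the application name and master; "local[2]" is only for local testing.
val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")

// The StreamingContext is built from the SparkConf and a batch interval (here 5 seconds).
val ssc = new StreamingContext(conf, Seconds(5))
```

Once the input DStreams and their transformations have been defined on this context, ssc.start() begins the computation and ssc.awaitTermination() waits for it to stop.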
Spark Streaming provides two different categories of built-in streaming sources.
Basic Sources - These sources are available directly in the StreamingContext API. Examples: file systems and socket connections (see the sketch after this list).
Advanced Sources - Sources such as Kafka, Flume, and Kinesis that are made available through extra utility classes provided with the Spark Streaming library.
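A short sketch of defining basic sources on the StreamingContext created above (the directory path, host, and port are placeholders; advanced sources such as Kafka additionally need the corresponding utility class, e.g. KafkaUtils from the separate Kafka integration artifact):

```scala
// Basic source: monitor a directory for new text files (placeholder HDFS path).
val fileStream = ssc.textFileStream("hdfs://namenode:9000/incoming")

// Basic source: read lines of text from a TCP socket (placeholder host and port).
val socketStream = ssc.socketTextStream("localhost", 9999)
```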
map(func) - map() transformation returns a new distributed dataset formed by passing each element of the source dataset through a function func.
filter(func) - filter() transformation returns a new distributed dataset formed by selecting those elements of the source dataset on which func returns true.
flatMap(func) - flatMap() transformation is similar to map(), except that each input item can be mapped to zero or more output items (so func should return a sequence rather than a single item).
union(otherDataset) - union() transformation returns a new dataset that contains the union of the elements in the source dataset and the dataset passed as its argument.
intersection(otherDataset) - intersection() transformation returns a new distributed dataset that contains the intersection of the elements in the source dataset and the dataset passed as its argument.
distinct() - distinct() transformation returns a new distributed dataset that contains the distinct elements of the source dataset.
groupByKey() - groupByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, Iterable<V>) pairs.
reduceByKey(func) - reduceByKey(func) transformation is called on a dataset of (K, V) pairs, and returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
aggregateByKey(zeroValue)(seqOp, combOp) - aggregateByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value.
sortByKey([ascending]) - sortByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the boolean ascending argument (several of these transformations are sketched below).
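A minimal word-count sketch that chains several of these transformations on the socketStream defined earlier (the stream and variable names are carried over from the previous snippets):

```scala
// Split each line into words, drop empty tokens, and count occurrences per batch.
val words    = socketStream.flatMap(line => line.split(" "))  // flatMap: one record per word
val nonEmpty = words.filter(_.nonEmpty)                       // filter: keep non-empty tokens
val pairs    = nonEmpty.map(word => (word, 1))                // map: build (K, V) pairs
val counts   = pairs.reduceByKey(_ + _)                       // reduceByKey: sum counts per word
counts.print()                                                // print a few elements of each batch
```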
Output operations on DStreams push the DStream's data to external systems such as databases or file systems. The following are the key output operations that can be performed on DStreams.
saveAsTextFiles() - Saves the DStream's data as text files.
saveAsObjectFiles() - Saves the DStream's data as serialized Java objects.
saveAsHadoopFiles() - Saves the DStream's data as Hadoop files.
foreachRDD(func) - A generic output operator that applies a function func to each RDD generated from the DStream (see the sketch after this list).
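A brief sketch of output operations, reusing the counts stream and ssc from the earlier snippets (the output path prefix is a placeholder):

```scala
// Save each batch of counts as text files; Spark generates one directory per
// batch interval based on the given prefix and suffix.
counts.saveAsTextFiles("hdfs://namenode:9000/output/wordcounts", "txt")

// Generic output: apply an arbitrary function to each batch's RDD, e.g. to
// write records to an external database. println is only a stand-in here.
counts.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach(println)
  }
}

// Nothing is processed until the streaming context is started.
ssc.start()
ssc.awaitTermination()
```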