Apache Spark is a fast, general-purpose cluster computing system. Apache Spark has an advanced DAG (directed acyclic graph) execution engine that supports in-memory computing and acyclic data flows. As a result, Spark programs run up to 100X faster than Hadoop MapReduce in memory and up to 10X faster on disk.
Spark supports multiple programming languages, providing built-in APIs in Java, Scala, Python and R.
Apache Spark provides multiple components on top of Spark Core: Spark SQL, Spark Streaming, MLlib and GraphX.
Apache Spark runs on Hadoop as well as in the cloud or standalone.
Apache Spark supports in-memory cluster computing, keeping intermediate data in memory rather than on disk. Hadoop MapReduce reads and writes data on disk between stages.
As a result, Spark programs run up to 100X faster than Hadoop MapReduce in memory and up to 10X faster on disk.
Apache Spark provides libraries for processing live data streams, batch workloads, graph processing and machine learning on the same cluster, so there are fewer components to manage. Hadoop MapReduce only supports batch processing and depends on other components such as Storm, Giraph etc. for those capabilities, so Hadoop requires more components to manage.
Apache Spark excels in processing live data streams such as Twitter streams. Hadoop MapReduce is a batch processing engine and does not have real-time data processing capabilities.
The Apache Spark ecosystem comes with four component libraries - Spark SQL, Spark Streaming, MLlib and GraphX.
Spark SQL - Spark SQL makes it possible to seamlessly use SQL queries in Spark applications.
Spark Streaming - Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
MLlib - MLlib is Apache Spark's scalable machine learning library. MLlib contains many algorithms, utilities and workflows that support machine learning applications.
GraphX - GraphX is Apache Spark's API for graphs and graph-parallel computations. GraphX has a highly flexible API and comes with a variety of graph algorithms for developing graph-based applications.
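As an illustration of how one of these libraries is used, the following is a minimal Spark SQL sketch in Scala; the application name, master URL, file name people.json, view name and query are placeholders chosen for this example, not part of the original notes.

    import org.apache.spark.sql.SparkSession

    // Entry point for Spark SQL applications.
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .master("local[*]")          // run locally for the sketch
      .getOrCreate()

    // Read a JSON file into a DataFrame (the path is a placeholder).
    val people = spark.read.json("people.json")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    people.createOrReplaceTempView("people")

    // Run a SQL query seamlessly inside the Spark application.
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()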
Apache Spark can be deployed on Amazon EC2, Mesos, YARN and in standalone mode.
Apache Spark provides Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of data elements, partitioned across the nodes of the cluster, that can be operated on in parallel by Spark.
Resilient Distributed Datasets are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing collection in the driver program, and transforming it.
Spark can persist RDDs to memory and reuse them across parallel operations. RDDs automatically recover from node failures.
RDDs can be created in two ways.
1. Parallelizing an existing collection - An RDD can be created by parallelizing an existing data collection in the Spark driver program. A collection is parallelized by calling the method parallelize() on the SparkContext object and passing the data collection.
2. Referencing an external file - An RDD can be created by referencing a dataset in an external Hadoop or Hadoop-supported storage system such as HDFS, HBase, Amazon S3, Cassandra etc.
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat types.
Such an RDD is created by calling the method on the SparkContext corresponding to the file type and passing the file path or URL, for example textFile() for text files and sequenceFile() for SequenceFiles.
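A minimal sketch in Scala illustrating both ways of creating an RDD; the application name, master URL and file path are placeholders for this example, and local[*] simply runs Spark locally.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelizing an existing collection in the driver program.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Referencing an external file (the HDFS path is a placeholder).
    val lines = sc.textFile("hdfs:///data/input.txt")

    // Persist an RDD in memory so it can be reused across parallel operations.
    lines.cache()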
Two kinds of operations can be performed on Resilient Distributed Datasets - Transformations and Actions.
Transformations - Transformations create a new dataset from an existing one. Transformations in Spark are lazy; they are computed only when an action requires a result to be returned to the driver program.
Actions - Actions perform a computation on a dataset and return a value to the driver program.
map(func) - map() transformation returns a new distributed dataset from a source dataset formed by passing each element of the source through a function func.
filter(func) - filter() transformation returns a new distributed dataset from a source dataset formed by selecting the elements of the source on which func returns true.
flatMap(func) - flatMap() transformation is similar to map(), but each input item can be mapped to 0 or more output items, so func returns a sequence of items rather than a single item.
union() - union() transformation returns a new dataset that contains the union of the elements in the source dataset and the dataset that is passed as argument to the function.
intersection() - intersection() transformation returns a new distributed dataset that contains the intersection of elements in the source dataset and the dataset that is passed as argument to the function.
distinct() - distinct() transformation returns a new distributed dataset that contains the distinct elements of the source dataset.
groupByKey() - groupByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, Iterable<V>) pairs.
reduceByKey(func) - reduceByKey(func) transformation is called on a dataset of (K, V) pairs, and returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
aggregateByKey(zeroValue)(seqOp, combOp) - aggregateByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral zero value.
sortByKey() - sortByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the boolean ascending argument.
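The following sketch chains several of these transformations in Scala; it assumes the SparkContext sc and the lines RDD from the earlier sketch. Because transformations are lazy, none of these lines triggers a computation by itself.

    // Split each line into words: each input line maps to 0 or more output items.
    val words = lines.flatMap(line => line.split(" "))

    // Keep only the elements for which the predicate returns true.
    val nonEmpty = words.filter(word => word.nonEmpty)

    // Map each word to a (word, 1) pair.
    val pairs = nonEmpty.map(word => (word, 1))

    // Aggregate the values for each key with the given reduce function.
    val wordCounts = pairs.reduceByKey(_ + _)

    // Sort the (K, V) pairs by key in ascending order.
    val sorted = wordCounts.sortByKey(ascending = true)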
reduce(func) - reduce() action aggregates the elements of the dataset using a function func which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.
collect() - collect() action returns all the elements of the dataset as an array to the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() - count() action returns the number of elements in the dataset.
first() - first() action returns the first element of the dataset.
countByKey() - countByKey() action is available on datasets of (K, V) pairs and returns a hashmap of (K, Int) pairs with the count of each key.
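A short sketch of these actions in Scala, assuming the wordCounts RDD built in the previous sketch; each action triggers the computation of the lazy transformations behind it.

    // Number of elements in the dataset.
    val totalPairs = wordCounts.count()

    // First element of the dataset.
    val firstPair = wordCounts.first()

    // All elements returned to the driver program as an array
    // (only sensible when the dataset is small enough).
    val asArray = wordCounts.collect()

    // Count of occurrences of each key, returned as a map to the driver.
    val perKey = wordCounts.countByKey()

    // Aggregate with a commutative and associative function so it can run in parallel.
    val totalWords = wordCounts.map(_._2).reduce(_ + _)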
Shared variables are variables that are shared across parallel Spark functions or tasks running on different nodes.
Spark supports two types of shared variables: broadcast variables and accumulators.
Broadcast Variables - Broadcast variables are used to cache a read-only value in memory on all nodes.
Accumulators - Accumulators are variables that are only added to, such as counters and sums.
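A minimal sketch of both kinds of shared variables in Scala; it assumes the SparkContext sc and the words RDD from the earlier sketches, and the stop-word set and accumulator name are made up for illustration.

    // Broadcast variable: a read-only value cached in memory on every node.
    val stopWords = sc.broadcast(Set("a", "an", "the"))

    // Accumulator: a variable that tasks only add to, e.g. a counter.
    val skipped = sc.longAccumulator("skippedWords")

    val kept = words.filter { word =>
      if (stopWords.value.contains(word)) {
        skipped.add(1)   // count the stop words we drop
        false
      } else true
    }

    // The action triggers the computation; afterwards skipped.value
    // holds the number of stop words encountered.
    kept.count()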