Apache Spark is a fast, general-purpose cluster computing system. It has an advanced DAG execution engine that supports acyclic data flows and in-memory computing, which makes Spark computations very fast: Spark programs can run up to 100x faster than Hadoop MapReduce when data fits in memory, and up to 10x faster on disk.
Spark supports multiple programming languages, providing built-in APIs in Java, Scala, Python and R.
Apache Spark provides multiple components on top of Spark Core: Spark SQL, Spark Streaming, MLlib and GraphX.
Apache Spark runs on Hadoop as well as in the cloud or standalone.
Apache Spark supports in-memory cluster computing, keeping intermediate data in memory instead of writing it to disk. Hadoop MapReduce reads its input from disk and writes intermediate results back to disk.
Spark programs can run up to 100x faster than Hadoop MapReduce in memory and up to 10x faster on disk.
Apache Spark provides libraries for processing live data streams, batch workloads, graph processing and machine learning on the same cluster, so fewer components need to be managed. Hadoop MapReduce only supports batch processing; Hadoop depends on other components such as Storm, Giraph etc. for these capabilities, so more components need to be managed.
Apache Spark excels at processing live data streams such as Twitter streams. Hadoop MapReduce is a batch processing engine and does not have real-time data processing capabilities.
The Apache Spark ecosystem comes with four component libraries - Spark SQL, Spark Streaming, MLlib and GraphX.
Spark SQL - Spark SQL makes it possible to seamlessly use SQL queries in Spark applications.
Spark Streaming - Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
MLlib - MLlib is Apache Spark's scalable machine learning library. MLlib contains many algorithms, utilities and workflows that support machine learning applications.
GraphX - GraphX is Apache Spark's API for graphs and graph-parallel computations. GraphX has a highly flexible API and comes with a variety of graph algorithms for developing graph based applications.
Apache Spark can be deployed on Amazon EC2, Mesos, YARN, or in standalone mode.
Apache Spark provides Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of data elements partitioned across the nodes of the cluster that can be operated on in parallel.
Resilient Distributed Datasets are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or with an existing collection in the driver program, and transforming it.
Spark can persist RDDs to memory and reuse them across parallel operations. RDDs automatically recover from node failures.
RDDs can be created in two ways.
1. Parallelizing an existing collection - An RDD can be created by parallelizing an existing data collection in the Spark driver program. A collection is parallelized by calling the parallelize() method on the SparkContext object and passing in the data collection.
2. Referencing an external file - An RDD can be created by referencing a dataset in an external Hadoop or Hadoop-supported storage system such as HDFS, HBase, Amazon S3, Cassandra etc.
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat types.
The RDD is created by calling the method on the SparkContext corresponding to the file type and passing in the file path or URL, for example textFile() for text files, sequenceFile() for SequenceFiles etc.
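A minimal Java sketch of both creation approaches, assuming a local master and a placeholder HDFS path (the host and file names are hypothetical):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("RDD Creation").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

// 1. Parallelize an existing collection in the driver program
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// 2. Reference an external file (the path below is a placeholder)
JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/input.txt");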
Two kinds of operations can be performed on Resilient Distributed Datasets - Transformations and Actions.
Transformations - Transformations create a new dataset from an existing one. Transformations in Spark are lazy: they are not computed immediately, but only when an action requires a result to be returned to the driver program.
Actions - Actions perform a computation on a dataset and return a value to the driver program.
map(func) - map() transformation returns a new distributed dataset from a source dataset formed by passing each element of the source through a function func.
filter(func) - The filter() transformation returns a new distributed dataset formed by selecting the elements of the source dataset on which func returns true.
flatMap(func) - The flatMap() transformation is similar to map(), but each input item can be mapped to 0 or more output items.
union() - union() transformation returns a new dataset that contains the union of the elements in the source dataset and the dataset that is passed as argument to the function.
intersection() - intersection() transformation returns a new distributed dataset that contains the intersection of elements in the source dataset and the dataset that is passed as argument to the function.
distinct() - distinct() transformation returns a new distributed dataset that contains the distinct elements of the source dataset.
groupByKey() - The groupByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, Iterable<V>) pairs.
reduceByKey(func) - reduceByKey(func) transformation is called on a dataset of (K, V) pairs, and returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
aggregateByKey(zeroValue)(seqOp, combOp) - The aggregateByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral 'zero' value.
sortByKey([ascending]) - The sortByKey() transformation is called on a dataset of (K, V) pairs and returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the boolean ascending argument.
reduce(func) - reduce() action aggregates the elements of the dataset using a function func which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.
collect() - collect() action returns all the elements of the dataset as an array to the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() - count() action returns the number of elements in the dataset.
first() - first() action returns the first element of the dataset.
countByKey() - countByKey() action returns a hashmap of (K, Int) pairs with the count of each key.
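A short Java sketch that chains several of the transformations and actions described above, assuming the JavaSparkContext sc from the earlier example and a placeholder input file:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/input.txt");

// Transformations are lazy; nothing is computed yet
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
JavaRDD<String> nonEmpty = words.filter(word -> !word.isEmpty());
JavaPairRDD<String, Integer> counts = nonEmpty
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);

counts.cache();                                         // persist for reuse across actions
long distinctWords = counts.count();                    // action: triggers the computation
List<Tuple2<String, Integer>> results = counts.collect();  // action: returns results to the driver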
Shared variables are variables that are shared across parallel Spark functions or tasks running on different nodes.
Spark supports two types of shared variables: broadcast variables and accumulators.
Broadcast Variables - Broadcast variables are used to cache a read-only value in memory on all nodes, rather than shipping a copy of it with every task.
Accumulators - Accumulators are variables that are only added to through an associative and commutative operation, such as counters and sums.
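A brief Java sketch of both shared-variable types, assuming an existing JavaSparkContext sc (the variable names are illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

// Broadcast variable: a read-only value cached on every node
Broadcast<List<Integer>> lookup = sc.broadcast(Arrays.asList(1, 2, 3));

// Accumulator: tasks only add to it, the driver reads the final value
LongAccumulator matches = sc.sc().longAccumulator("matches");

sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).foreach(x -> {
    if (lookup.value().contains(x)) {
        matches.add(1);
    }
});
System.out.println("Matches: " + matches.value());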
Apache Spark GraphX is a component library in the Apache Spark ecosystem that works seamlessly with both graphs and collections.
GraphX implements a variety of graph algorithms and provides a flexible API to utilize the algorithms.
Apache Spark GraphX provides the following types of operators - Property operators, Structural operators and Join operators.
Property Operators - Property operators modify vertex or edge properties using a user-defined map function and produce a new graph.
Structural Operators - Structural operators operate on the structure of an input graph and produce a new graph.
Join Operators - Join operators add data to graphs and produce a new graph.
Property operators modify vertex or edge properties using a user-defined map function and produce a new graph.
Property operators do not impact the graph structure, but the resulting graph reuses the structural indices of the original graph.
Apache Spark GraphX provides the following property operators - mapVertices(), mapEdges(), mapTriplets()
Structural operators modify the structure of an input graph and produce a new graph.
Apache Spark GraphX provides the following structural operators.
reverse() - Returns a new graph with all edge directions reversed.
subgraph() - Takes vertex and edge predicates and returns the graph containing only the vertices and edges that satisfy them.
mask() - Returns a subgraph containing only the vertices and edges that are also found in another input graph.
groupEdges() - Merges parallel edges (duplicate edges between pairs of vertices) in the multigraph.
Join operators join data from external collections (RDDs) with graphs. Apache Spark GraphX provides the following join operators.
joinVertices() - The joinVertices() operator joins the input RDD data with vertices and returns a new graph. The vertex properties are obtained by applying the user defined map() function to the result of the joined vertices. Vertices without a matching value in the RDD retain their original value.
outerJoinVertices() - The outerJoinVertices() operator joins the input RDD data with vertices and returns a new graph. The vertex properties are obtained by applying the user defined map() function to all vertices, including those that do not have a matching value in the input RDD.
Apache Spark GraphX provides the following neighborhood aggregation operations.
aggregateMessages() - The aggregateMessages() operator applies a user-defined sendMsg function to each edge triplet in the graph and uses a mergeMsg function to aggregate the resulting messages at their destination vertices.
mapReduceTriplets() - The older mapReduceTriplets() operator provides similar functionality and has been superseded by aggregateMessages().
collectNeighbors() - The collectNeighborIds() and collectNeighbors() operators collect the IDs or attributes of the neighboring vertices at each vertex.
Apache Spark GraphX provides various operations for building graphs from an RDD of vertices and edges.
GraphLoader.edgeListFile() - Loads a graph from a list of edges on disk.
Graph.apply() - Creates a graph from RDDs of vertices and edges.
Graph.fromEdges() - Creates a graph from only an RDD of edges, assigning a default value to the vertices.
Graph.fromEdgeTuples() - Creates a graph from only an RDD of edge tuples, assigning the edges the value 1.
Apache Spark GraphX provides a set of algorithms to simplify analytics tasks.
PageRank - PageRank measures the importance of each vertex in a graph.
Connected Components - The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex.
Triangle Counting - A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the TriangleCount object that determines the number of triangles passing through each vertex, providing a measure of clustering.
MLlib is the machine learning library provided in Apache Spark. It provides tools for common machine learning algorithms, featurization, pipelines and model persistence, as well as utilities for statistics, data handling etc.
Apache Spark MLlib provides support for two correlation methods - Pearson's correlation and Spearman's correlation.
Hypothesis testing is a statistical tool that determines if a result occurred by chance or not, and whether this result is statistically significant.
Apache Spark MLlib supports Pearson's Chi-squared tests for independence.
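A small Java sketch of these statistics utilities, assuming a hypothetical DataFrame df with a Vector column named "features" and a numeric "label" column:

import org.apache.spark.ml.stat.ChiSquareTest;
import org.apache.spark.ml.stat.Correlation;
import org.apache.spark.sql.Row;

// Pearson and Spearman correlation matrices over the "features" column
Row pearson = Correlation.corr(df, "features", "pearson").head();
Row spearman = Correlation.corr(df, "features", "spearman").head();

// Pearson's chi-squared test of independence between each feature and the label
Row chi = ChiSquareTest.test(df, "features", "label").head();
System.out.println("p-values: " + chi.get(0));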
Apache Spark MLlib provides ML Pipelines, which chain multiple algorithms together into a single workflow. ML Pipelines consist of the following key components.
DataFrame - The Apache Spark ML API uses DataFrames provided in the Spark SQL library to hold a variety of data types such as text, feature vectors, labels and predictions.
Transformer - A Transformer is an algorithm that transforms one DataFrame into another DataFrame.
Estimator - An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer.
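A compact Java sketch of a pipeline that chains a Transformer (Tokenizer), a feature extractor (HashingTF) and an Estimator (LogisticRegression), assuming hypothetical training and test DataFrames with "text" and "label" columns:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01);

// Chain the stages into a single Pipeline (itself an Estimator)
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{tokenizer, hashingTF, lr});

// Fitting the pipeline produces a PipelineModel (a Transformer)
PipelineModel model = pipeline.fit(training);
Dataset<Row> predictions = model.transform(test);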
Spark MLlib machine learning library provides the following feature extraction algorithms.
TF-IDF - Term frequency-inverse document frequency (TF-IDF) is a feature extraction algorithm that determines the importance of a term to a document.
Word2Vec - Word2Vec is an estimator algorithm which takes a sequence of words and generates a Word2VecModel which can be used as features for prediction, document similarity and other similar calculations.
CountVectorizer - CountVectorizer is a feature extraction algorithm that converts a collection of text documents to vectors of token counts, which can be passed to learning algorithms.
Tokenizer - Tokenizer breaks text into smaller terms, usually words (see the sketch after this list).
StopWordsRemover - StopWordsRemover takes a sequence of strings as input and removes all stop words from the input. Stop words are words that occur frequently in a document but carry little meaning.
n-gram - An n-gram contains a sequence of n tokens, usually words, where n is an integer. NGram takes as input a sequence of strings and outputs a sequence of n-grams.
Binarizer - Binarizer is a transformation algorithm that transforms numerical features into binary features based on a threshold value. Features greater than the threshold are set to 1, and features equal to or less than the threshold are set to 0.
PCA - Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated features into a set of linearly uncorrelated features called principal components.
PolynomialExpansion - PolynomialExpansion class provided in the Spark MLlib library implements the polynomial expansion algorithm. Polynomial expansion is the process of expanding features into a polynomial space, based on n-degree combination of original dimensions.
Discrete Cosine Transform - The discrete cosine transformation transforms a sequence in the time domain to another sequence in the frequency domain.
StringIndexer - StringIndexer encodes a column of string labels into a column of label indices.
IndexToString - IndexToString maps a column of label indices back to a column of original label strings.
OneHotEncoder - One-hot encoder maps a column of label indices to a column of binary vectors.
VectorIndexer - VectorIndexer helps index categorical features in dataset of vectors.
Interaction - Interaction is a transformer which takes vector or double-valued columns and generates a single vector column that contains the product of all combinations of one value from each input column.
Normalizer - Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm. This normalization can help standardize your input data and improve the behavior of learning algorithms.
StandardScaler - StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean.
MinMaxScaler - MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]).
MaxAbsScaler - MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to range [-1, 1] by dividing through the maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.
Bucketizer - Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users.
ElementwiseProduct - ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v and transforming vector, w, to yield a result vector.
SQLTransformer - SQLTransformer implements the transformations which are defined by SQL statement.
VectorAssembler - VectorAssembler is a transformer that combines a given list of columns into a single vector column.
QuantileDiscretizer - QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.
Imputer - The Imputer transformer completes missing values in a dataset, either using the mean or the median of the columns in which the missing values are located.
VectorSlicer - VectorSlicer is a selection algorithm that takes a feature vector as input and outputs a new feature vector that is a sub-array of the original features.
RFormula - RFormula selects columns specified by an R model formula. RFormula produces a vector column of features and a double or string column of labels.
ChiSqSelector - ChiSqSelector, which stands for Chi-Squared feature selection, operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to select features.
Locality Sensitive Hashing - LSH is a class of hashing techniques that hashes data points into buckets so that data points which are close to each other land in the same bucket with high probability, while data points that are far apart are very likely to be in different buckets. Locality Sensitive Hashing is used in clustering, approximate nearest neighbor search and outlier detection with large datasets.
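As an illustration of a few of the transformers listed above, here is a brief Java sketch, assuming a hypothetical DataFrame sentences with a "text" column:

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Split each sentence into words
Dataset<Row> words = new Tokenizer()
    .setInputCol("text").setOutputCol("words")
    .transform(sentences);

// Drop common stop words
Dataset<Row> filtered = new StopWordsRemover()
    .setInputCol("words").setOutputCol("filtered")
    .transform(words);

// Convert the remaining tokens into count vectors
CountVectorizerModel cvModel = new CountVectorizer()
    .setInputCol("filtered").setOutputCol("features")
    .fit(filtered);
Dataset<Row> vectors = cvModel.transform(filtered);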
Logistic Regression - Logistic regression is a classification algorithm that predicts categorical responses. Spark MLlib supports binomial logistic regression to predict a binary outcome and multinomial logistic regression to predict a multi-class outcome.
Decision tree classifier - Decision trees are a popular family of classification and regression methods.
Random forest classifier - Random forests are a popular family of classification and regression methods.
Gradient-boosted tree classifier - Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees.
Multilayer perceptron classifier - The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network. MLPC consists of multiple layers of nodes, where each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b, followed by an activation function.
Linear support vector machine - A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.
One-vs-Rest classifier - OneVsRest is an example of a machine learning reduction for performing multi-class classification given a base classifier that can perform binary classification efficiently. It is also known as 'One-vs-All'.
Naive Bayes - Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. The spark.ml implementation currently supports both multinomial naive Bayes and Bernoulli naive Bayes.
Linear Regression - Linear regression is a regression method that models the label as a linear combination of the input features.
Decision Tree Regression - Decision trees are a popular family of classification and regression methods.
Random Forest Regression - Random forests are a popular family of classification and regression methods.
Gradient-boosted tree regression - Gradient-boosted trees (GBTs) are a popular regression method using ensembles of decision trees
Survival Regression - Spark MLlib implements the Accelerated failure time (AFT) model which is a parametric survival regression model for censored data.
Isotonic Regression - Isotonic regression belongs to the family of regression algorithms; it fits a free-form, non-decreasing (monotonic) function to the data.
K-means - k-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters.
Latent Dirichlet allocation - Latent Dirichlet allocation (LDA) is a topic model. LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.
Bisecting k-means - Bisecting k-means is a kind of hierarchical clustering using a divisive (or “top-down”) approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Gaussian Mixture Model (GMM) - A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability.
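A minimal Java sketch of k-means clustering, assuming a hypothetical DataFrame dataset with a Vector "features" column:

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Cluster the data into three groups
KMeans kmeans = new KMeans().setK(3).setSeed(1L).setFeaturesCol("features");
KMeansModel model = kmeans.fit(dataset);

// Inspect the learned cluster centers
for (Vector center : model.clusterCenters()) {
    System.out.println(center);
}

// Assign each row to a cluster
Dataset<Row> clustered = model.transform(dataset);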
Collaborative filtering is commonly used for recommender systems. Spark MLlib uses the alternating least squares (ALS) algorithm to learn the latent factors, and its implementation has the following notable features.
Explicit vs. implicit feedback - The standard approach to matrix factorization based collaborative filtering treats the entries in the user-item matrix as explicit preferences given by the user to the item, for example, users giving ratings to movies.
Scaling of the regularization parameter - The regularization parameter regParam is scaled when solving each least squares problem by the number of ratings the user generated (when updating user factors) or the number of ratings the product received (when updating product factors).
Cold-start strategy - When making predictions using an ALSModel, it is common to encounter users and/or items in the test dataset that were not present when training the model. This typically occurs in two scenarios: in production, for new users or items with no rating history, and during cross-validation, when the evaluation split can contain users or items that are missing from the training split. By default Spark returns NaN predictions for these rows; setting the coldStartStrategy parameter to "drop" removes them from the output.
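A short Java sketch of ALS-based collaborative filtering, assuming hypothetical training and test DataFrames with "userId", "movieId" and "rating" columns:

import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.ml.recommendation.ALSModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

ALS als = new ALS()
    .setMaxIter(10)
    .setRegParam(0.1)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating");

ALSModel model = als.fit(training);

// Drop rows with NaN predictions for users/items unseen during training
model.setColdStartStrategy("drop");
Dataset<Row> predictions = model.transform(test);

// Top 10 recommendations for every user
Dataset<Row> userRecs = model.recommendForAllUsers(10);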
Spark SQL is a library provided in Apache Spark for processing structured data. Spark SQL provides various APIs that give Spark more information about the structure of the data and the computation being performed on that data. You can use SQL as well as the Dataset API to interact with Spark SQL.
A Dataset is a distributed collection of data that was introduced in Spark 1.6 that provides the benefits of RDDs plus the benefits of Spark SQL's optimized execution engine.
A Dataset can be constructed from Java objects and then manipulated using functional transformations such as map(), flatMap(), filter() etc.
Spark SQL supports two different methods for converting existing RDDs into Datasets.
1. Infer the schema using reflection - Spark SQL can automatically convert an existing RDD of JavaBeans into a DataFrame by using reflection. The bean info, obtained using reflection, defines the schema of the table.
2. Programmatically specifying the schema - Spark SQL supports programmatically specifying the schema in cases where the Java bean cannot be defined ahead of time. This is done in three steps.
1. Create an RDD of rows using the original RDD.
2. Create the schema represented by a StructType matching the structure of rows in the RDD.
3. Apply the schema to the RDD of rows via createDataFrame() method provided by SparkSession.
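A Java sketch of the three steps, assuming an existing SparkSession spark and a hypothetical JavaRDD<String> named peopleRDD of comma-separated "name,age" records:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// 1. Create an RDD of Rows from the original RDD
JavaRDD<Row> rowRDD = peopleRDD.map(record -> {
    String[] parts = record.split(",");
    return RowFactory.create(parts[0], Integer.parseInt(parts[1].trim()));
});

// 2. Create the schema represented by a StructType
List<StructField> fields = Arrays.asList(
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.IntegerType, true));
StructType schema = DataTypes.createStructType(fields);

// 3. Apply the schema to the RDD of Rows
Dataset<Row> peopleDF = spark.createDataFrame(rowRDD, schema);
peopleDF.createOrReplaceTempView("people");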
SparkSession is the entry point to a Spark application and holds information about the application and its configuration parameters and values. A SparkSession is created using the builder() method on the SparkSession class.
SparkSession spark = SparkSession.builder()
    .appName("Spark SQL Example")
    .config("config.option", "value")
    .getOrCreate();
A DataFrame is a Dataset organized into named columns. A DataFrame is equivalent to a Relational Database Table. DataFrames can be created from a variety of sources such as structured data files, external databases, Hive tables and Resilient Distributed Datasets.
Temporary views in Spark SQL are tied to the Spark session that created the view, and will not be available once the Spark session is terminated.
Global Temporary views are not tied to a Spark session, but can be shared across multiple Spark sessions. Global Temporary views are available until the Spark application is terminated. Global Temporary view is tied to a system database global_temp, and the view must be accessed using this qualified name.
df.createGlobalTempView("test");
spark.sql("SELECT * FROM global_temp.test").show();
spark.newSession().sql("SELECT * FROM global_temp.test").show();
DataFrames have built-in functions that provide common aggregations such as count(), countDistinct(), avg(), max(), min() etc.
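A small Java sketch of these aggregations, assuming a hypothetical DataFrame df with "department" and "salary" columns:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.max;

df.groupBy("department")
  .agg(countDistinct("salary").alias("distinct_salaries"),
       avg("salary").alias("avg_salary"),
       max("salary").alias("max_salary"))
  .show();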
You can create a Dataset from JSON files by calling the method read().json() on a SparkSession object. Spark SQL automatically infers the schema of the JSON file and loads it as a Dataset.
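For example, using the SparkSession from earlier and a placeholder file path:

Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");
people.printSchema();   // the schema is inferred automatically
people.show();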
Spark Streaming is a scalable, high-throughput, fault-tolerant library provided in Apache Spark for processing live data streams. Spark Streaming can ingest data from multiple sources such as Kafka, Flume, Kinesis or TCP sockets, and process this data using complex algorithms provided in the Spark API, including algorithms from the Spark MLlib and GraphX libraries. Processed data can be pushed to live dashboards, file systems and databases.
Apache Spark Streaming component receives live data streams from input sources such as Kafka, Flume, Kinesis etc. and divides them into batches. The Spark engine processes these input batches and produces the final stream of results in batches.
DStreams, or discretized streams, are the high-level abstraction provided in Spark Streaming that represents a continuous stream of data. DStreams can be created either from input sources such as Kafka, Flume or Kinesis, or by applying high-level operations on existing DStreams.
Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.
The StreamingContext object is the entry point to a Spark Streaming application and holds the SparkConf object and the batch interval for the streaming program; both are set when creating a new StreamingContext instance.
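A minimal Java word-count sketch using JavaStreamingContext, assuming a socket source on a placeholder host and port:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("Streaming Word Count").setMaster("local[2]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));  // 10-second batches

JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
JavaPairDStream<String, Integer> counts = lines
    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);

counts.print();            // output operation
jssc.start();              // start receiving and processing data
jssc.awaitTermination();   // wait for the computation to terminate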
Spark Streaming provides two different categories of built-in streaming sources.
Basic Sources - These sources are directly available in the StreamingContext API. Example: file systems and socket connections.
Advanced Sources - These are sources such as Kafka, Flume etc. that are provided in the Spark Streaming library through extra utility classes.
Spark Streaming supports transformations on DStreams that are similar to the transformations on RDDs. Key DStream transformations include the following.
map(func) - Returns a new DStream by passing each element of the source DStream through a function func.
flatMap(func) - Similar to map(), but each input item can be mapped to 0 or more output items.
filter(func) - Returns a new DStream by selecting only the records of the source DStream on which func returns true.
union(otherStream) - Returns a new DStream that contains the union of the elements in the source DStream and otherStream.
count() - Returns a new DStream of single-element RDDs containing the number of elements in each RDD of the source DStream.
reduce(func) - Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func.
reduceByKey(func) - When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
transform(func) - Returns a new DStream by applying an arbitrary RDD-to-RDD function to every RDD of the source DStream.
updateStateByKey(func) - Returns a new 'state' DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key.
Output operations on DStreams push the DStream's data to external systems such as a database or a file system. The following output operations can be performed on DStreams.
saveAsTextFiles() - Saves the DStream's data as text files.
saveAsObjectFiles() - Saves the DStream's data as SequenceFiles of serialized Java objects.
saveAsHadoopFiles() - Saves the DStream's data as Hadoop files.
foreachRDD() - A generic output operator that applies a function, func, to each RDD generated from the DStream.
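For instance, a brief sketch of foreachRDD(), reusing the counts DStream from the earlier word-count example; a real sink such as a database write would replace the print loop:

counts.foreachRDD(rdd -> {
    // runs on the driver for every batch; collect() is only safe for small RDDs
    rdd.collect().forEach(System.out::println);
});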