Spark SQL is a library provided in Apache Spark for processing structured data. Spark SQL provides various APIs that give Spark information about the structure of the data and the computation being performed on that data. You can use SQL as well as the Dataset API to interact with Spark SQL.
A Dataset, introduced in Spark 1.6, is a distributed collection of data that provides the benefits of RDDs together with the benefits of Spark SQL's optimized execution engine.
A Dataset can be constructed from Java objects and then manipulated using functional transformations such as map(), flatMap(), and filter().
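As a minimal sketch, assuming spark is an existing SparkSession (its creation is shown later in this section); the sample words are illustrative:

import java.util.Arrays;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Create a Dataset from an in-memory list of Java objects.
Dataset<String> words = spark.createDataset(
    Arrays.asList("spark", "sql", "dataset"), Encoders.STRING());

// Apply functional transformations: filter() keeps matching elements,
// map() converts each element (here, to its length).
Dataset<Integer> lengths = words
    .filter((FilterFunction<String>) w -> w.startsWith("s"))
    .map((MapFunction<String, Integer>) String::length, Encoders.INT());

lengths.show();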
Spark SQL supports two different methods for converting existing RDDs into Datasets (both are sketched after this list):
1. Inferring the schema using reflection - Spark SQL can automatically convert an existing RDD of JavaBeans into a DataFrame by using reflection. The bean info, obtained using reflection, defines the schema of the table.
2. Programmatically specifying the schema - Spark SQL supports programmatically specifying the schema in cases where the JavaBean cannot be defined ahead of time. This can be done in three steps.
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD.
3. Apply the schema to the RDD of Rows via the createDataFrame() method provided by SparkSession.
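A minimal sketch of both approaches, assuming spark is an existing SparkSession and jsc a JavaSparkContext wrapping spark.sparkContext(); Person is an illustrative JavaBean, and the sample data is made up:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Approach 1: infer the schema by reflection from an RDD of JavaBeans.
// Person is an illustrative bean with name/age properties and a (name, age) constructor.
JavaRDD<Person> peopleRDD = jsc.parallelize(Arrays.asList(new Person("Ann", 32)));
Dataset<Row> byReflection = spark.createDataFrame(peopleRDD, Person.class);
byReflection.show();

// Approach 2: programmatically specify the schema.
// Step 1: create an RDD of Rows from the original RDD.
JavaRDD<Row> rowRDD = jsc.parallelize(Arrays.asList("Ann,32", "Bob,41"))
    .map(line -> {
      String[] parts = line.split(",");
      return RowFactory.create(parts[0], Integer.parseInt(parts[1]));
    });
// Step 2: create the schema as a StructType matching the Row structure.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("name", DataTypes.StringType, false),
    DataTypes.createStructField("age", DataTypes.IntegerType, false)
});
// Step 3: apply the schema to the RDD of Rows.
Dataset<Row> bySchema = spark.createDataFrame(rowRDD, schema);
bySchema.show();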
SparkSession is the entry point to a Spark application; it holds information about the application and its configuration parameters and values. A SparkSession object can be created using the static builder() method on the SparkSession class.
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("Spark SQL Example")
    .config("config.option", "value")
    .getOrCreate();
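getOrCreate() returns the SparkSession that already exists in the application, if there is one, and creates a new session otherwise.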
A DataFrame is a Dataset organized into named columns; in the Java API it is represented as a Dataset<Row>. A DataFrame is conceptually equivalent to a table in a relational database. DataFrames can be created from a variety of sources, such as structured data files, external databases, Hive tables, and Resilient Distributed Datasets.
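For example, columns can be referenced by name when querying; a sketch assuming df is a DataFrame with illustrative columns name and age:

import static org.apache.spark.sql.functions.col;

df.printSchema();                    // print the named columns and their types
df.select(col("name")).show();       // project a column by name
df.filter(col("age").gt(30)).show(); // filter using a column expression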
Temporary views in Spark SQL are tied to the Spark session that created the view, and will not be available once the Spark session is terminated.
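A sketch of a session-scoped view, assuming df is an existing DataFrame; the view name test is illustrative:

// The view is visible only within the SparkSession that created it.
df.createOrReplaceTempView("test");
spark.sql("SELECT * FROM test").show();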
Global temporary views are not tied to a single Spark session and can be shared across multiple Spark sessions. Global temporary views remain available until the Spark application is terminated. A global temporary view is tied to the system-preserved database global_temp, and the view must be accessed using this qualified name.
// The view is registered under its unqualified name, but queried via global_temp.
df.createGlobalTempView("test");
spark.sql("SELECT * FROM global_temp.test").show();
// The view remains visible from a different session in the same application.
spark.newSession().sql("SELECT * FROM global_temp.test").show();
DataFrames have built-in functions, available in org.apache.spark.sql.functions, that provide common aggregations such as count(), countDistinct(), avg(), max(), and min().
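A sketch using the aggregate functions from org.apache.spark.sql.functions, assuming df has illustrative columns department and salary:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

// Aggregate over the whole DataFrame.
df.agg(avg("salary"), min("salary"), max("salary"), countDistinct("department")).show();

// Aggregate per group.
df.groupBy("department").agg(avg("salary"), max("salary")).show();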
You can create a Dataset from JSON files by calling read().json() on a SparkSession object. Spark SQL automatically infers the schema of the JSON file and loads it as a Dataset<Row>.
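A sketch, assuming spark is an existing SparkSession; people.json is an illustrative path, and by default Spark expects each line of the file to be a separate, self-contained JSON object (JSON Lines format):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The schema is inferred automatically from the JSON data.
Dataset<Row> people = spark.read().json("people.json");
people.printSchema();
people.show();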