Spark SQL is a library provided in Apache Spark for processing structured data. Spark SQL provides various APIs that give Spark information about the structure of the data and the computation being performed on that data. You can use SQL as well as the Dataset API to interact with Spark SQL.
A Dataset, introduced in Spark 1.6, is a distributed collection of data that provides the benefits of RDDs together with the benefits of Spark SQL's optimized execution engine.
A Dataset can be constructed from Java objects and then manipulated using functional transformations such as map(), flatMap(), and filter().
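As a minimal sketch, assuming spark is an existing SparkSession (its creation is shown later in this section); the sample words are illustrative:

import java.util.Arrays;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Create a Dataset from an in-memory list of Java objects.
Dataset<String> words = spark.createDataset(
    Arrays.asList("spark", "sql", "dataset"), Encoders.STRING());

// Apply functional transformations: filter() keeps matching elements,
// map() converts each element (here, to its length).
Dataset<Integer> lengths = words
    .filter((FilterFunction<String>) w -> w.startsWith("s"))
    .map((MapFunction<String, Integer>) String::length, Encoders.INT());

lengths.show();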
Spark SQL supports two different methods for converting existing RDDs into Datasets (both are sketched after this list):
1. Inferring the schema using reflection - Spark SQL can automatically convert an existing RDD of JavaBeans into a DataFrame by using reflection. The bean info, obtained using reflection, defines the schema of the table.
2. Programmatically specifying the schema - Spark SQL supports programmatically specifying the schema in cases where the JavaBean cannot be defined ahead of time. This can be done in three steps.
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD.
3. Apply the schema to the RDD of Rows via the createDataFrame() method provided by SparkSession.
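A minimal sketch of both approaches, assuming spark is an existing SparkSession and jsc a JavaSparkContext wrapping spark.sparkContext(); Person is an illustrative JavaBean, and the sample data is made up:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Approach 1: infer the schema by reflection from an RDD of JavaBeans.
// Person is an illustrative bean with name/age properties and a (name, age) constructor.
JavaRDD<Person> peopleRDD = jsc.parallelize(Arrays.asList(new Person("Ann", 32)));
Dataset<Row> byReflection = spark.createDataFrame(peopleRDD, Person.class);
byReflection.show();

// Approach 2: programmatically specify the schema.
// Step 1: create an RDD of Rows from the original RDD.
JavaRDD<Row> rowRDD = jsc.parallelize(Arrays.asList("Ann,32", "Bob,41"))
    .map(line -> {
      String[] parts = line.split(",");
      return RowFactory.create(parts[0], Integer.parseInt(parts[1]));
    });
// Step 2: create the schema as a StructType matching the Row structure.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("name", DataTypes.StringType, false),
    DataTypes.createStructField("age", DataTypes.IntegerType, false)
});
// Step 3: apply the schema to the RDD of Rows.
Dataset<Row> bySchema = spark.createDataFrame(rowRDD, schema);
bySchema.show();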
SparkSession is the entry point to a Spark application; it holds information about the application and its configuration parameters and values. A SparkSession object can be created using the static builder() method on the SparkSession class.
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("Spark SQL Example")
    .config("config.option", "value")
    .getOrCreate();
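getOrCreate() returns the SparkSession that already exists in the application, if there is one, and creates a new session otherwise.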
A DataFrame is a Dataset organized into named columns; in the Java API it is represented as a Dataset<Row>. A DataFrame is conceptually equivalent to a table in a relational database. DataFrames can be created from a variety of sources, such as structured data files, external databases, Hive tables, and Resilient Distributed Datasets.
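For example, columns can be referenced by name when querying; a sketch assuming df is a DataFrame with illustrative columns name and age:

import static org.apache.spark.sql.functions.col;

df.printSchema();                    // print the named columns and their types
df.select(col("name")).show();       // project a column by name
df.filter(col("age").gt(30)).show(); // filter using a column expression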
Temporary views in Spark SQL are tied to the Spark session that created the view, and will not be available once the Spark session is terminated.
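A sketch of a session-scoped view, assuming df is an existing DataFrame; the view name test is illustrative:

// The view is visible only within the SparkSession that created it.
df.createOrReplaceTempView("test");
spark.sql("SELECT * FROM test").show();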
Global temporary views are not tied to a single Spark session and can be shared across multiple Spark sessions. Global temporary views remain available until the Spark application is terminated. A global temporary view is tied to the system-preserved database global_temp, and the view must be accessed using this qualified name.
// The view is registered under its unqualified name, but queried via global_temp.
df.createGlobalTempView("test");
spark.sql("SELECT * FROM global_temp.test").show();
// The view remains visible from a different session in the same application.
spark.newSession().sql("SELECT * FROM global_temp.test").show();
DataFrames have built-in functions, available in org.apache.spark.sql.functions, that provide common aggregations such as count(), countDistinct(), avg(), max(), and min().
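A sketch using the aggregate functions from org.apache.spark.sql.functions, assuming df has illustrative columns department and salary:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

// Aggregate over the whole DataFrame.
df.agg(avg("salary"), min("salary"), max("salary"), countDistinct("department")).show();

// Aggregate per group.
df.groupBy("department").agg(avg("salary"), max("salary")).show();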
You can create a Dataset from JSON files by calling read().json() on a SparkSession object. Spark SQL automatically infers the schema of the JSON file and loads it as a Dataset<Row>.
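A sketch, assuming spark is an existing SparkSession; people.json is an illustrative path, and by default Spark expects each line of the file to be a separate, self-contained JSON object (JSON Lines format):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The schema is inferred automatically from the JSON data.
Dataset<Row> people = spark.read().json("people.json");
people.printSchema();
people.show();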