Apache Sqoop is a tool used for transferring data between Apache Hadoop clusters and relational databases.
Sqoop was originally developed by Cloudera. The name 'Sqoop' is a short form for 'SQL-to-Hadoop'.
Sqoop can import full or partial tables from a SQL database into HDFS in a variety of formats. Sqoop can also export data from HDFS to a SQL database.
Apache Sqoop uses Hadoop MapReduce to import data from a SQL database into HDFS.
Sqoop orchestrates the following three tasks to import data from a SQL database to HDFS.
1. Examines the table details and identifies the number of records that will be imported.
2. Creates and submits MapReduce jobs to the Hadoop cluster.
3. Fetches records from the SQL database and imports this data into HDFS.
Apache Sqoop is a command-line utility that has various commands to import data, export data, list tables, and so on. These commands are called tools in Sqoop. Following is the basic command-line syntax for using Apache Sqoop.
$ sqoop tool-name [tool-options]
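For example, the built-in 'help' and 'version' tools can be used to explore the available tools before running any imports or exports.
$ sqoop help            # lists all available Sqoop tools
$ sqoop help import     # shows the options supported by the 'import' tool
$ sqoop version         # prints the installed Sqoop version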
Sqoop follows the process below to import data from a relational database into HDFS.
1. Sqoop examines the relational database table from which the data has to be imported, and if possible determines the primary key.
2. Sqoop runs a boundary query on the table to determine the range (minimum and maximum values) of the split column, which is usually the primary key.
3. Sqoop divides this range by the number of mappers and configures the mapper tasks so that the load is distributed roughly equally among them (see the example below).
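The number of mappers and the split column can also be controlled explicitly. Below is a minimal sketch, assuming a 'customers' table with a 'customer_id' primary key (both placeholders), using the '--split-by' and '--num-mappers' options.
$ sqoop import --table customers \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--split-by customer_id \
--num-mappers 4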
You connect to a relational database with Sqoop using the '--connect' option. For example, the code below uses the 'list-tables' tool to connect to a MySQL database and list its tables.
$ sqoop list-tables \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword
You can import the data from all tables of a relational database into a specific HDFS directory using the 'import-all-tables' tool with the '--warehouse-dir' option.
$ sqoop import-all-tables \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--warehouse-dir /interviewgrid
You can import the data from a single table using the 'import' tool with the '--table' option. You can use the '--warehouse-dir' option to import the data under the '/customerdata' HDFS directory.
$ sqoop import --table customers \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--warehouse-dir /customerdata
You can import data from specific columns of a relational database table into HDFS using the option '--columns'.
The code snippet below imports the data from the columns 'first_name' and 'last_name' into HDFS.
$ sqoop import --table customers \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--columns 'first_name,last_name' \
--warehouse-dir /interviewgrid
By default, Sqoop imports data as comma-delimited text files. You can instruct Sqoop to use a different delimiter by using the '--fields-terminated-by' option.
The code snippet below imports the data as tab-delimited text files.
$ sqoop import --table customers \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--columns 'first_name,last_name' \
--warehouse-dir /interviewgrid \
--fields-terminated-by '\t'
You can compress the data imported by Sqoop by using the '--compression-codec' option.
$ sqoop import --table customers \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
In addition to text files, Sqoop supports importing data as Parquet files and Avro files.
You can tell Sqoop to import data as Parquet files by using the '--as-parquetfile' option, and as Avro files by using the '--as-avrodatafile' option, as shown in the examples below.
$ sqoop import --table customers \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--as-parquetfile
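Similarly, the snippet below is a minimal sketch of an Avro import using the '--as-avrodatafile' option (same placeholder connection details as above).
$ sqoop import --table customers \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--as-avrodatafile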
Similar to the 'import' tool, Sqoop provides the 'export' tool, which pushes data from HDFS to a relational database.
$ sqoop export \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--export-dir /interviewgrid/output \
--update-mode allowinsert \
--table customers
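When the target table already contains rows, '--update-mode allowinsert' is typically paired with the '--update-key' option so that existing rows are updated and new rows are inserted (an upsert). Below is a minimal sketch, assuming 'customer_id' is the table's key column (a placeholder).
$ sqoop export \
--connect jdbc:mysql://myhostname/interviewgrid \
--username myusername --password mypassword \
--export-dir /interviewgrid/output \
--table customers \
--update-key customer_id \
--update-mode allowinsert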