Apache Sqoop
What is Sqoop?
Allows easy import and export of data from structured data stores:
o Relational Database
o Enterprise Data Warehouse
o NoSQL Datastore
Allows easy integration with Hadoop based systems:
o Hive
o HBase
o Oozie
Agenda
Motivation
Importing and exporting data using Sqoop
Provisioning Hive Metastore
Populating HBase tables
Sqoop Connectors
Current Status
Motivation
Structured data stored in databases and EDWs is not easily accessible for analysis in Hadoop.
Access to databases and EDWs from Hadoop clusters is problematic.
Forcing MapReduce to access data in databases/EDWs is repetitive, error-prone, and non-trivial.
Data preparation is often required for efficient consumption by Hadoop-based data pipelines.
Current methods of transferring data are inefficient and ad hoc.
Enter: Sqoop
A tool to automate data transfer between structured
datastores and Hadoop.
Highlights
Uses datastore metadata to infer structure definitions
Uses MapReduce framework to transfer data in parallel
Allows structure definitions to be provisioned in Hive
metastore
Provides an extension mechanism to incorporate high-performance connectors for external systems.
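The Hive provisioning highlight above can be sketched as a single import command. This is an illustrative example, not from the original deck: the database `acmedb`, credentials, and Hive table name are assumptions, and `--hive-import` asks Sqoop to create a matching table definition in the Hive metastore and load the imported data into it.

```shell
# Sketch: import a table and provision its structure in the Hive metastore.
# Connect string, credentials, and table names are illustrative assumptions.
sqoop import \
  --connect jdbc:mysql://localhost/acmedb \
  --username test --password '****' \
  --table ORDERS \
  --hive-import \              # create/load a Hive table from the import
  --hive-table orders          # name of the Hive table to provision
```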
Importing Data
mysql> describe ORDERS;
+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| ORDER_NUMBER    | int(11)     | NO   | PRI | NULL    |       |
| ORDER_DATE      | datetime    | NO   |     | NULL    |       |
| REQUIRED_DATE   | datetime    | NO   |     | NULL    |       |
| SHIP_DATE       | datetime    | YES  |     | NULL    |       |
| STATUS          | varchar(15) | NO   |     | NULL    |       |
| COMMENTS        | text        | YES  |     | NULL    |       |
| CUSTOMER_NUMBER | int(11)     | NO   |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+
7 rows in set (0.00 sec)
Importing Data
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password ****
...
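The import above can also be tuned explicitly. A hedged sketch, with flag values that are illustrative assumptions: `--split-by` names the column Sqoop uses to partition the work, and `-m` sets the number of parallel map tasks (four here, matching the four part files in the listing that follows).

```shell
# Sketch: the same import with explicit parallelism and output location.
# All values are illustrative assumptions.
sqoop import \
  --connect jdbc:mysql://localhost/acmedb \
  --username test --password '****' \
  --table ORDERS \
  --split-by ORDER_NUMBER \          # column used to split work across map tasks
  -m 4 \                             # run 4 parallel map tasks
  --target-dir /user/arvind/ORDERS   # HDFS directory for the imported data
```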
Importing Data
$ hadoop fs -ls
Found 32 items
....
drwxr-xr-x - arvind staff 0 2011-09-13 19:12 /user/arvind/ORDERS
....
$ hadoop fs -ls /user/arvind/ORDERS
Found 6 items
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_SUCCESS
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_logs
... 8826 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00000
... 8760 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00001
... 8841 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00002
... 8671 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00003
Exporting Data
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
--table ORDERS_CLEAN --username test --password **** \
--export-dir /user/arvind/ORDERS
...
INFO mapreduce.ExportJobBase: Transferred 34.7178 KB in 6.7482 seconds (5.1447 KB/sec)
INFO mapreduce.ExportJobBase: Exported 326 records.
$
Exporting Data
Exports can optionally use Staging Tables
Map tasks populate staging table
Each map write is broken down into many transactions
Staging table is then used to populate the target table in a
single transaction
In case of failure, staging table provides insulation from
data corruption.
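The staged-export flow above can be sketched as follows. This is an illustrative example: the staging table `ORDERS_STAGE` is an assumed, pre-created table with the same schema as the target, which map tasks populate before Sqoop moves the rows into `ORDERS_CLEAN` in a single transaction.

```shell
# Sketch: export via a staging table to insulate the target from partial failures.
# ORDERS_STAGE is an assumed staging table with the same schema as ORDERS_CLEAN.
sqoop export \
  --connect jdbc:mysql://localhost/acmedb \
  --username test --password '****' \
  --table ORDERS_CLEAN \
  --staging-table ORDERS_STAGE \   # map tasks write here first
  --clear-staging-table \          # empty the staging table before the export
  --export-dir /user/arvind/ORDERS
```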
Sqoop Connectors
Connector Mechanism allows creation of new connectors
that improve/augment Sqoop functionality.
Bundled connectors include:
o MySQL, PostgreSQL, Oracle, SQLServer, JDBC
o Direct MySQL, Direct PostgreSQL
Regular connectors are JDBC based.
Direct connectors use native tools for high-performance data transfer, for example the mysqlimport utility.
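Direct mode is selected with the `--direct` flag. A hedged sketch, reusing the assumed database from earlier slides: for a MySQL export, direct mode hands the data transfer to the native mysqlimport utility instead of going through JDBC.

```shell
# Sketch: the earlier export, switched to the direct (native-tool) path.
# Connect string, credentials, and table are illustrative assumptions.
sqoop export \
  --connect jdbc:mysql://localhost/acmedb \
  --username test --password '****' \
  --table ORDERS_CLEAN \
  --export-dir /user/arvind/ORDERS \
  --direct    # use mysqlimport rather than the JDBC path
```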
Current Status
Sqoop is currently in Apache Incubator
Status Page
http://incubator.apache.org/projects/sqoop.html
Mailing Lists
sqoop-user@incubator.apache.org
sqoop-dev@incubator.apache.org
Release
Current shipping version is 1.3.0
Sqoop Meetup
Thank you!
Q&A