Ingesting Data
© Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lesson Objectives
After completing this lesson, students should be able to:
⬢ Describe data ingestion
⬢ Describe Batch/Bulk ingestion options
– Ambari HDFS Files View
– CLI & WebHDFS
– NFS Gateway
– Sqoop
⬢ Describe streaming framework alternatives
– Flume
– Storm
– Spark Streaming
– HDF / NiFi
Ingestion Overview
Batch/Bulk Ingestion
Streaming Alternatives
Data Input Options
⬢ NFS Gateway
⬢ MapReduce
⬢ hdfs dfs -put
⬢ WebHDFS
⬢ HDFS APIs
⬢ Vendor Connectors
Real-Time Versus Batch Ingestion Workflows
Real-time and batch processing are very different.
Data
– Age: Real-Time – usually less than 15 minutes old; Batch – historical, usually more than 15 minutes old
– Location: Real-Time – primarily in memory, moved to disk after processing; Batch – primarily on disk, moved to memory for processing
Processing
– Speed: Real-Time – sub-second to a few seconds; Batch – a few seconds to hours
– Frequency: Real-Time – always running; Batch – sporadic to periodic
Clients
– Who: Real-Time – automated systems only; Batch – human and automated systems
– Type: Real-Time – primarily operational applications; Batch – primarily analytical applications
Batch/Bulk Ingestion
Ambari Files View
The Files View is an Ambari Web UI plug-in providing a graphical interface to HDFS. It supports common file operations:
⬢ Create a directory
⬢ Upload a file
⬢ Rename a directory
⬢ Go up one directory
⬢ Delete to Trash or permanently
⬢ Move to another directory
⬢ Go to a directory
⬢ Download to the local system
The Hadoop Client
⬢ The put command uploads data to HDFS
⬢ Perfect for inputting local files into HDFS
⬢ Useful in batch scripts
⬢ Usage:
hdfs dfs -put mylocalfile /some/hdfs/path
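For example, a minimal batch script might stage a day's local exports into a dated HDFS directory; the local and HDFS paths below are illustrative assumptions, not part of the course material:

#!/bin/bash
# Hypothetical nightly load: copy local CSV exports into a dated HDFS directory
DAY=$(date +%Y-%m-%d)
hdfs dfs -mkdir -p /data/incoming/$DAY
hdfs dfs -put /var/exports/*.csv /data/incoming/$DAY/
hdfs dfs -ls /data/incoming/$DAY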
WebHDFS
⬢ REST API for accessing all of the HDFS file system interfaces:
– [Link]
– [Link]
– [Link]
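As a sketch of how the API is used, file operations map onto HTTP calls; the NameNode host, port, and paths below are assumptions (50070 was the common NameNode HTTP port on HDP 2.x clusters):

# Create a directory
curl -i -X PUT "http://namenode-host:50070/webhdfs/v1/data/new-dir?op=MKDIRS&user.name=hdfs"
# List a directory
curl -s "http://namenode-host:50070/webhdfs/v1/data?op=LISTSTATUS&user.name=hdfs"
# Upload a file: the first call returns a 307 redirect to a DataNode; PUT the bytes to that URL
curl -i -X PUT "http://namenode-host:50070/webhdfs/v1/data/myfile.csv?op=CREATE&user.name=hdfs"
curl -i -X PUT -T myfile.csv "<Location header returned by the previous call>"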
NFS Gateway
⬢ Uses NFS standard and supports all HDFS commands
⬢ No random writes
Diagram: file writes by the app user travel from the NFS client over NFSv3 to the NFS Gateway; the gateway's embedded DFSClient talks to the NameNode (NN) via ClientProtocol and streams data to the DataNodes (DN) via DataTransferProtocol.
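Once the gateway is running, a client host can mount HDFS like any NFS export and legacy tools can copy files with ordinary OS commands. A minimal sketch, assuming a gateway host and mount point of your choosing (the mount options below follow the typical HDFS NFS Gateway guidance):

sudo mkdir -p /mnt/hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync nfs-gateway-host:/ /mnt/hdfs
# Legacy applications can now write into HDFS with plain file operations
cp /var/exports/daily.csv /mnt/hdfs/data/incoming/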
Sqoop: Database Import/Export
1. Client executes a sqoop command
2. Sqoop executes the command as a MapReduce job on the Hadoop cluster (using Map-only tasks)
3. Plugins provide connectivity to various data sources: relational databases, enterprise data warehouses, and document-based systems
The Sqoop Import Tool
The import command has the following requirements:
⬢ Must specify a connect string using the --connect argument
⬢ Credentials can be included in the connect string or supplied using the --username and --password arguments
⬢ Must specify either a table to import using --table or the results of a SQL query using --query
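Before running an import it can be useful to confirm the connect string and credentials. A quick sketch, assuming the nyse database used in the following examples and a hypothetical dbuser account:

# List the tables visible through the connect string (-P prompts for the password)
sqoop list-tables --connect jdbc:mysql://host/nyse --username dbuser -P
# Run an ad-hoc query against the source to sanity-check the data
sqoop eval --connect jdbc:mysql://host/nyse --username dbuser -P \
  --query "SELECT COUNT(*) FROM StockPrices"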
Importing a Table
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --table StockPrices \
  --target-dir /data/stockprice/ \
  --as-textfile
Importing Specific Columns
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --table StockPrices \
  --columns StockSymbol,Volume,High,ClosingPrice \
  --target-dir /data/dailyhighs/ \
  --as-textfile \
  --split-by StockSymbol \
  -m 10
Importing from a Query
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --query "SELECT * FROM StockPrices s WHERE s.Volume >= 1000000 AND \$CONDITIONS" \
  --target-dir /data/highvolume/ \
  --as-textfile \
  --split-by StockSymbol
The Sqoop Export Tool
⬢ The export command transfers data from HDFS to a database:
– Use --table to specify the database table
– Use --export-dir to specify the data to export
⬢ Rows are appended to the table by default
⬢ If you define --update-key, existing rows are updated with the new data
⬢ Use --call to invoke a stored procedure (instead of specifying the --table argument)
Exporting to a Table
sqoop export \
  --connect jdbc:mysql://host/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t"
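A variation of the same export that updates matching rows instead of appending, using --update-key; the LogId key column below is hypothetical:

sqoop export \
  --connect jdbc:mysql://host/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --update-key LogId \
  --input-fields-terminated-by "\t"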
Streaming Alternatives
Flume: Data Streaming
A Flume agent is a background process consisting of a Source, a Channel, and a Sink. Data such as log data, event data, and social media streams enters through the Source, and the Sink delivers events to the Hadoop cluster. Flume uses a Channel between the Source and Sink to decouple the processing of events from the storing of events.
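As a minimal sketch under assumed paths and names (not part of the course material), an agent might tail an application log into HDFS and then be started with the flume-ng command:

cat > /etc/flume/conf/agent.conf <<'EOF'
# agent a1: exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/flume/events
a1.sinks.k1.channel = c1
EOF
flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/agent.conf --name a1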
Storm Topology Overview
⬢ Storm data processing occurs in a topology
⬢ A topology consists of spout and bolt components connected by streams
⬢ Spouts bring data into the topology
⬢ Bolts can (but are not required to) persist data, including to HDFS
Diagram: a Storm topology in which spouts emit streams of data that flow through one or more bolts.
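Topologies are packaged as jars and submitted to the cluster with the storm client; the jar, class, and topology names below are hypothetical:

storm jar my-topology.jar com.example.MyTopology my-topology-name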
Message Queues
Various types of message queues are often the source of the data processed by real-time processing engines like Storm.
Real-time data sources (operating systems, services and applications, sensors) write log entries, events, errors, status messages, and so on to a message queue such as Kestrel, RabbitMQ, AMQP, Kafka, or JMS; Storm then reads the data from the queue.
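For example, the Kafka command-line tools bundled with HDP can exercise a queue end to end; the broker and ZooKeeper addresses and the topic name below are assumptions (HDP's Kafka broker typically listened on port 6667):

# Create a test topic
kafka-topics.sh --zookeeper zk-host:2181 --create --topic events --partitions 1 --replication-factor 1
# Publish a message, as a data source would
echo "sensor-42 status=OK" | kafka-console-producer.sh --broker-list broker-host:6667 --topic events
# Consume it back, as a Storm spout (or any other consumer) would
kafka-console-consumer.sh --bootstrap-server broker-host:6667 --topic events --from-beginning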
Spark Streaming
⬢ Streaming Applications consist of the same components as a Core application,
but add the concept of a receiver
⬢ The receiver is a process running on an executor
Diagram: streaming data flows into the Receiver, which produces DStreams that Spark Core processes to generate output.
Spark Streaming’s Micro-Batch Approach
⬢ Micro-batches are created at regular time intervals
– Receiver takes the data and starts filling up a batch
– After the batch duration completes, data is shipped off
– Each batch forms a collection of data entities that are processed together
HDF with HDP – A Complete Big Data Solution
Hortonworks DataFlow (HDF), powered by Apache NiFi, collects data from the Internet of Anything and delivers perishable insights from data in motion. HDF stores data and metadata in the Hortonworks Data Platform (HDP), powered by Apache Hadoop, which enriches the flow with context and provides historical insights. Hortonworks DataFlow and the Hortonworks Data Platform deliver the industry's most complete Big Data solution.
Big Data Ingestion with HDF
HDF workflows and Storm/Spark streaming workflows can be coupled
Diagram: HDF ingests raw network streams, network metadata streams, syslog, raw application logs, and other streaming telemetry, and feeds them into Kafka on the Hadoop cluster; Storm and Spark (running on YARN) process the streams and write results to data stores such as HBase (via Phoenix), Hive, SOLR, and HDFS.
Knowledge Check
Questions
1. What tool is used for importing data from a RDBMS?
2. List two ways to easily script moving files into HDFS.
3. True/False? Storm operates on micro-batches.
4. Name the popular open-source messaging component that is bundled with HDP.
Summary
⬢ There are many different ways to ingest data, including custom solutions written against the HDFS APIs as well as vendor connectors
⬢ Streaming and batch workflows can work together in a holistic system
⬢ The NFS Gateway may help some legacy systems populate data into HDFS
⬢ Sqoop's configurable number of database connections can overload an RDBMS
⬢ The following are streaming frameworks:
– Flume
– Storm
– Spark Streaming
– HDF / NiFi