Big Data. Lecture. Chapter 3 Hive - Distributed Data Warehouse
0 Huawei Confidential
Foreword
⚫ The Apache Hive data warehouse software helps read, write, and manage large data
sets that reside in distributed storage by using SQL. Structures can be projected onto
stored data. The command line tool and JDBC driver are provided to connect users to
Hive.
Objectives
Contents
1. Hive Overview
Introduction to Hive
⚫ Hive is a data warehouse tool that runs on Hadoop and supports PB-level distributed data query and management.
⚫ Hive features:
    - Flexible extraction, transformation, and load (ETL)
    - Multiple computing engines, such as Tez and Spark
    - Direct access to HDFS files and HBase
    - Easy to use and easy to program
Application Scenarios of Hive
Data warehouse:
⚫ Data extraction
⚫ Data loading
⚫ Data transformation
Comparison Between Hive and Traditional Data Warehouses (1)
Execution engine
    Hive: Tez (default)
    Traditional data warehouse: You can select more efficient algorithms to perform queries, or take more optimization measures to speed up the queries.
Comparison Between Hive and Traditional Data Warehouses (2)
Advantages of Hive
1. High Reliability and Fault Tolerance
    1. Cluster deployment of HiveServer
    2. Double MetaStores
    3. Timeout retry mechanism
2. SQL-like
    1. SQL-like syntax
    2. Large number of built-in functions
3. Scalability
    1. User-defined storage format
    2. User-defined function
4. Multiple APIs
    1. Beeline
    2. JDBC
    3. Thrift
    4. ODBC
Hive Architecture
[Figure: Hive architecture. Components shown: JDBC, ODBC, Web Interface, Thrift Server, Driver (Compiler, Optimizer, Executor), MetaStore.]
Hive Running Process
⚫ The client submits the HQL command.
⚫ Tez executes the query.
⚫ YARN allocates resources to applications in the cluster and enables authorization for Hive jobs in the YARN queue.
⚫ Hive updates data in HDFS or the Hive warehouse based on the table type.
⚫ Hive returns the query result through the JDBC connection.
[Figure: HQL statement -> Hive -> Tez (default) -> YARN -> HDFS]
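The engine and the compiled plan behind these steps can be inspected from a Beeline session. The following is a minimal sketch, assuming the invites table from the DDL Operations slide already exists; the output depends on the cluster configuration, so none is shown.

```sql
-- Print the execution engine configured for the current session (for example, tez).
SET hive.execution.engine;

-- Show the plan Hive compiles for a query without running the query.
EXPLAIN
SELECT a.bar, count(*)
FROM invites a
WHERE a.ds = '2008-08-15'
GROUP BY a.bar;
```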
Data Storage Model of Hive
[Figure: storage hierarchy, from top to bottom: Database, Table, Partition, Bucket.]
Partition and Bucket
⚫ Partition: Data tables can be partitioned based on the value of a certain field.
    - Each partition is a directory.
    - The number of partitions is not fixed.
    - Partitions or buckets can be created in a partition.
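A partitioned and bucketed table could be declared as in the sketch below. The table page_views and its columns are hypothetical and do not come from the slides. After the ADD PARTITION statement, the partition exists as a directory named ds=2008-08-15 under the table directory.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (ds STRING)               -- each ds value becomes its own directory
CLUSTERED BY (user_id) INTO 4 BUCKETS;   -- rows are hashed on user_id into 4 buckets per partition

-- Add a partition explicitly, then list the partitions of the table.
ALTER TABLE page_views ADD PARTITION (ds='2008-08-15');
SHOW PARTITIONS page_views;
```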
Managed Table and External Table
⚫ Hive can create managed tables and external tables.
    - By default, a managed table is created, and Hive moves data to the data warehouse directory.
    - When an external table is created, Hive accesses data outside the warehouse directory.
    - If all processing is performed by Hive, you are advised to use managed tables.
    - If you want to use Hive and other tools to process the same data set, you are advised to use external tables.
Operation     Managed Table                                 External Table
CREATE/LOAD   Data is moved to the warehouse directory.     Data stays in its original location.
DROP          The metadata and data are deleted together.   Only the metadata is deleted.
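The difference can be sketched in HiveQL as follows. The table names and the HDFS path /data/shared/logs are hypothetical, chosen only for illustration.

```sql
-- Managed table: Hive owns the data under its warehouse directory.
CREATE TABLE logs_managed (line STRING);

-- External table: Hive records only the metadata; the files stay at LOCATION.
-- '/data/shared/logs' is a hypothetical HDFS path.
CREATE EXTERNAL TABLE logs_external (line STRING)
LOCATION '/data/shared/logs';

-- Check the table type (MANAGED_TABLE or EXTERNAL_TABLE) and the data location.
DESCRIBE FORMATTED logs_external;

DROP TABLE logs_managed;    -- deletes the metadata and the data
DROP TABLE logs_external;   -- deletes the metadata only; the files remain at the LOCATION path
```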
Functions Supported by Hive
⚫ Built-in Hive Functions
    - Mathematical functions, such as round(), floor(), abs(), and rand().
    - Date functions, such as to_date(), month(), and day().
    - String functions, such as trim(), length(), and substr().
⚫ User-Defined Function (UDF)
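Several of the built-in functions listed above can be tried in a single query. This is a minimal sketch with literal inputs chosen for illustration; it assumes a Hive version that allows SELECT without a FROM clause (0.13 or later).

```sql
SELECT round(3.14159, 2),                -- 3.14
       floor(3.7),                       -- 3
       abs(-5),                          -- 5
       to_date('2008-08-15 10:30:00'),   -- 2008-08-15
       month('2008-08-15'),              -- 8
       day('2008-08-15'),                -- 15
       trim('  hive  '),                 -- hive
       length('hive'),                   -- 4
       substr('warehouse', 1, 4);        -- ware
```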
Hive Usage
⚫ Running HiveServer2 and Beeline:
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
⚫ Running HCatalog:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
Hive SQL Overview
⚫ DDL (Data Definition Language):
    - Creates, modifies, and deletes tables, and defines partitions and data types.
⚫ DML (Data Manipulation Language):
    - Imports and exports data.
⚫ DQL (Data Query Language):
    - Performs simple queries.
    - Performs complex queries such as GROUP BY, ORDER BY, and JOIN.
DDL Operations
-- Create a table:
hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- Describe a table:
hive> DESCRIBE invites;
-- Rename a table (assumes a table named events already exists):
hive> ALTER TABLE events RENAME TO 3koobecaf;
-- Add a column to a table:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
DML Operations
-- Load data to a table:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
DQL Operations (1)
--SELECTS and FILTERS:
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
--GROUP BY:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
DQL Operations (2)
--MULTITABLE INSERT:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200;
--JOIN:
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
--STREAMING:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
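The examples above do not cover ORDER BY. A minimal sketch, assuming the invites table and the partition loaded in the DML Operations slide; the alias cnt is introduced here for illustration.

```sql
-- Count rows per bar value in one partition and list the ten most frequent values.
SELECT a.bar, count(*) AS cnt
FROM invites a
WHERE a.ds = '2008-08-15'
GROUP BY a.bar
ORDER BY cnt DESC
LIMIT 10;
```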
Summary
⚫ This course introduces Hive application scenarios, basic principles, Hive architecture,
running process, and common Hive SQL statements.
Quiz