Big Data. Lecture. Chapter 3 Hive - Distributed Data Warehouse

Foreword

⚫ The Apache Hive data warehouse software facilitates reading, writing, and managing large data sets that reside in distributed storage by using SQL. Structure can be projected onto data already in storage. A command line tool and a JDBC driver are provided to connect users to Hive.

Objectives

⚫ Upon completion of this course, you will be able to describe:


 Hive application scenarios and basic principles
 Hive architecture and running process
 Hive SQL statements

Contents

1. Hive Overview

2. Hive Functions and Architecture

3. Basic Hive Operations

Introduction to Hive
⚫ Hive is a data warehouse tool that runs on Hadoop and supports distributed query and management of PB-level data.
⚫ Hive features:
 Flexible extract, transform, and load (ETL)
 Multiple computing engines, such as Tez and Spark
 Direct access to HDFS files and HBase
 Easy to use and easy to program

Application Scenarios of Hive

⚫ Data mining
 User behavior analysis
 Interest partition
 Area display
⚫ Non-real-time data analysis
 Log analysis
 Text analysis
⚫ Data summarization
 Daily/weekly user clicks
 Traffic statistics
⚫ Data warehouse
 Data extraction
 Data loading
 Data transformation
Comparison Between Hive and Traditional Data Warehouses (1)

⚫ Storage
 Hive: HDFS is used to store data, so storage can theoretically be expanded without limit.
 Conventional data warehouse: Clusters are used to store data and have an upper capacity limit. As the capacity grows, the computing speed decreases sharply, so such warehouses are applicable only to commercial applications with small data volumes.
⚫ Execution engine
 Hive: Tez (default).
 Conventional data warehouse: More efficient algorithms or additional optimization measures can be selected to speed up queries.
⚫ Usage method
 Hive: HQL (SQL-like).
 Conventional data warehouse: SQL.
⚫ Flexibility
 Hive: Metadata storage is independent of data storage, decoupling metadata and data.
 Conventional data warehouse: Low flexibility; data can be used only for limited purposes.
⚫ Analysis speed
 Hive: Computing depends on the cluster scale, and the cluster is easy to expand. With a large amount of data, computing is much faster than in a common data warehouse.
 Conventional data warehouse: When the data volume is small, the data processing speed is high; when the data volume is large, the speed decreases sharply.

Comparison Between Hive and Traditional Data Warehouses (2)

⚫ Index
 Hive: Low efficiency.
 Conventional data warehouse: High efficiency.
⚫ Ease of use
 Hive: Self-developed application models are needed, featuring high flexibility but delivering low usability.
 Conventional data warehouse: A set of mature report solutions is integrated to facilitate data analysis.
⚫ Reliability
 Hive: Data is stored in HDFS, implementing high data reliability and fault tolerance.
 Conventional data warehouse: The reliability is low; if a query fails, the task must be started again. Data fault tolerance depends on hardware RAID.
⚫ Environment dependency
 Hive: Low dependency on hardware; applicable to common machines.
 Conventional data warehouse: Highly dependent on high-performance business servers.
⚫ Price
 Hive: Open-source product, free of charge.
 Conventional data warehouse: Expensive in commercial use.

Advantages of Hive

⚫ High reliability and fault tolerance: cluster deployment of HiveServer, double MetaStores, and a timeout retry mechanism.
⚫ SQL-like: SQL-like syntax and a large number of built-in functions.
⚫ Scalability: user-defined storage formats and user-defined functions (UDFs).
⚫ Multiple APIs: Beeline, JDBC, Thrift, and ODBC.

Contents

1. Hive Overview

2. Hive Functions and Architecture

3. Basic Hive Operations

Hive Architecture

(Architecture diagram) Hive consists of:
⚫ Client interfaces: JDBC, ODBC, Web Interface, and Thrift Server
⚫ Driver (compiler, optimizer, executor)
⚫ MetaStore
⚫ Execution engines: Tez, MapReduce, and Spark

Hive Running Process
⚫ The client submits the HQL statement.
⚫ Tez (the default engine) executes the query.
⚫ YARN allocates resources to applications in the cluster and enables authorization for Hive jobs in the YARN queue.
⚫ Hive updates the data in HDFS or the Hive warehouse, depending on the table type.
⚫ Hive returns the query result through the JDBC connection.
(Diagram: HQL statement → Hive → Tez (default) → YARN → HDFS)
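As noted above, Tez is the default engine, but Hive also supports MapReduce and Spark. A minimal sketch of switching engines for the current session using the upstream hive.execution.engine property (assuming the chosen engine is actually installed on the cluster):

-- Show the engine currently used by this session:
hive> SET hive.execution.engine;
-- Switch the session to another supported engine (tez, mr, or spark):
hive> SET hive.execution.engine=spark;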

Data Storage Model of Hive

(Diagram) A database contains tables; a table can be divided into partitions, which can in turn be divided into buckets; a table can also separate skewed data from normal data.

Partition and Bucket
⚫ Partition: A table can be partitioned based on the value of a certain field.
 Each partition is a directory.
 The number of partitions is not fixed.
 Sub-partitions or buckets can be created inside a partition.
⚫ Bucket: Data can be stored in different buckets.
 Each bucket is a file.
 The number of buckets is specified when the table is created, and data within a bucket can be sorted.
 Data is hashed on the value of a field and then stored in the corresponding bucket, as shown in the example below.
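A minimal sketch of how partitions and buckets are declared at table-creation time; the table and column names (user_logs, uid, action, ds) are hypothetical and used only for illustration:

-- Each distinct value of ds becomes its own directory; rows are hashed on uid into 4 bucket files, sorted within each bucket:
hive> CREATE TABLE user_logs (uid INT, action STRING)
      PARTITIONED BY (ds STRING)
      CLUSTERED BY (uid) SORTED BY (uid) INTO 4 BUCKETS;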

Managed Table and External Table
⚫ Hive can create managed tables and external tables.
 By default, a managed table is created, and Hive moves data to the data warehouse directory.
 When an external table is created, Hive accesses data outside the warehouse directory.
 If all processing is performed by Hive, you are advised to use managed tables.
 If you want to use Hive and other tools to process the same data set, you are advised to use external
tables.

⚫ CREATE/LOAD
 Managed table: Data is moved to the warehouse directory.
 External table: The data stays in its original location.
⚫ DROP
 Managed table: The metadata and data are deleted together.
 External table: Only the metadata is deleted.
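A minimal sketch contrasting the two table types; the table names and the HDFS path '/data/ext_pokes' are hypothetical:

-- Managed table: Hive stores the data under its warehouse directory and deletes it on DROP.
hive> CREATE TABLE managed_pokes (foo INT, bar STRING);
-- External table: Hive only records the location; DROP removes the metadata and leaves the files.
hive> CREATE EXTERNAL TABLE ext_pokes (foo INT, bar STRING) LOCATION '/data/ext_pokes';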

Functions Supported by Hive
⚫ Built-in Hive Functions
 Mathematical functions, such as round(), floor(), abs(), and rand().
 Date functions, such as to_date(), month(), and day().
 String functions, such as trim(), length(), and substr().
⚫ User-defined functions (UDFs): custom functions that users implement and register when the built-in functions do not meet their requirements.
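A quick sketch of the built-in functions listed above, assuming Hive 0.13 or later (which allows SELECT without a FROM clause):

-- Mathematical, date, and string functions:
hive> SELECT round(2.6), floor(2.6), abs(-1), rand();
hive> SELECT to_date('2008-08-15 12:00:00'), month('2008-08-15'), day('2008-08-15');
hive> SELECT trim('  hive  '), length('hive'), substr('warehouse', 1, 4);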

Contents

1. Hive Overview

2. Hive Functions and Architecture

3. Basic Hive Operations

Hive Usage
⚫ Running HiveServer2 and Beeline:
$ $HIVE_HOME/bin/hiveserver2

$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT

⚫ Running HCatalog:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh

⚫ Running WebHCat (Templeton):


$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh

Hive SQL Overview
⚫ DDL - Data Definition Language:
 Creates, modifies, and deletes tables, partitions, and data types.
⚫ DML - Data Manipulation Language:
 Imports and exports data.
⚫ DQL - Data Query Language:
 Performs simple queries.
 Performs complex queries such as GROUP BY, ORDER BY, and JOIN.

DDL Operations
-- Create a table:
hive> CREATE TABLE pokes (foo INT, bar STRING);

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

-- Browse the tables:
hive> SHOW TABLES;

-- Describe a table:
hive> DESCRIBE invites;

-- Modify a table:
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
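Column separators and the storage format can also be declared when a table is created. A minimal sketch; the table name csv_pokes is hypothetical:

-- Fields in the source files are separated by commas; each line is one record:
hive> CREATE TABLE csv_pokes (foo INT, bar STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;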

DML Operations
-- Load data to a table:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

-- Export data to HDFS:

hive> EXPORT TABLE invites TO '/department';
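LOAD DATA can also read from HDFS instead of the local file system: omitting the LOCAL keyword makes Hive treat the path as an HDFS path and move (not copy) the file into the table directory. A sketch with a hypothetical HDFS path:

-- Load a file that already resides in HDFS:
hive> LOAD DATA INPATH '/tmp/kv3.txt' OVERWRITE INTO TABLE pokes;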

DQL Operations (1)
--SELECTS and FILTERS:

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

--GROUP BY:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
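The overview earlier also lists ORDER BY among the complex queries; a minimal sketch against the same invites table:

--ORDER BY:
hive> SELECT a.foo, a.bar FROM invites a WHERE a.ds='2008-08-15' ORDER BY a.foo DESC;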

DQL Operations (2)
--MULTITABLE INSERT:

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200;

--JOIN:
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

--STREAMING:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE
a.ds > '2008-08-09';

Summary

⚫ This course introduces the application scenarios and basic principles of Hive, the Hive architecture and running process, and common Hive SQL statements.

Quiz

1. (Multiple-choice) Which of the following scenarios are applicable to Hive? ( )


A. Online real-time data analysis
B. Data mining (including user behavior analysis, region of interest, and regional display)
C. Data summary (daily/weekly user clicks and click ranking)
D. Non-real-time analysis (log analysis and statistical analysis)
2. (Single-choice) Which of the following statements about basic Hive SQL operations is correct? ( )
A. You need to use the keyword "external" to create an external table and specify the keyword "internal" to create
a normal table.
B. The location information must be specified when an external table is created.
C. When data is loaded to Hive, the source data must be a path in HDFS.
D. Column separators can be specified when a table is created.

