Big Data. Lecture. Chapter 3 Hive - Distributed Data Warehouse

Foreword

⚫ The Apache Hive data warehouse software facilitates reading, writing, and managing large data sets that reside in distributed storage by using SQL. Structure can be projected onto data already in storage. A command line tool and a JDBC driver are provided to connect users to Hive.

Objectives

⚫ Upon completion of this course, you will be able to describe:


 Hive application scenarios and basic principles
 Hive architecture and running process
 Hive SQL statements

Contents

1. Hive Overview

2. Hive Functions and Architecture

3. Basic Hive Operations

Introduction to Hive
⚫ Hive is a data warehouse tool that runs on Hadoop and supports distributed query and management of PB-level data.
⚫ Hive features:
 Flexible extract, transform, and load (ETL)
 Multiple computing engines, such as Tez and Spark
 Direct access to HDFS files and HBase
 Easy to use and easy to program

Application Scenarios of Hive

⚫ Data mining
 User behavior analysis
 Interest partition
 Area display
⚫ Non-real-time data analysis
 Log analysis
 Text analysis
⚫ Data summarization
 Daily/weekly user clicks
 Traffic statistics
⚫ Data warehouse
 Data extraction
 Data loading
 Data transformation
Comparison Between Hive and Traditional Data Warehouses (1)

⚫ Storage
 Hive: HDFS is used to store data, so storage can theoretically be expanded without limit.
 Conventional data warehouse: Clusters are used to store data and have an upper capacity limit. As the capacity grows, the computing speed decreases sharply, so such warehouses are applicable only to commercial applications with small data volumes.
⚫ Execution engine
 Hive: Tez (default).
 Conventional data warehouse: More efficient algorithms or additional optimization measures can be selected to speed up queries.
⚫ Usage method
 Hive: HQL (SQL-like).
 Conventional data warehouse: SQL.
⚫ Flexibility
 Hive: Metadata storage is independent of data storage, decoupling metadata and data.
 Conventional data warehouse: Low flexibility; data can be used only for limited purposes.
⚫ Analysis speed
 Hive: Computing depends on the cluster scale, and the cluster is easy to expand. With a large amount of data, computing is much faster than in a common data warehouse.
 Conventional data warehouse: When the data volume is small, the data processing speed is high; when the data volume is large, the speed decreases sharply.

Comparison Between Hive and Traditional Data Warehouses (2)

⚫ Index
 Hive: Low efficiency.
 Conventional data warehouse: High efficiency.
⚫ Ease of use
 Hive: Self-developed application models are needed, featuring high flexibility but delivering low usability.
 Conventional data warehouse: A set of mature report solutions is integrated to facilitate data analysis.
⚫ Reliability
 Hive: Data is stored in HDFS, implementing high data reliability and fault tolerance.
 Conventional data warehouse: The reliability is low; if a query fails, the task must be started again. Data fault tolerance depends on hardware RAID.
⚫ Environment dependency
 Hive: Low dependency on hardware; applicable to common machines.
 Conventional data warehouse: Highly dependent on high-performance business servers.
⚫ Price
 Hive: Open-source product, free of charge.
 Conventional data warehouse: Expensive in commercial use.

Advantages of Hive

⚫ High reliability and fault tolerance: cluster deployment of HiveServer, double MetaStores, and a timeout retry mechanism.
⚫ SQL-like: SQL-like syntax and a large number of built-in functions.
⚫ Scalability: user-defined storage formats and user-defined functions (UDFs).
⚫ Multiple APIs: Beeline, JDBC, Thrift, and ODBC.

Contents

1. Hive Overview

2. Hive Functions and Architecture

3. Basic Hive Operations

Hive Architecture

(Architecture diagram) Hive consists of:
⚫ Client interfaces: JDBC, ODBC, Web Interface, and Thrift Server
⚫ Driver (compiler, optimizer, executor)
⚫ MetaStore
⚫ Execution engines: Tez, MapReduce, and Spark

Hive Running Process
⚫ The client submits the HQL statement.
⚫ Tez (the default engine) executes the query.
⚫ YARN allocates resources to applications in the cluster and enables authorization for Hive jobs in the YARN queue.
⚫ Hive updates the data in HDFS or the Hive warehouse, depending on the table type.
⚫ Hive returns the query result through the JDBC connection.
(Diagram: HQL statement → Hive → Tez (default) → YARN → HDFS)
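As noted above, Tez is the default engine, but Hive also supports MapReduce and Spark. A minimal sketch of switching engines for the current session using the upstream hive.execution.engine property (assuming the chosen engine is actually installed on the cluster):

-- Show the engine currently used by this session:
hive> SET hive.execution.engine;
-- Switch the session to another supported engine (tez, mr, or spark):
hive> SET hive.execution.engine=spark;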

Data Storage Model of Hive

(Diagram) A database contains tables; a table can be divided into partitions, which can in turn be divided into buckets; a table can also separate skewed data from normal data.

Partition and Bucket
⚫ Partition: A table can be partitioned based on the value of a certain field.
 Each partition is a directory.
 The number of partitions is not fixed.
 Sub-partitions or buckets can be created inside a partition.
⚫ Bucket: Data can be stored in different buckets.
 Each bucket is a file.
 The number of buckets is specified when the table is created, and data within a bucket can be sorted.
 Data is hashed on the value of a field and then stored in the corresponding bucket, as shown in the example below.
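A minimal sketch of how partitions and buckets are declared at table-creation time; the table and column names (user_logs, uid, action, ds) are hypothetical and used only for illustration:

-- Each distinct value of ds becomes its own directory; rows are hashed on uid into 4 bucket files, sorted within each bucket:
hive> CREATE TABLE user_logs (uid INT, action STRING)
      PARTITIONED BY (ds STRING)
      CLUSTERED BY (uid) SORTED BY (uid) INTO 4 BUCKETS;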

Managed Table and External Table
⚫ Hive can create managed tables and external tables.
 By default, a managed table is created, and Hive moves data to the data warehouse directory.
 When an external table is created, Hive accesses data outside the warehouse directory.
 If all processing is performed by Hive, you are advised to use managed tables.
 If you want to use Hive and other tools to process the same data set, you are advised to use external
tables.

⚫ CREATE/LOAD
 Managed table: Data is moved to the warehouse directory.
 External table: The data stays in its original location.
⚫ DROP
 Managed table: The metadata and data are deleted together.
 External table: Only the metadata is deleted.
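A minimal sketch contrasting the two table types; the table names and the HDFS path '/data/ext_pokes' are hypothetical:

-- Managed table: Hive stores the data under its warehouse directory and deletes it on DROP.
hive> CREATE TABLE managed_pokes (foo INT, bar STRING);
-- External table: Hive only records the location; DROP removes the metadata and leaves the files.
hive> CREATE EXTERNAL TABLE ext_pokes (foo INT, bar STRING) LOCATION '/data/ext_pokes';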

Functions Supported by Hive
⚫ Built-in Hive Functions
 Mathematical functions, such as round(), floor(), abs(), and rand().
 Date functions, such as to_date(), month(), and day().
 String functions, such as trim(), length(), and substr().
⚫ User-defined functions (UDFs): custom functions that users implement and register when the built-in functions do not meet their requirements.
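A quick sketch of the built-in functions listed above, assuming Hive 0.13 or later (which allows SELECT without a FROM clause):

-- Mathematical, date, and string functions:
hive> SELECT round(2.6), floor(2.6), abs(-1), rand();
hive> SELECT to_date('2008-08-15 12:00:00'), month('2008-08-15'), day('2008-08-15');
hive> SELECT trim('  hive  '), length('hive'), substr('warehouse', 1, 4);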

Contents

1. Hive Overview

2. Hive Functions and Architecture

3. Basic Hive Operations

Hive Usage
⚫ Running HiveServer2 and Beeline:
$ $HIVE_HOME/bin/hiveserver2

$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT

⚫ Running HCatalog:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh

⚫ Running WebHCat (Templeton):


$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh

Hive SQL Overview
⚫ DDL - Data Definition Language:
 Creates, modifies, and deletes tables, partitions, and data types.
⚫ DML - Data Manipulation Language:
 Imports and exports data.
⚫ DQL - Data Query Language:
 Performs simple queries.
 Performs complex queries such as GROUP BY, ORDER BY, and JOIN.

DDL Operations
-- Create a table:
hive> CREATE TABLE pokes (foo INT, bar STRING);

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

-- Browse the tables:
hive> SHOW TABLES;

-- Describe a table:
hive> DESCRIBE invites;

-- Modify a table:
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
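Column separators and the storage format can also be declared when a table is created. A minimal sketch; the table name csv_pokes is hypothetical:

-- Fields in the source files are separated by commas; each line is one record:
hive> CREATE TABLE csv_pokes (foo INT, bar STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;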

DML Operations
-- Load data to a table:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

-- Export data to HDFS:

hive> EXPORT TABLE invites TO '/department';
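LOAD DATA can also read from HDFS instead of the local file system: omitting the LOCAL keyword makes Hive treat the path as an HDFS path and move (not copy) the file into the table directory. A sketch with a hypothetical HDFS path:

-- Load a file that already resides in HDFS:
hive> LOAD DATA INPATH '/tmp/kv3.txt' OVERWRITE INTO TABLE pokes;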

DQL Operations (1)
--SELECTS and FILTERS:

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

--GROUP BY:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
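The overview earlier also lists ORDER BY among the complex queries; a minimal sketch against the same invites table:

--ORDER BY:
hive> SELECT a.foo, a.bar FROM invites a WHERE a.ds='2008-08-15' ORDER BY a.foo DESC;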

DQL Operations (2)
--MULTITABLE INSERT:

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200;

--JOIN:
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

--STREAMING:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE
a.ds > '2008-08-09';

Summary

⚫ This course introduces the application scenarios and basic principles of Hive, the Hive architecture and running process, and common Hive SQL statements.

Quiz

1. (Multiple-choice) Which of the following scenarios are applicable to Hive? ( )


A. Online real-time data analysis
B. Data mining (including user behavior analysis, region of interest, and regional display)
C. Data summary (daily/weekly user clicks and click ranking)
D. Non-real-time analysis (log analysis and statistical analysis)
2. (Single-choice) Which of the following statements about basic Hive SQL operations is correct? ( )
A. You need to use the keyword "external" to create an external table and specify the keyword "internal" to create
a normal table.
B. The location information must be specified when an external table is created.
C. When data is loaded to Hive, the source data must be a path in HDFS.
D. Column separators can be specified when a table is created.

