
SESSION 2017-2018

[Link] (CSE) YEAR: III SEMESTER: VI


INTRODUCTION TO HIVE

(CSE6005)
MODULE 2 (L6)
Presented By
Vivek Kumar
Dept of Computer Engineering & Applications
GLA University India
Agenda
Introduction to Hive

Learning Objectives:
1. To study the Hive architecture.
2. To study the Hive file formats.
3. To study the Hive Query Language.

Learning Outcomes:
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation language statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.
Agenda

 What is Hive?
 Hive Architecture
 Hive Data Types
 Primitive Data Types
 Collection Data Types
 Hive File Format
 Text File
 Sequential File
 RCFile (Record Columnar File)
Agenda …

 Hive Query Language
 DDL (Data Definition Language) Statements
 DML (Data Manipulation Language) Statements
 Database
 Tables
 Partitions
 Buckets
 Aggregation
 Group By and Having
 SerDe
Case Study: Retail
 Major Indian retailers, including Future Group, Reliance Industries, Tata Group and Aditya Birla Group, use Hive.
 One of these retail groups, let us call it BigX, wanted its last 5 years of semi-structured data analyzed for trends and patterns.
 Let us see how we can solve their problem using Hadoop.
Case Study: Retail cont..
About BigX
 BigX is a chain of hypermarkets in India. There are currently 220+ stores across 85 cities and towns in India, employing 35,000+ people. Its annual revenue for the year 2011 was USD 1 billion. It offers a wide range of products, including fashion and apparel, food products, books, furniture, electronics, health care, general merchandise and entertainment sections.
Case Study: Retail cont..
Problem Scenario
1. One of the BigX log datasets that needed to be analyzed was approximately 12 TB in overall size and held 5 years of vital information in semi-structured form.
Case Study: Retail cont..
2. Traditional business intelligence (BI) tools are good up to a certain scale, usually several hundred gigabytes. But when the scale is of the order of terabytes and petabytes, these frameworks become inefficient. Also, BI tools work best when data is present in a known, pre-defined schema. The particular dataset from BigX was mostly logs, which did not conform to any specific schema.
Case Study: Retail cont..
3. It took around 12+ hours to move the data into their business intelligence systems bi-weekly. BigX wanted to reduce this time drastically.
4. Querying such a large dataset took too long.
Case Study: Retail cont..
Solution
 This is where Hadoop shines in all its glory as a solution. Since the size of the logs dataset is 12 TB, at such a large scale the problem is two-fold:
 Problem 1: Moving the logs dataset to HDFS periodically
 Problem 2: Performing the analysis on this HDFS dataset
Case Study: Retail cont..
Solution of Problem 1
 Since the logs are unstructured in this case, Sqoop was of little or no use, so Flume was used to move the log data periodically into HDFS.
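A minimal sketch of what such a Flume agent configuration might look like; the agent name, spool directory and HDFS path below are hypothetical, not taken from the case study:

# Hypothetical Flume agent: move local log files into HDFS
agent.sources = logsrc
agent.channels = ch1
agent.sinks = hdfssink

# Watch a local directory for completed log files (hypothetical path)
agent.sources.logsrc.type = spooldir
agent.sources.logsrc.spoolDir = /var/log/bigx
agent.sources.logsrc.channels = ch1

# Buffer events in memory between source and sink
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000

# Write events into date-partitioned HDFS directories
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = /data/bigx/logs/%Y-%m-%d
agent.sinks.hdfssink.hdfs.fileType = DataStream
agent.sinks.hdfssink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfssink.channel = ch1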
Case Study: Retail cont..
Solution of Problem 2
 Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. It provides an SQL-like language called HiveQL and converts queries into MapReduce tasks.
Hive in this Case Study
 Hive uses "Schema on Read", unlike a traditional database, which uses "Schema on Write".
 While reading log files, the simplest recommended approach during Hive table creation is to use a RegexSerDe, as sketched below.
 By default, Hive metadata is stored in an embedded Derby database, which allows only one user to issue queries at a time. This is not ideal for production, where a standalone metastore database such as MySQL is typically used.
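As a hedged illustration of that RegexSerDe approach (the table name, columns, regular expression and location here are hypothetical, not from the BigX dataset; each capturing group in input.regex maps, in order, to one STRING column):

-- Hypothetical external table over raw log lines
CREATE EXTERNAL TABLE IF NOT EXISTS BIGX_LOGS (
  ts      STRING,
  level   STRING,
  message STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(\\S+) (\\S+) (.*)$")
LOCATION '/data/bigx/logs';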
Conclusion - Case Study: Retail
 Using the Hadoop system, log transfer time was reduced to ~3 hours bi-weekly, and querying time was also significantly improved.
 Thanks to Vijay (Big Data Lead at 8KMiles, who holds an M.Tech in Information Retrieval from IIIT-B) for this case study.
 [Link] retail-analysis/
What is Hive?
 Hive is a data warehousing tool used to query structured data, built on top of Hadoop.
 Facebook created Hive to manage its ever-growing volumes of data. Hive makes use of the following:
1. HDFS for storage
2. MapReduce for execution
3. An RDBMS for storing metadata
What is Hive ?
 Apache Hive is a popular SQL interface for batch processing on Hadoop.
 Hadoop was built to organize and store massive amounts of data.
 Hive gives another way to access data inside the cluster in an easy, quick way.
 Hive provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard.
 Hive was one of the earliest projects to bring higher-level languages to Apache Hadoop.
 Hive gives analysts and data scientists the ability to access data without being experts in Java.
 Hive gives structure to data on HDFS.
 This interface to Hadoop
 not only accelerates the time required to produce results from data analysis,
 it significantly broadens who can use Hadoop and MapReduce.
 Let us take a moment to thank the Facebook team:
 Hive was developed by the Facebook Data team and, after being used internally,
 it was contributed to the Apache Software Foundation.
 Currently Hive is freely available as an open-source project.
What Hive is not?
 Hive is not a relational database; it uses a database to store metadata, but the data that Hive processes is stored in HDFS.
 Hive is not designed for online transaction processing (OLTP).
 Hive is not suited for real-time queries and row-level updates; it is best used for batch jobs over large sets of immutable data such as web logs.
Typical Use-Case of Hive
 Hive takes a large amount of unstructured data and places it into a structured view.
 Hive supports use cases such as ad-hoc queries, summarization and data analysis.
 HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs) and table functions (UDTFs).
 It converts SQL queries into MapReduce jobs.
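A minimal sketch of how a custom UDF is typically plugged into HiveQL; the jar path, function name and Java class below are hypothetical:

-- Load the jar that contains the UDF implementation (hypothetical path)
ADD JAR /home/hadoop/udfs/my-udfs.jar;
-- Register the Java class as a temporary function
CREATE TEMPORARY FUNCTION normalize_name AS 'com.example.hive.NormalizeName';
-- Use it like any built-in function
SELECT normalize_name(name) FROM STUDENT;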
Features of Hive
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as
structs, lists, and maps.
4. Hive supports SQL filters, group-by and
order-by clauses.
Prerequisites of Hive in
Hadoop
 The prerequisites for setting up Hive and running queries are:
1. A stable build of Hadoop
2. Java 1.6 installed on the machine
3. Basic Java programming skills
4. Basic SQL knowledge
 Start all the services of Hadoop using the command $ start-all.sh.
 Once all services are running, use $ hive to start Hive.

Hive Integration and
Workflow
 Hourly log data can be stored directly into HDFS.
 Data cleaning and log compression are then performed on the log files.
 Finally, Hive tables can be created over the cleaned data.
[Diagram: Hourly Log → Hadoop HDFS → Log Compression → Hive Table 1 / Hive Table 2]
Hive Architecture
[Diagram: Hive architecture. The Command-Line Interface, Hive Web Interface and Hive Server (Thrift) all communicate with the Driver (Query Compiler, Executor), which consults the Metastore and submits jobs to Hadoop (JobTracker, TaskTracker, HDFS).]
Hive Architecture
The various parts are as follows:
 Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact with Hive.
 Hive Web Interface: A simple graphical user interface to interact with Hive and execute queries.
 Hive Server: An optional server. It can be used to submit Hive jobs from a remote client.
 JDBC/ODBC: Jobs can be submitted from a JDBC client. One can write Java code to connect to Hive and submit jobs to it.
Hive Architecture
 Driver: Hive queries are sent to the driver for compilation, optimization and execution.
 Metastore: Hive table definitions and mappings to the data are stored in a metastore. A metastore consists of the following:
 Metastore service: Offers an interface to Hive.
 Database: Stores data definitions, mappings to the data and others.
 The metadata stored in the metastore includes IDs of databases, tables and indexes, the time of creation of a table, and the input format and output format used for a table.
Hive Architecture
 1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is allowed to connect to the metastore at a time. This is the default metastore for Hive; it is an Apache Derby database. In this mode, both the database and the metastore service run embedded in the main Hive Server process. Figure 9.8 shows an Embedded Metastore.
 2. Local Metastore: Metadata can be stored in any RDBMS component, such as MySQL. A local metastore allows multiple connections at a time. In this mode, the Hive metastore service runs in the main Hive Server process, but the metastore database runs in a separate process, and can sit on a separate machine.
Hive Architecture
 3. Remote Metastore: In this, the Hive driver
and the metastore interface run on different JVMs
(which can run on different machines as well) as
in Figure 9.10. This way the database can be fire-
walled from the Hive user and also database
credentials are completely isolated from the
users of Hive.
Hive Data Units
Hive Data Model Contd.
 Tables
- Analogous to relational tables.
- Each table has a corresponding directory in HDFS.
- Data is serialized and stored as files within that directory.
- Hive has default serialization built in, which supports compression and lazy deserialization.
- Users can specify custom serialization/deserialization schemes (SerDes).
Hive Data Model Contd.
 Partitions
- Each table can be broken into partitions.
- Partitions determine the distribution of data within subdirectories.
Example:
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
Each partition is then split out into its own folder, such as
Sales/country=US/year=2012/month=12
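A short hedged sketch of why this layout matters: a query that filters on the partition columns reads only the matching subdirectory.

SELECT sale_id, amount
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;
-- Partition pruning: only Sales/country=US/year=2012/month=12 is scanned.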
Hierarchy of Hive Partitions

/hivebase/Sales
  /country=US
    /year=2012
      /month=11  → File
      /month=12  → File
  /country=CANADA
    /year=2012
    /year=2014
    /year=2015
      /month=11  → File
Partition
 The general definition of a partition is horizontally dividing the data into a number of slices in an equal and manageable manner.
 Every partition is stored as a directory within the data warehouse table.
 In data warehousing this partition concept is common, but there are two types of partitions available in data warehouse concepts:
 static partitions and dynamic partitions.
Hive Partition
 The main work of a Hive partition is the same as that of a SQL partition.
 The main difference between a SQL partition and a Hive partition is that a SQL partition is supported only for a single column in a table, whereas a Hive partition supports multiple columns in a table.
Hive Data Model Contd.

 Buckets
- Data in each partition is divided into buckets.
- Bucketing is based on a hash function of a column:
  hash(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory.
Hive Data Types
Numeric Types
TINYINT   1-byte signed integer
SMALLINT  2-byte signed integer
INT       4-byte signed integer
BIGINT    8-byte signed integer
FLOAT     4-byte single-precision floating-point number
DOUBLE    8-byte double-precision floating-point number

String Types
STRING
VARCHAR   Only available starting with Hive 0.12.0
CHAR      Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (").

Miscellaneous Types
BOOLEAN
BINARY    Only available starting with Hive 0.8.0
Hive Data Types cont..
Collection Data Types

STRUCT  Similar to a C struct. Fields are accessed using dot notation.
        E.g.: struct('John', 'Doe')
MAP     A collection of key-value pairs. Fields are accessed using [] notation.
        E.g.: map('first', 'John', 'last', 'Doe')
ARRAY   An ordered sequence of elements of the same type. Fields are accessed using an array index.
        E.g.: array('John', 'Doe')
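A small sketch that combines all three collection types in one table; the table name, columns and delimiters are hypothetical:

CREATE TABLE IF NOT EXISTS EMPLOYEE_CONTACT (
  name      STRING,
  phones    ARRAY<STRING>,
  addresses MAP<STRING, STRING>,
  full_name STRUCT<first:STRING, last:STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
-- Accessing the collection fields:
-- SELECT phones[0], addresses['home'], full_name.first FROM EMPLOYEE_CONTACT;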
Hive File Format
 Text File: The default file format is the text file.
 Sequential File: Sequential files are flat files that store binary key-value pairs.
 RCFile (Record Columnar File): RCFile stores the data in a column-oriented manner, which ensures that aggregation operations are not expensive.
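The file format is chosen with a STORED AS clause at table-creation time; a minimal sketch (the table names are hypothetical):

CREATE TABLE logs_txt (line STRING) STORED AS TEXTFILE;     -- the default
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE; -- binary key-value pairs
CREATE TABLE logs_rc  (line STRING) STORED AS RCFILE;       -- column-oriented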
Hive Query Language (HQL)
 Works on databases, tables, partitions and buckets (clusters).
 Creates and manages tables and partitions.
 Supports various relational, arithmetic and logical operators.
 Evaluates functions.
 Downloads the contents of a table to a local directory, or the results of queries to an HDFS directory.
Database

 To create a database named "STUDENTS" with a comment and database properties:
CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');
Database

 To describe a database:
DESCRIBE DATABASE STUDENTS;
 To show databases:
SHOW DATABASES;
 To drop a database:
DROP DATABASE STUDENTS;
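Two related statements worth knowing (a hedged addition): USE switches the current database, and dropping a database that still contains tables requires CASCADE.

-- Make STUDENTS the current database
USE STUDENTS;
-- Drop a non-empty database along with its tables
DROP DATABASE IF EXISTS STUDENTS CASCADE;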
Tables
 There are two types of tables in Hive:
 Managed tables
 External tables
 The difference between the two shows when you drop a table:
 if it is a managed table, Hive deletes both the data and the metadata;
 if it is an external table, Hive deletes only the metadata.
 Use the EXTERNAL keyword to create an external table.
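One quick way to check which kind a given table is (a hedged sketch, assuming the STUDENT table created on the next slide):

-- The "Table Type" field shows MANAGED_TABLE or EXTERNAL_TABLE
DESCRIBE FORMATTED STUDENT;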
Tables
To create a managed table named 'STUDENT':
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Tables
To create an external table named 'EXT_STUDENT':
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';
Tables
To load data into the table from a file named [Link]:
LOAD DATA LOCAL INPATH '/root/hivedemos/[Link]'
OVERWRITE INTO TABLE EXT_STUDENT;
To retrieve the student details from the 'EXT_STUDENT' table:
SELECT * FROM EXT_STUDENT;
Table ALTER Operations
 ALTER TABLE mytablename RENAME TO mt;
 ALTER TABLE mytable ADD COLUMNS (mycol STRING);
 ALTER TABLE name RENAME TO new_name;
 ALTER TABLE name DROP [COLUMN] column_name;
 ALTER TABLE name CHANGE column_name new_name new_type;
 ALTER TABLE name REPLACE COLUMNS (col_name data_type, ...);
Partitions
 Partitions split the larger dataset into more meaningful chunks.
 Hive provides two kinds of partitions: static partitions and dynamic partitions.
• To create a static partition based on the "gpa" column:
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To load data into the partitioned table from another table:
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa = 4.0;
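To verify that the partition directory was created (a hedged check, not from the original slide):

SHOW PARTITIONS STATIC_PART_STUDENT;
-- Expected output: gpa=4.0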
Partitions
• To create a dynamic partition based on the "gpa" column:
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To load data into the dynamic partition table from another table:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Note: Dynamic partition strict mode requires at least one static partition column. To turn this off, set hive.exec.dynamic.partition.mode=nonstrict.
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
Buckets
 Tables or partitions are sub-divided into buckets to provide extra structure to the data, which may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of a table.
 We can add partitions to a table by
altering the table. Let us assume we
have a table called employee with fields
such as Id, Name, Salary, Designation,
Dept, and yoj.
Buckets
• To create a bucketed table having 3 buckets:
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;
To load data into the bucketed table:
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, grade;
To display the content of the first bucket:
SELECT DISTINCT grade FROM STUDENT_BUCKET
TABLESAMPLE (BUCKET 1 OUT OF 3 ON grade);
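One caveat worth noting (a hedged addition, not from the original slide): on Hive versions before 2.0, bucketing has to be enforced before inserting, otherwise the insert will not actually split the data into 3 files.

-- Required on Hive < 2.0 so that inserts produce one file per bucket
SET hive.enforce.bucketing = true;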
Aggregations
 Hive supports aggregation functions such as avg, count, etc.
 To use the average and count aggregation functions:
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
Group by and Having

To use the GROUP BY and HAVING clauses:
SELECT rollno, name, gpa
FROM STUDENT
GROUP BY rollno, name, gpa
HAVING gpa > 4.0;
SerDe
 SerDe stands for Serializer/Deserializer.
 It contains the logic to convert unstructured data into records.
 SerDes are implemented using Java.
 Serializers are used at the time of writing.
 Deserializers are used at query time (SELECT statements).
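For comparison with the RegexSerDe used in the retail case study, a hedged sketch of a JSON SerDe table; the table and columns are hypothetical, and the HCatalog JsonSerDe class may require adding the hive-hcatalog-core jar first:

-- Each row of the backing files is one JSON object,
-- e.g. {"user_id": 42, "action": "click"}
CREATE TABLE IF NOT EXISTS EVENTS_JSON (
  user_id INT,
  action  STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';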
Fill in the blanks
 The metastore consists of ______________
and a ______________.
 The most commonly used interface to
interact with Hive is ______________.
 The default metastore for Hive is
______________.
 Metastore contains ______________ of Hive
tables.
 ______________ is responsible for
compilation, optimization, and execution
of Hive queries.
