Hive
Gaurav Ojha
Visiting Faculty
What Is Hive?
• Hive is a component of the Hadoop stack.
• It is an open-source data warehouse tool
that runs on top of Hadoop.
• It was developed by Facebook and later
donated to the Apache Software Foundation.
• It reads, writes, and manages big data
tables stored in HDFS or other data
sources.
What Is Hive?
• Hive is not designed for row-level insert, delete, and update
operations; it is used to perform analytics, mining, and report
generation on a large data warehouse.
• Hive uses the Hive Query Language (HiveQL), which is similar to SQL.
• Most of its syntax is similar to that of the MySQL database.
• It is used for OLAP (Online Analytical Processing) purposes.
• In Hive, OLAP (Online Analytical Processing) involves performing complex
analysis and querying of large datasets stored in Hadoop's distributed
file system (HDFS) using HiveQL, the query language similar to SQL.
While Hive is not a traditional OLAP system like some relational
databases, it can be used to perform OLAP-like tasks on big data.
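As an illustration of an OLAP-like task, the sketch below (using a hypothetical sales table with product_id, region, and revenue columns) computes revenue per product and region plus subtotals in one pass, using HiveQL's WITH ROLLUP extension to GROUP BY:

```sql
-- Revenue by product and region, with rollup subtotals.
-- (sales is a hypothetical table, not defined elsewhere in these slides.)
SELECT product_id, region, SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_id, region WITH ROLLUP;
```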
Why We Need Hive?
• In 2006, Facebook was generating 10 GB of data per
day, and by 2007 this had grown to 1 TB per day.
• Soon after, it was generating 15 TB of data per day.
• Initially, Facebook used the Scribe server, an Oracle
database, and Python scripts to process its large data
sets. As its data volumes grew, Facebook shifted to
Hadoop as its key tool for data analysis and processing.
Why Do We Need Hive?
• Facebook was using Hadoop to manage its big
data, but faced problems with ETL operations,
because every small operation required writing
a Java program.
• This demanded many Java developers, who are
difficult to find, and Java is not easy to learn.
• So Facebook developed Hive, which uses SQL-like
syntax that is easy to learn and write.
• Hive makes it easy for people who already know
SQL to work with big data.
Hive Features
1. It is a Data Warehousing tool.
2. It is used for enterprise data wrangling.
3. It uses the SQL-like language HiveQL (HQL). HQL is a non-procedural,
declarative language.
4. It is used for OLAP operations.
5. It increases productivity by reducing 100 lines of Java code into 4 lines
of HQL queries.
6. It supports Table, Partition, and Bucket data structures.
7. It is built on top of Hadoop Distributed File System (HDFS)
8. Hive supports Tez, Spark, and MapReduce.
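The Table, Partition, and Bucket structures from feature 6 can all appear in one CREATE TABLE statement. The sketch below uses a hypothetical users table (not part of the later examples): each country value gets its own partition directory, and rows within a partition are hashed by user_id into a fixed number of bucket files:

```sql
-- Hypothetical table illustrating partitioning and bucketing.
CREATE TABLE users (
  user_id INT,
  name    STRING
)
PARTITIONED BY (country STRING)      -- one HDFS directory per country
CLUSTERED BY (user_id) INTO 4 BUCKETS -- 4 hash buckets per partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';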
What is Data Warehousing?
• Data warehousing in Hive refers to the process of
organizing, storing, and managing large volumes
of structured and semi-structured data in a way
that facilitates efficient querying and
analysis.
• It involves using Hive, a data warehousing
infrastructure built on top of Hadoop, to create
and manage data warehouses where data can be
stored, processed, and analyzed for business
intelligence and decision-making purposes.
What Is Enterprise Data Wrangling?
• Enterprise data wrangling refers to the process of
preparing and transforming raw data from various
sources into a usable format for analysis,
reporting, and other data-related tasks within an
organization.
• It involves cleaning, structuring, enriching, and
integrating data from multiple sources to make it
accessible and actionable for business users,
analysts, and data scientists.
Non-Procedural and Declarative
Language
• Non-procedural languages focus on describing what should be
done rather than how it should be done. In other words, they
emphasize the end result rather than the step-by-step
process to achieve it. These languages abstract away the
control flow and implementation details, allowing the system
to determine the most efficient way to execute the task.
• Declarative languages are a broader category that includes
non-procedural languages. They allow users to define the
desired outcome or properties of a solution without
specifying the exact sequence of steps or algorithms to
achieve it.
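HiveQL itself is such a language. The query below (a sketch using the employee table defined later in these slides) only states what result is wanted; Hive decides how to compute it, e.g. which MapReduce/Tez jobs to run:

```sql
-- Declarative: we describe the result (total salary per department),
-- not the steps; Hive's compiler chooses the execution strategy.
SELECT dep, SUM(salary) AS total_salary
FROM employee
GROUP BY dep;
```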
Hive Architecture
• Shell/CLI: An interactive interface for writing queries.
• Driver: Handles sessions, and fetches and executes operations.
• Compiler: Parses, plans, and optimizes the code.
• Execution: In this phase, MapReduce jobs are submitted to Hadoop
and executed.
• Metastore: The Metastore is a central repository that stores the
metadata. It keeps all the details about tables, partitions, and
buckets.
• Example: Suppose we have a large dataset containing information about
sales transactions for a retail company, stored in HDFS. We want to
analyze this data using Hive to gain insights into sales performance.
Shell/CLI
• The Shell or Command-Line Interface (CLI) is an interactive
interface where users can write and execute HiveQL queries
to interact with the Hive system.
• Users can use the Hive shell to submit queries, manage
tables, and perform other operations
Driver
• The Driver is responsible for handling user
sessions, interpreting queries submitted through
the shell/CLI, and coordinating the execution of
these queries.
• It interacts with other components of the Hive
architecture to process user queries.
Compiler
• The Compiler receives the HiveQL queries submitted by users
and performs several tasks:
o Parsing: It parses the query to understand its syntactic
structure and extract relevant information.
o Planning: It generates an execution plan based on the query's
logical structure, determining the sequence of operations needed
to fulfill the query.
o Optimization: It optimizes the execution plan to improve query
performance by considering factors such as data locality, join
order, and filter pushdown.
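The plan the compiler produces can be inspected from the shell with EXPLAIN. A sketch, using the employee table defined later in these slides:

```sql
-- Prints the parsed/optimized execution plan instead of running the query.
EXPLAIN
SELECT location, AVG(salary)
FROM employee
GROUP BY location;
```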
Execution
• In this phase, the optimized execution plan is translated into one
or more MapReduce jobs, which are submitted to the Hadoop cluster
for execution.
• These MapReduce jobs process the data stored in HDFS according to
the query's requirements and produce the desired result.
Example:
• Suppose the query requires aggregating sales data by product ID.
The execution phase would involve MapReduce jobs that read and
process the sales data, performing the aggregation operation as
specified in the query.
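In HiveQL, that aggregation could be written as follows (a sketch assuming a sales table with the columns listed on the Metastore slide); Hive compiles this single statement into the MapReduce jobs described above:

```sql
-- Aggregate the hypothetical sales table by product.
SELECT product_id,
       SUM(quantity_sold) AS units,
       SUM(revenue)       AS total_revenue
FROM sales
GROUP BY product_id;
```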
Metastore
• The Metastore is a central repository that stores metadata
about Hive objects such as databases, tables, partitions,
columns, and storage properties. It keeps track of all the
details required to manage and query data stored in Hive.
Example:
• The Metastore stores metadata about the sales table,
including its schema (columns: transaction_id, product_id,
quantity_sold, revenue), data location in HDFS,
partitioning information (if any), and any associated
storage properties.
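The metadata held in the Metastore can be queried from the shell. A sketch, assuming the hypothetical sales table exists:

```sql
-- Schema, HDFS location, and storage properties from the Metastore:
DESCRIBE FORMATTED sales;
-- Partition list, if the table is partitioned:
SHOW PARTITIONS sales;
```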
Starting Hive
• At the terminal, type hive to start the interactive shell (the hive> prompt).
Database
• hive> show databases;
• hive> create database emp;
• hive> use emp;
Tables
hive>create table employee(
> emp_id int,
> name string,
> location string,
> dep string,
> designation string,
> salary int)
> row format delimited fields terminated by ',';
[Link]
101,Alice,New York,IT,Soft Engg,4000
102,Ali,Atlanta,Data Science,Sr Soft
Engg,4500
103,Chang,New York,Data Science,Lead,6000
104,Robin,Chicago,IT,Manager,7000
Load The Data From [Link]
• hive> load data local inpath
'/home/cloudera/[Link]' into
table employee;
• hive> select * from employee;
[Link]
101,2001,Web Portal
102,2002,NER Model
103,2003,OCR Model
104,2004,Web Portal
Load [Link] into a Table
hive> load data local inpath
'/home/cloudera/[Link]' into
table project;
hive> select * from project;
Join
• It is used to join two or more relations based on
a common column. Let's perform the JOIN
operation on the employee and project tables:
hive> select * from employee join project on
employee.emp_id=project.emp_id;
Group By
It is used to group the data based on a given
field or column in a table. Let's see an
example of Group By in the following query:
hive> select location, avg(salary) from
employee group by location;
Subquery
A subquery is a query within a query, i.e.,
a nested query. Here, the output of
one query becomes the input for
another query. Let's see an
example of a subquery in the
following query:
hive> select * from employee where
employee.emp_id in (select emp_id
from project where pname='Web
Portal');
Order By, Sort By
• ORDER BY: It always assures global ordering. It
is slower for large datasets because it pushes
all the data into a single reducer. In the final
output, you will get a single sorted output file.
• SORT BY: It orders the data at each reducer but
the reducer may have overlapping ranges of data.
In the final output, you will get multiple sorted
output files.
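The difference can be seen with the employee table from the earlier slides (reducer count shown is an illustrative setting, not a requirement):

```sql
-- ORDER BY: one reducer, one globally sorted output.
SELECT * FROM employee ORDER BY salary;

-- SORT BY: each reducer sorts its own share; with 2 reducers the
-- two output files are sorted but may cover overlapping salary ranges.
SET mapreduce.job.reduces = 2;
SELECT * FROM employee SORT BY salary;
```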
Hive Vs SQL
• SQL engines execute queries directly, while Hive translates
HiveQL queries into MapReduce (or Tez/Spark) jobs.
• Hive does not support a row-level update command, due to the
limitations and append-only nature of HDFS; Hive only has
INSERT OVERWRITE for update or insert functionality.
• SQL databases enforce a schema when data is written
(schema-on-write), while Hive applies the schema when data
is read (schema-on-read).
• SQL is designed for transactional (OLTP) workloads over
managed storage, while Hive targets analytical (OLAP)
workloads over files in HDFS.
Hive Limitations
• Hive is suitable for batch processing but
not suitable for real-time data handling.
• Update and delete are not allowed, but we can
delete in bulk, i.e., we can delete an entire
table but not individual rows.
• Hive is not suitable for OLTP(Online
Transactional Processing) operations.
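The usual workaround for the missing row-level delete is to rewrite the table with INSERT OVERWRITE, keeping only the rows that should survive. A sketch against the employee table from the earlier slides:

```sql
-- "Delete" the IT department by rewriting the table without it.
-- The SELECT runs first; its result then replaces the table's data.
INSERT OVERWRITE TABLE employee
SELECT * FROM employee WHERE dep <> 'IT';
```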