Hive
Gaurav Ojha
Visiting Faculty
What Is Hive?
• Hive is a component of the Hadoop stack.
• It is an open-source data warehouse tool
that runs on top of Hadoop.
• It was developed by Facebook and later
donated to the Apache Software Foundation.
• It reads, writes, and manages big data
tables stored in HDFS or other data
sources.
What Is Hive?
• Hive is not designed for row-level insert, delete, and update
operations; it is used to perform analytics, mining, and report
generation on a large data warehouse.
• Hive uses the Hive Query Language (HiveQL), which is similar to SQL.
• Most of its syntax is similar to that of the MySQL database.
• It is used for OLAP (Online Analytical Processing) purposes.
• In Hive, OLAP (Online Analytical Processing) involves performing complex
analysis and querying of large datasets stored in Hadoop's distributed
file system (HDFS) using HiveQL, the query language similar to SQL.
While Hive is not a traditional OLAP system like some relational
databases, it can be used to perform OLAP-like tasks on big data.
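As an illustration of an OLAP-like task, the sketch below (using a hypothetical sales table with product_id, region, and revenue columns) computes revenue per product and region plus subtotals in one pass, using HiveQL's WITH ROLLUP extension to GROUP BY:

```sql
-- Revenue by product and region, with rollup subtotals.
-- (sales is a hypothetical table, not defined elsewhere in these slides.)
SELECT product_id, region, SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_id, region WITH ROLLUP;
```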
Why We Need Hive?
• In 2006, Facebook was generating 10 GB of data per
day, and by 2007 this had grown to 1 TB per day.
• Soon after, it was generating 15 TB of data per day.
• Initially, Facebook used the Scribe server, an Oracle
database, and Python scripts to process its large data
sets. As its data volumes grew, Facebook shifted to
Hadoop as its key tool for data analysis and processing.
Why Do We Need Hive?
• Facebook was using Hadoop to manage its big
data, but faced problems with ETL operations,
because every small operation required writing
a Java program.
• This demanded many Java developers, who are
difficult to find, and Java is not easy to learn.
• So Facebook developed Hive, which uses SQL-like
syntax that is easy to learn and write.
• Hive makes it easy for people who already know
SQL to work with big data.
Hive Features
1. It is a Data Warehousing tool.
2. It is used for enterprise data wrangling.
3. It uses the SQL-like language HiveQL (HQL). HQL is a non-procedural,
declarative language.
4. It is used for OLAP operations.
5. It increases productivity by reducing 100 lines of Java code into 4 lines
of HQL queries.
6. It supports Table, Partition, and Bucket data structures.
7. It is built on top of Hadoop Distributed File System (HDFS)
8. Hive supports Tez, Spark, and MapReduce.
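The Table, Partition, and Bucket structures from feature 6 can all appear in one CREATE TABLE statement. The sketch below uses a hypothetical users table (not part of the later examples): each country value gets its own partition directory, and rows within a partition are hashed by user_id into a fixed number of bucket files:

```sql
-- Hypothetical table illustrating partitioning and bucketing.
CREATE TABLE users (
  user_id INT,
  name    STRING
)
PARTITIONED BY (country STRING)      -- one HDFS directory per country
CLUSTERED BY (user_id) INTO 4 BUCKETS -- 4 hash buckets per partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';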
What is Data Warehousing?
• Data warehousing in Hive refers to the process of
organizing, storing, and managing large volumes
of structured and semi-structured data in a way
that facilitates efficient querying and
analysis.
• It involves using Hive, a data warehousing
infrastructure built on top of Hadoop, to create
and manage data warehouses where data can be
stored, processed, and analyzed for business
intelligence and decision-making purposes.
What Is Enterprise Data Wrangling?
• Enterprise data wrangling refers to the process of
preparing and transforming raw data from various
sources into a usable format for analysis,
reporting, and other data-related tasks within an
organization.
• It involves cleaning, structuring, enriching, and
integrating data from multiple sources to make it
accessible and actionable for business users,
analysts, and data scientists.
Non-Procedural and Declarative
Language
• Non-procedural languages focus on describing what should be
done rather than how it should be done. In other words, they
emphasize the end result rather than the step-by-step
process to achieve it. These languages abstract away the
control flow and implementation details, allowing the system
to determine the most efficient way to execute the task.
• Declarative languages are a broader category that includes
non-procedural languages. They allow users to define the
desired outcome or properties of a solution without
specifying the exact sequence of steps or algorithms to
achieve it.
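HiveQL itself is such a language. The query below (a sketch using the employee table defined later in these slides) only states what result is wanted; Hive decides how to compute it, e.g. which MapReduce/Tez jobs to run:

```sql
-- Declarative: we describe the result (total salary per department),
-- not the steps; Hive's compiler chooses the execution strategy.
SELECT dep, SUM(salary) AS total_salary
FROM employee
GROUP BY dep;
```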
Hive Architecture
• Shell/CLI: An interactive interface for writing queries.
• Driver: Handles sessions, and fetches and executes operations.
• Compiler: Parses, plans, and optimizes the code.
• Execution: In this phase, MapReduce jobs are submitted to Hadoop
and executed.
• Metastore: The Metastore is a central repository that stores the
metadata. It keeps all the details about tables, partitions, and
buckets.
• Example: Suppose we have a large dataset containing information about
sales transactions for a retail company, stored in HDFS. We want to
analyze this data using Hive to gain insights into sales performance.
Shell/CLI
• The Shell or Command-Line Interface (CLI) is an interactive
interface where users can write and execute HiveQL queries
to interact with the Hive system.
• Users can use the Hive shell to submit queries, manage
tables, and perform other operations
Driver
• The Driver is responsible for handling user
sessions, interpreting queries submitted through
the shell/CLI, and coordinating the execution of
these queries.
• It interacts with other components of the Hive
architecture to process user queries.
Compiler
• The Compiler receives the HiveQL queries submitted by users
and performs several tasks:
o Parsing: It parses the query to understand its syntactic
structure and extract relevant information.
o Planning: It generates an execution plan based on the query's
logical structure, determining the sequence of operations needed
to fulfill the query.
o Optimization: It optimizes the execution plan to improve query
performance by considering factors such as data locality, join
order, and filter pushdown.
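The plan the compiler produces can be inspected from the shell with EXPLAIN. A sketch, using the employee table defined later in these slides:

```sql
-- Prints the parsed/optimized execution plan instead of running the query.
EXPLAIN
SELECT location, AVG(salary)
FROM employee
GROUP BY location;
```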
Execution
• In this phase, the optimized execution plan is translated into one
or more MapReduce jobs, which are submitted to the Hadoop cluster
for execution.
• These MapReduce jobs process the data stored in HDFS according to
the query's requirements and produce the desired result.
Example:
• Suppose the query requires aggregating sales data by product ID.
The execution phase would involve MapReduce jobs that read and
process the sales data, performing the aggregation operation as
specified in the query.
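In HiveQL, that aggregation could be written as follows (a sketch assuming a sales table with the columns listed on the Metastore slide); Hive compiles this single statement into the MapReduce jobs described above:

```sql
-- Aggregate the hypothetical sales table by product.
SELECT product_id,
       SUM(quantity_sold) AS units,
       SUM(revenue)       AS total_revenue
FROM sales
GROUP BY product_id;
```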
Metastore
• The Metastore is a central repository that stores metadata
about Hive objects such as databases, tables, partitions,
columns, and storage properties. It keeps track of all the
details required to manage and query data stored in Hive.
Example:
• The Metastore stores metadata about the sales table,
including its schema (columns: transaction_id, product_id,
quantity_sold, revenue), data location in HDFS,
partitioning information (if any), and any associated
storage properties.
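The metadata held in the Metastore can be queried from the shell. A sketch, assuming the hypothetical sales table exists:

```sql
-- Schema, HDFS location, and storage properties from the Metastore:
DESCRIBE FORMATTED sales;
-- Partition list, if the table is partitioned:
SHOW PARTITIONS sales;
```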
Starting Hive
• At the terminal, type hive to start the interactive shell (the hive> prompt).
Database
• hive> show databases;
• hive> create database emp;
• hive> use emp;
Tables
hive>create table employee(
> emp_id int,
> name string,
> location string,
> dep string,
> designation string,
> salary int)
> row format delimited fields terminated by ',';
[Link]
101,Alice,New York,IT,Soft Engg,4000
102,Ali,Atlanta,Data Science,Sr Soft
Engg,4500
103,Chang,New York,Data Science,Lead,6000
104,Robin,Chicago,IT,Manager,7000
Load The Data From [Link]
• hive> load data local inpath
'/home/cloudera/[Link]' into
table employee;
• hive> select * from employee;
[Link]
101,2001,Web Portal
102,2002,NER Model
103,2003,OCR Model
104,2004,Web Portal
Load [Link] into a Table
hive> load data local inpath
'/home/cloudera/[Link]' into
table project;
hive> select * from project;
Join
• It is used to join two or more relations based on
a common column. Let's perform the JOIN
operation on the employee and project tables:
hive> select * from employee join project on
employee.emp_id=project.emp_id;
Group By
It is used to group the data based on a given
field or column in a table. Let's see an
example of Group By in the following query:
hive> select location, avg(salary) from
employee group by location;
Subquery
A subquery is a query within a query, i.e.,
a nested query. Here, the output of
one query becomes the input for
another query. Let's see an
example of a subquery in the
following query:
hive> select * from employee where
employee.emp_id in (select emp_id
from project where pname='Web
Portal');
Order By, Sort By
• ORDER BY: It always assures global ordering. It
is slower for large datasets because it pushes
all the data into a single reducer. In the final
output, you will get a single sorted output file.
• SORT BY: It orders the data at each reducer but
the reducer may have overlapping ranges of data.
In the final output, you will get multiple sorted
output files.
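The difference can be seen with the employee table from the earlier slides (reducer count shown is an illustrative setting, not a requirement):

```sql
-- ORDER BY: one reducer, one globally sorted output.
SELECT * FROM employee ORDER BY salary;

-- SORT BY: each reducer sorts its own share; with 2 reducers the
-- two output files are sorted but may cover overlapping salary ranges.
SET mapreduce.job.reduces = 2;
SELECT * FROM employee SORT BY salary;
```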
Hive Vs SQL
• SQL engines execute queries directly, while Hive translates
HiveQL queries into MapReduce (or Tez/Spark) jobs.
• Hive does not support a row-level update command, due to the
limitations and append-only nature of HDFS; Hive only has
INSERT OVERWRITE for update or insert functionality.
• SQL databases enforce a schema when data is written
(schema-on-write), while Hive applies the schema when data
is read (schema-on-read).
• SQL is designed for transactional (OLTP) workloads over
managed storage, while Hive targets analytical (OLAP)
workloads over files in HDFS.
Hive Limitations
• Hive is suitable for batch processing but
not suitable for real-time data handling.
• Update and delete are not allowed, but we can
delete in bulk, i.e., we can delete an entire
table but not individual rows.
• Hive is not suitable for OLTP(Online
Transactional Processing) operations.
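The usual workaround for the missing row-level delete is to rewrite the table with INSERT OVERWRITE, keeping only the rows that should survive. A sketch against the employee table from the earlier slides:

```sql
-- "Delete" the IT department by rewriting the table without it.
-- The SELECT runs first; its result then replaces the table's data.
INSERT OVERWRITE TABLE employee
SELECT * FROM employee WHERE dep <> 'IT';
```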