Hive Query Optimization Infinity

Well-designed tables in Hive, such as partitioned and bucketed tables, can improve query speed and reduce processing costs. The document discusses partitioning Hive tables horizontally by fields such as date or location, so that related records are grouped together. It also covers bucketing tables to enable more efficient queries and sampling. Parallel query execution in Hive allows subqueries that are not interdependent to run simultaneously, which improves performance.

Uploaded by

shashwat2010


dwivedishashwat@gmail.com http://helpmetocode.blogspot.com

Well-designed tables (partitioned and bucketed) and well-written queries can improve your query speed and reduce processing cost.

Optimization on Table side

Partitioning Hive Tables:
Partitioning is a kind of horizontal slicing of data. The slicing can be on a range, a single value, or a set of values. Imagine log files where each record includes a timestamp. If we partition by date, records for the same date are stored in the same partition.

Examples:
Partition on date.
Partition on geographic location.
Partition on a number range.

Defining a table partition

Let's take an Apache log file as an example, where the web server writes a log entry for each client visit. These logs contain date and time information along with details about the browser and location (IP). We can create a table in Hive, partition the log data by date, and sub-partition it by location, like this:

CREATE TABLE logs (ts BIGINT, detail STRING) PARTITIONED BY (dt STRING, country STRING);

Log Table

Directory Structure

/user/hive/warehouse/logs
    /dt=2010-01-01
        /country=GB
            /file1
            /file2
        /country=US
            /file3
    /dt=2010-01-02
        /country=GB
            /file4
        /country=US
            /file5
            /file6
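The mapping from partition values to directories can be sketched in Python. This is only a toy model of the layout shown above, not Hive code: the helper names `partition_path` and `prune` are made up for illustration, and the partition list simply mirrors the example directories. It shows why a filter on a partition column lets a query skip whole directories.

```python
def partition_path(warehouse, table, partition_values):
    """Build the directory for one partition: <warehouse>/<table>/k1=v1/k2=v2."""
    parts = [f"{key}={value}" for key, value in partition_values]
    return "/".join([warehouse.rstrip("/"), table] + parts)


def prune(partitions, **filters):
    """Keep only the partitions whose values match every filter."""
    return [
        p for p in partitions
        if all(dict(p).get(key) == value for key, value in filters.items())
    ]


WAREHOUSE = "/user/hive/warehouse"

# Hypothetical partition list mirroring the example directory tree.
partitions = [
    (("dt", "2010-01-01"), ("country", "GB")),
    (("dt", "2010-01-01"), ("country", "US")),
    (("dt", "2010-01-02"), ("country", "GB")),
    (("dt", "2010-01-02"), ("country", "US")),
]

# A filter on country=GB only needs two of the four directories.
for p in prune(partitions, country="GB"):
    print(partition_path(WAREHOUSE, "logs", p))
```

With a filter on `country`, only the two `country=GB` directories are listed; the `country=US` directories are never touched.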

Hive Buckets
Bucketing Hive Tables:
Bucketing a Hive table results in more efficient queries.

Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. If two tables are bucketed in the same way on the join column, a mapper processing a bucket of the left table knows that the matching rows in the right table are in the corresponding bucket, so it needs to retrieve only that bucket. Buckets may additionally be sorted by one or more columns. This allows even more efficient map-side joins, since the join of each bucket becomes an efficient merge sort.

Bucketing also makes sampling more efficient.
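The bucket-to-bucket matching described above can be sketched in Python. This is a toy model under stated assumptions, not Hive's implementation: keys are integers, the bucket is simply the key modulo the bucket count (Hive's real hash depends on the column type), and the per-bucket join uses a dictionary lookup rather than a merge. The table contents and function names are invented for illustration.

```python
def bucket_of(key, num_buckets):
    """Toy bucket assignment for integer keys: key modulo the bucket count."""
    return key % num_buckets


def bucketize(rows, num_buckets):
    """Split (key, value) rows into num_buckets lists, each sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return [sorted(b) for b in buckets]


def bucket_join(left_buckets, right_buckets):
    """Join bucket i of the left table only against bucket i of the right."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, left_value in left:
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical tables, both bucketed into 2 buckets on the join key.
users = [(1, "ann"), (2, "bob"), (3, "cid"), (4, "dee")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (5, "cup")]

left = bucketize(users, 2)
right = bucketize(orders, 2)

print(sorted(bucket_join(left, right)))

# Sampling: reading one bucket gives a slice of the table without a full scan.
print(left[0])
```

Because both tables use the same bucket count and the same key, a matching row can only ever be in the bucket with the same index, which is what lets each bucket pair be joined on its own.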

Parallel execution of queries

Hadoop can execute MapReduce jobs in parallel, and Hive can take advantage of this parallelism. Queries or sub queries that are not interdependent can be executed in parallel, as in some join queries.

The following setting turns this mode on:

SET hive.exec.parallel=true; -- turns parallel execution on
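The idea of running non-interdependent stages together can be sketched in Python. This is not Hive's scheduler, only an illustration: the function `parallel_waves` and the stage names in `plan` are hypothetical. It groups stages into waves, where every stage in a wave depends only on stages from earlier waves and could therefore run at the same time.

```python
def parallel_waves(deps):
    """Group stages into waves; stages in one wave do not depend on each other.

    deps maps each stage name to the set of stages it must wait for.
    """
    remaining = {stage: set(needs) for stage, needs in deps.items()}
    done, waves = set(), []
    while remaining:
        ready = sorted(s for s, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("dependencies cannot be satisfied")
        waves.append(ready)
        done.update(ready)
        for stage in ready:
            del remaining[stage]
    return waves


# Hypothetical plan: three independent sub queries, two joins, one main query.
plan = {
    "subquery1": set(),
    "subquery2": set(),
    "subquery3": set(),
    "join12": {"subquery1", "subquery2"},
    "join123": {"join12", "subquery3"},
    "main": {"join123"},
}

for wave in parallel_waves(plan):
    print(wave)
```

In this made-up plan the three sub queries land in the first wave, so they are the only stages that benefit from parallel execution; the joins and the main query still run one after another.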

[Diagram: Sub query 1 and Sub query 2 are joined; that result is joined with Sub query 3; the result feeds the Main Query, which produces the Final Result. The steps are numbered 1 to 5.]

So in the above flow, 1, 2 and 4 can run in parallel as sub queries, and are then joined to 3 and then to 5 to give the final query result.

Misc

Since a map join is faster than a common join, it is better to run a map join whenever possible. Previously, Hive users needed to give a hint in the query to specify the small table, using the alias of that table. For example, if x is the small table: select /*+ MAPJOIN(x) */ * from src1 x join src2 y on x.key = y.key; Newer Hive versions can convert a common join to a map join automatically (see hive.auto.convert.join).

Some examples

Which query is faster?

SELECT count(DISTINCT col) FROM tbl;

Or

SELECT count(*) FROM (SELECT DISTINCT col FROM tbl) t;

Answer

The second one is faster.

In the first case:
The maps send each value to the reducer.
A single reducer counts them all (overhead).

In the second case:
The maps split the values across many reducers.
Each reducer generates a list.
The final job counts the size of each list.

Note: a singleton reducer is not always good.
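The difference between the two cases can be sketched in Python. This is a toy simulation, not what Hive executes: the values are assumed to be integers, the "reducers" are plain sets, values are routed by value modulo the reducer count, and the mapper outputs are made up. The function names are hypothetical.

```python
def count_distinct_single_reducer(map_outputs):
    """First case: every mapper sends every value to one reducer.

    Returns (distinct count, number of values that one reducer received).
    """
    received = [value for output in map_outputs for value in output]
    return len(set(received)), len(received)


def count_distinct_two_stage(map_outputs, num_reducers):
    """Second case: values are split across reducers, then sizes are added.

    Returns (distinct count, size of each reducer's distinct list).
    """
    reducers = [set() for _ in range(num_reducers)]
    for output in map_outputs:
        for value in output:
            # The same value always goes to the same reducer.
            reducers[value % num_reducers].add(value)
    sizes = [len(r) for r in reducers]
    return sum(sizes), sizes


# Hypothetical output of three mappers.
map_outputs = [[1, 2, 2, 3], [3, 4, 4, 5], [5, 6, 1, 1]]

print(count_distinct_single_reducer(map_outputs))
print(count_distinct_two_stage(map_outputs, 3))
```

Adding the per-reducer sizes gives the right answer because each value is routed to exactly one reducer, so the distinct lists never overlap. In the first case all twelve values pass through one reducer; in the second, the work is shared.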

Tips
Hive does not know whether a query is bad.

So use EXPLAIN on queries you suspect are bad, or even on ones you do not suspect. EXPLAIN tells you the following:
Number of jobs
Number of map and reduce tasks
What each job is sorting by
Which directories it will read

EXPLAIN helps you see the difference between two or more queries written for the same purpose. Job configuration and history can also be studied to understand query performance.
