Hive Query Optimization Infinity
Hive Query Optimization Infinity
com
Well designed tables Partitioning Bucketing and well written queries can improve your query speed and
on the range, single value or a set of values. Imagine log files where each record includes a timestamp. If we partitioned by date, then records for the same date would be stored in the same partition. E.g.: Partition on date. Partition on geography location. Partition on number range.
server on visit of client. These log contains data & time information about browser and location(IP). So we can create table in hive and partition these log data using date & time and we can create sub partition of location. Which looks like :
CREATE TABLE alogs (timstamp BIGINT, detail STRING) PARTITIONED BY (date STRING, loc STRING);
Log Table
Directory Structure
Hive Buckets
Bucketing Hive Tables:
Bucketing hive table result in more efficient queries.
Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. The two tables are bucketed in the same way, a mapper processing a bucket of the left table knows that the matching rows in the right table are in its corresponding bucket, so it need only retrieve that bucket. Bucket may additionally be sorted by one or more columns. This allows even more efficient map-side joins, since the join of each bucket becomes an efficient merge-sort.
Final Result 4 Main Query 5 Query (1 & 2) & 3 Joined Join Sub query (1 & 2) Joined Join Sub query 1
2 Sub query 2
3 Sub query 3
Misc
So in the above flow, 1,2,4 can run in parallel as sub queries and
then joined finally to 3 and then to 5 and the final query result.
Since map join is faster than the common join, it's better to run the map join whenever possible. Previously, Hive users needed to give a hint in the query to specify the small table. For example, select /*+mapjoin(a)*/ * from src1 x join src2 y on x.key=y.key; Newer hive automatically converts normal join to map join.
Some examples
Or
Select count(*) from (select distinct(column) from table) ??
Answer
M M M M M M
Result
Result
In first case :
Maps send each value to reducer Single reducer counts them all(over head)
In Second Case:
Map splits the values to many reducer
Each reducer generated a list Final job is to count the size of each list
Tips
Hive does not know whether query is bad.
even dont doubt. Explain tells about following Number of jobs Number of map and reduce What job is sorting by What are the directories it will read. So explain will help to see the difference between the two or more queries for the same purpose. Job configuration and history can be studied for the query performance.