Chapter 2 - Query Processing and Optimization
Chapter Content
1. Query Processing
Steps of Processing
2. Methods of Optimization
Heuristic (Logical Transformations)
Transformation Rules
Heuristic Optimization Guidelines
Cost Based (Physical Execution Costs)
Data Storage/Access Refresher
What is a Query?
A query is a request for data or information from a database table or
combination of tables.
What is Query Processing?
The steps required to transform a high-level SQL query into a correct and
efficient strategy for execution and retrieval.
What is Query Optimization?
The activity of choosing a single efficient execution strategy (from
hundreds) as determined by database catalog statistics.
Which relational algebra expression, equivalent to the given query,
will lead to the most efficient solution plan?
For each algebraic operator, what algorithm (of several available) do
we use to compute that operator?
How do operations pass data (main memory buffer, disk buffer,…)?
Will this plan minimize resource usage? (CPU/Response Time/Disk)
Consider these tables for the examples that follow.
Staff
S_ID  FName   LName    Position  Salary  branchNo
001   Tola    Waqjira  Cashier   5000    777
002   Habiba  Ahmed    Manager   8000    999
003   Abdi    Jiregna  Cashier   6000    555
004   Lelise  Boru     DBA       7000    999
Branch
branchNo  BranchName  City
999       Batu        Batu
777       Abba Gada   Adama
555       Bishoftu    Bishoftu
Example: Identify all managers who work in Adama City
• Again requires (1000+50) disk accesses to read Staff and Branch
• Joins Staff and Branch on branchNo, producing 1000 tuples (1 employee : 1
branch), which takes 1000 disk accesses to write
• Requires 1000 disk accesses to read the joined relation back and check the predicate
• Total Work = (1000+50) + 2*(1000) = 3050 I/O operations
• 3300% improvement over Query 1
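The arithmetic for this strategy can be checked with a short script. This is a minimal sketch, assuming (as the "(1000+50)" term implies) 1000 Staff tuples and 50 Branch tuples, one disk access per tuple, and a joined relation that is written once and read back once. The function and variable names are illustrative only.

```python
# Assumed relation sizes, inferred from the "(1000+50)" term in the example.
staff_tuples = 1000
branch_tuples = 50


def join_then_select_cost(staff, branch):
    """I/O count for: read both inputs, write the join result, read it back."""
    read_inputs = staff + branch
    joined = staff  # each employee matches exactly one branch
    return read_inputs + 2 * joined  # write + re-read of the joined relation


total = join_then_select_cost(staff_tuples, branch_tuples)
print(total)  # 3050
```

The factor of 2 is the point of the example: an intermediate relation costs I/O twice, once when it is produced and once when it is consumed.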
Query 3 (Best)
Query Processing Steps
Example:
(catalog = “BS” AND catalog = “CS”) is contradictory, since a given book can be
classified in only one of the categories at a time.
d. Simplification
Detect redundant qualifications and eliminate common sub-expressions.
Transform the query into a semantically equivalent but simpler form.
2. Query Optimization
Everyone wants the performance of their database to be optimal.
In particular, there is often a requirement for a specific query, or an
object based on a query, to run faster.
The problem of query optimization is to find the sequence of steps that
produces the answer to the user's request in the most efficient manner,
given the database structure.
The performance of a query is affected by the tables or queries that
underlie it and by the complexity of the query itself.
Given a request for data manipulation or retrieval, an optimizer will
choose an optimal plan for evaluating the request from among the
various alternative strategies, i.e. there are many ways (access
paths) of accessing the desired file/record.
Hence, the DBMS is responsible for picking the best execution strategy
based on various considerations (e.g. the least amount of I/O and CPU resources).
Example: Consider relations r(A, B) and s(C, D). We require r × s.
Method 1 :
a. Load next record of r in RAM.
b. Load all records of s, one at a time and concatenate with r.
c. All records of r concatenated?
NO: goto a.
YES: exit (the result in RAM or on disk).
Performance: Too many accesses.
Method 2: Improvement
a. Load as many blocks of r as possible leaving room for one block of s.
b. Run through the s file completely one block at a time.
Performance: Reduces the number of times s blocks are loaded by a factor
equal to the number of r records that can fit in main memory.
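The difference between the two methods can be made concrete by counting how often blocks of s are loaded. This is a rough sketch under assumed sizes; the record counts, block counts, and function names are hypothetical and chosen only to illustrate the factor described above.

```python
import math


def method1_s_block_loads(r_records, s_blocks):
    """Method 1: s is scanned in full once for every single record of r."""
    return r_records * s_blocks


def method2_s_block_loads(r_records, s_blocks, r_records_in_memory):
    """Method 2: s is scanned in full once for every memory-sized chunk of r."""
    chunks = math.ceil(r_records / r_records_in_memory)
    return chunks * s_blocks


# Hypothetical sizes (not taken from the example above).
r_records = 1000
s_blocks = 50
r_records_in_memory = 100

m1 = method1_s_block_loads(r_records, s_blocks)
m2 = method2_s_block_loads(r_records, s_blocks, r_records_in_memory)
print(m1, m2, m1 // m2)  # 50000 500 100
```

With these numbers the saving is a factor of 100, which is exactly the number of r records assumed to fit in memory.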
Considerations during query optimization:
Narrow down intermediate result sets quickly: perform SELECT and PROJECT
before JOIN.
Use access structures (indexes).
Approaches to Query Optimization
A. Heuristics Approach
Uses knowledge of the characteristics of the relational algebra operations and
of the relationships between the operators to optimize the query.
Thus the heuristic approach of optimization will make use of:
Properties of individual operators
Association between operators
Query Tree: a graphical representation of the operators, relations, attributes
and processing sequence during query processing. It is composed of three
main parts:
a) The Leaves: the base relations used for processing the query/extracting the
required information
b) The Root: the final result/relation, produced as output by the operations on the
relations used for query processing
c) Nodes: intermediate results or relations before reaching the final result.
Sequence of execution of operation in a query tree will start from the leaves
and continues to the intermediate nodes and ends at the root.
Using Heuristics in Query Optimization
Query block:
The basic unit that can be translated into the algebraic
operators and optimized.
Contains a single select-from-where expression, as
well as group by and having clause if these are part of
the block.
Nested queries
Nested queries within a query are identified as separate query
blocks.
Query tree
A tree data structure that corresponds to a relational
algebra expression.
It represents the input relations of the query as leaf nodes
of the tree, and represents the relational algebra operations
as internal nodes.
An execution of the query tree consists of executing an
internal node operation whenever its operands are
available and then replacing that internal node by the
relation that results from executing the operation.
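The execution rule above can be sketched as a small tree evaluator. This is an illustrative sketch, not how any particular DBMS implements it: the `Node` class, the `execute` function, and the three operator functions are all hypothetical names, and the sample rows are a cut-down version of Staff and Branch style data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Relation = List[dict]


@dataclass
class Node:
    name: str
    op: Optional[Callable[..., Relation]] = None  # None marks a leaf
    children: tuple = ()
    relation: Optional[Relation] = None  # base relation, or computed result


def execute(node: Node) -> Relation:
    """Run an internal node once its operands exist, then replace it by its result."""
    if node.op is None:
        return node.relation
    operands = [execute(child) for child in node.children]
    node.relation = node.op(*operands)
    node.op, node.children = None, ()  # the node now behaves like a leaf
    return node.relation


def select_managers(rel):
    return [t for t in rel if t["Position"] == "Manager"]


def join_on_branch(left, right):
    return [{**l, **r} for l in left for r in right if l["branchNo"] == r["branchNo"]]


def project_name_city(rel):
    return [{"FName": t["FName"], "City": t["City"]} for t in rel]


staff = [
    {"FName": "Tola", "Position": "Cashier", "branchNo": 777},
    {"FName": "Habiba", "Position": "Manager", "branchNo": 999},
]
branch = [
    {"branchNo": 999, "City": "Batu"},
    {"branchNo": 777, "City": "Adama"},
]

tree = Node("project", project_name_city, (
    Node("join", join_on_branch, (
        Node("select", select_managers, (Node("Staff", relation=staff),)),
        Node("Branch", relation=branch),
    )),
))
result = execute(tree)
print(result)  # [{'FName': 'Habiba', 'City': 'Batu'}]
```

Leaves hold base relations, internal nodes hold operations, and evaluation proceeds bottom-up until the root has been replaced by the final relation.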
Query graph
A graph data structure that corresponds to a relational calculus
expression.
Example:
For every project located in ‘Stafford’, retrieve the project number, the controlling
department number and the department manager’s last name, address and birthdate.
Relational algebra:
π PNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σ PLOCATION=‘STAFFORD’ (PROJECT))
⋈ DNUM=DNUMBER (DEPARTMENT)) ⋈ MGRSSN=SSN (EMPLOYEE))
SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM = D.DNUMBER AND
D.MGRSSN = E.SSN AND
P.PLOCATION = 'STAFFORD';
Summary of Heuristics for Algebraic Optimization:
The same query could correspond to many different relational
algebra expressions and hence many different query trees.
The main heuristic is to apply first the operations that reduce the size of
intermediate results.
Perform select operations as early as possible to reduce the number of
tuples.
Perform project operations as early as possible to reduce the number of
attributes. (This is done by moving select and project operations as far
down the tree as possible.)
The select and join operations that are most restrictive should be executed
before other similar operations.
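The effect of moving selections down the tree can be seen on a few rows. This is a small sketch with sample rows modeled on Staff and Branch style tables; the predicate (managers in Batu) and all names in the code are chosen purely for illustration.

```python
staff = [
    {"FName": "Tola", "Position": "Cashier", "branchNo": 777},
    {"FName": "Habiba", "Position": "Manager", "branchNo": 999},
    {"FName": "Abdi", "Position": "Cashier", "branchNo": 555},
    {"FName": "Lelise", "Position": "DBA", "branchNo": 999},
]
branch = [
    {"branchNo": 999, "City": "Batu"},
    {"branchNo": 777, "City": "Adama"},
    {"branchNo": 555, "City": "Bishoftu"},
]


def join(left, right):
    """Equi-join on branchNo."""
    return [{**l, **r} for l in left for r in right if l["branchNo"] == r["branchNo"]]


# Plan A: join first, select afterwards (large intermediate result).
joined = join(staff, branch)
plan_a = [t for t in joined if t["Position"] == "Manager" and t["City"] == "Batu"]

# Plan B: select first, join afterwards (small intermediate results).
managers = [t for t in staff if t["Position"] == "Manager"]
batu_branches = [t for t in branch if t["City"] == "Batu"]
plan_b = join(managers, batu_branches)

print(len(joined), len(managers), len(batu_branches))  # 4 1 1
print(plan_a == plan_b)  # True
```

Both plans return the same tuples, but Plan B joins one row with one row instead of materializing all four joined rows first.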
B. Cost Estimation Approach to Query Optimization
The main idea is to minimize the cost of processing a query.
The cost function is comprised of:
I/O cost + CPU processing cost + communication cost + Storage cost
These components might have different weights in different
processing environments
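One way to picture the weighted cost function is as a weighted sum whose weights depend on the environment. The sketch below is only an illustration: the `PlanCost` class, the weight parameters, and every number in it are hypothetical, not figures from a real optimizer.

```python
from dataclasses import dataclass


@dataclass
class PlanCost:
    io: float
    cpu: float
    communication: float
    storage: float


def total_cost(c, w_io=1.0, w_cpu=1.0, w_comm=1.0, w_storage=1.0):
    """Weighted sum of the four cost components."""
    return (w_io * c.io + w_cpu * c.cpu
            + w_comm * c.communication + w_storage * c.storage)


def cheapest(plans, **weights):
    """Name of the plan with the lowest weighted cost."""
    return min(plans, key=lambda name: total_cost(plans[name], **weights))


# Hypothetical estimates for two alternative plans.
plans = {
    "X": PlanCost(io=100, cpu=10, communication=50, storage=5),
    "Y": PlanCost(io=60, cpu=40, communication=200, storage=5),
}

print(cheapest(plans, w_comm=0.0))  # Y  (centralized: communication ignored)
print(cheapest(plans))              # X  (distributed: communication counted)
```

The same two plans rank differently once communication is weighted in, which is why the components carry different weights in different processing environments.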
The DBMS will use information stored in the system catalogue for the
purpose of estimating cost.
The main target of query optimization is to minimize the size of the
intermediate relations. Their size affects the cost of:
1) Access Cost of Secondary Storage
2) Storage Cost
3) Computation Cost
4) Communication Cost
5) Memory usage cost
1. Access Cost of Secondary Storage
Data is accessed from secondary storage. The disk access cost can
be analyzed in terms of searching, reading, and writing the data blocks
used to store some portion of a relation.
Remark: The disk access cost will vary depending on:
The file organization used and the access method implemented for that file
organization.
Whether the data is stored contiguously or in a scattered manner.
2. Storage Cost
• While processing a query, as any query would be composed of many
database operations, there could be one or more intermediate results before
reaching the final output. These intermediate results should be stored in
primary memory for further processing.
• The bigger the intermediate relation, the larger the memory requirement,
which has an impact on the limited space available. This is
counted as the cost of storage.
3. Computation Cost
A query is composed of many operations. These can be
database operations, such as reading from and writing to disk, or
mathematical and other operations such as:
Searching
Sorting
Merging
Computation on field values
4. Communication Cost
• In most database systems the database resides at one station and is
accessed by queries originating from different terminals. This
affects the performance of the system and adds to the cost of
query processing. Thus, the cost of transporting data between the
database site and the terminal from which the query originates should
be analyzed.
5. Memory usage cost
The cost pertaining to the number of memory buffers needed during
query execution.
Large databases
The access cost to secondary storage is the main emphasis.
Smaller databases
The emphasis is on minimizing computation cost.
Distributed databases
Communication cost must also be minimized.