DB2 SQL Tuning Best Practices
FEBRUARY 2010
TABLE OF CONTENTS
1.0 Overview
2.0 Introduction
3.0 UDB DB2 Database Manager Background
4.0 Assumptions
5.0 Best Practices
    5.1 Best Practices for Database Configuration
        5.1.1 Database Optimization Class Registry Setting
        5.1.2 Database Manager Instance Configuration File Parameters
        5.1.3 Database Configuration File Parameters
        5.1.4 Database Bufferpool and Tablespace Configuration
    5.2 Database Table and Index Best Practices
        5.2.1 Database Table and Index Design
    5.3 UDB DB2 Database RUNSTATS
        5.3.1 RUNSTATS Command
    5.4 UDB DB2 Database Table Reorganization
        5.4.1 REORGANIZE and REORGCHK Commands
    5.5 SQL Workload Tuning Best Practices
        5.5.1 Prioritize then Divide and Conquer
        5.5.2 Get Baseline Run Times and EXPLAIN Plans
        5.5.3 Best Practice Coding Techniques
        5.5.4 Review Joins and Indexes
        5.5.5 Review All Selected Columns and Table Indexes
        5.5.6 Retest the Entire Work Load After SQL Performance Tuning
        5.5.7 DB2 Index Advisor (db2advis - DB2 design advisor command)
    5.6 Explain Tools
        5.6.1 Visual Explain
        5.6.2 DB2expln Facility
6.0 Appendix
1.0 Overview
The intent of this document is to describe best practices for SQL tuning for DB2 databases in LUW environments. The document covers:

- Database Maintenance for Best Practices
- Database Configuration for Best Performance
- Database Design Issues for Best Performance
- SQL Coding for Best Practices
- SQL Explain Tools for Performance Tuning
Revision Date: 02/02/2010
Revised By: Bruce Woodcraft
Revision Summary: Initial draft
Version: 1
2.0 Introduction
This document describes best practices for writing Structured Query Language (SQL) scripts which retrieve data from an IBM DB2 database running on a Linux, UNIX, or Windows (LUW) server. It covers the best practices for writing SQL, reviewing database maintenance that affects data retrieval, database configuration parameters that impact performance, database object design issues for tables and indexes, and using the explain tools to assist in performance tuning activities.

SQL query tuning factors can be broken down into several categories:

- Database Configuration
- Database Object Maintenance
- Database Object Design (Tables and Indexes)
- SQL Coding Techniques
- DB2 Explain Plan Tools
There are many factors that determine the performance of a given SQL query, many of which are beyond the control of the SQL query developer. For instance, database configuration parameter settings and table maintenance activities are controlled by the DBA; the SQL developer most likely does not have access to change or modify them. It has been widely documented in the database tuning literature that the SQL query script itself is the single largest performance factor in more than three out of four cases. For this reason, this document focuses most heavily on SQL coding techniques for performance. The other contributing factors are discussed in far less detail, as their remedies are covered in other documents and are beyond the scope of this one.
The Optimizer
The optimizer is the heart and soul of DB2. It analyzes SQL statements and determines the most efficient access path available for satisfying each statement (see Figure 1). DB2 UDB accomplishes this by parsing the SQL statement to determine which tables and columns must be accessed. The DB2 optimizer then queries system information and statistics stored in the DB2 system catalog to determine the best method of accomplishing the tasks necessary to satisfy the SQL request.
The optimizer is equivalent in function to an expert system. An expert system is a set of standard rules that, when combined with situational data, returns an "expert" opinion. For example, a medical expert system takes the set of rules determining which medication is useful for which illness, combines it with data describing the symptoms of ailments, and applies that knowledge base to a list of input symptoms. The DB2 optimizer renders expert opinions on data retrieval methods based on the situational data housed in DB2's system catalog and a query input in SQL format.

The notion of optimizing data access in the DBMS is one of the most powerful capabilities of DB2. Remember, you access DB2 data by telling DB2 what to retrieve, not how to retrieve it. Regardless of how the data is physically stored and manipulated, DB2 and SQL can still access that data. This separation of access criteria from physical storage characteristics is called physical data independence. DB2's optimizer is the component that accomplishes this physical data independence. If you remove the indexes, DB2 can still access the data (although less efficiently). If you add a column to the table being accessed, DB2 can still manipulate the data without changing the program code. This is all possible because the physical access paths to DB2 data are not coded by programmers in application programs, but are generated by DB2.

Compare this with non-DBMS systems in which the programmer must know the physical structure of the data. If there is an index, the programmer must write appropriate code to use the index. If someone removes the index, the program will not work unless the programmer makes changes. Not so with DB2 and SQL. All this flexibility is attributable to DB2's capability to optimize data manipulation requests automatically. The optimizer performs complex calculations based on a host of information.
To visualize how the optimizer works, picture it performing a four-step process:

1. Receive and verify the syntax of the SQL statement.
2. Analyze the environment and optimize the method of satisfying the SQL statement.
3. Create machine-readable instructions to execute the optimized SQL.
4. Execute the instructions or store them for future execution.
The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that you can send its way? The optimizer has many types of strategies for optimizing SQL. How does it choose which of these strategies to use in the optimized access paths? IBM does not publish the actual, in-depth details of how the optimizer determines the best access path, but the optimizer is a cost-based optimizer. This means the optimizer will always attempt to formulate an access path for each query that reduces overall cost. To accomplish this, the DB2 optimizer applies query cost formulas that evaluate and weigh four factors for each potential access path: the CPU cost, the I/O cost, statistical information in the DB2 system catalog, and the actual SQL statement.
4.0 Assumptions
This document assumes the target audience has some experience with SQL query scripting against a relational database, and it points out specific best practices for using IBM's UDB DB2 database product in Linux, UNIX, and Windows (LUW) environments. A full treatment of UDB DB2 instance and database parameter configuration is beyond the scope of this paper, but these settings are briefly mentioned below because they play an important role in overall performance optimization.
Again, CAUTION should be used when changing this setting. More information and a complete discussion of this setting can be found in the IBM UDB Information Center for LUW. http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp

5.1.2 DATABASE MANAGER INSTANCE CONFIGURATION FILE PARAMETERS

Each UDB DB2 instance has an instance configuration file that contains 68 parameters. A few of them have a significant impact on performance and are listed below.
Table source: IBM Redbook DB2 UDB Enterprise Edition V8.1: Basic Performance Tuning Guidelines http://www.redbooks.ibm.com/redpapers/pdfs/redp4251.pdf
2008 Computer Sciences Corporation.
These parameters should be tuned by the database support DBA with CAUTION. For further detail on these parameters see the source document.

5.1.3 DATABASE CONFIGURATION FILE PARAMETERS

Each UDB DB2 database has its own database configuration file, which contains 82 different parameters. Below are the parameters that could have the greatest performance impact. Again, use caution when changing any UDB DB2 parameter.
Table source: IBM Redbook DB2 UDB Enterprise Edition V8.1: Basic Performance Tuning Guidelines http://www.redbooks.ibm.com/redpapers/pdfs/redp4251.pdf
Like the DB2 instance settings that can be tuned, there are many DB2 database configuration settings that can have a significant effect on the performance of the database. Several key settings are: AVG_APPLS, which the optimizer uses to estimate how much buffer pool memory each application will get; CATALOGCACHE_SZ, which determines how much memory is used to cache the system catalog; and SORTHEAP, which specifies the amount of memory available for each sort operation. The details of tuning these parameters are discussed in the IBM Redbook referenced above, under UDB DB2 Database Tuning Best Practices, and in IBM's UDB DB2 Administration manual.

5.1.4 DATABASE BUFFERPOOL AND TABLESPACE CONFIGURATION

In any database design and configuration, the size and allocation of the database's buffer pools and table spaces are the factors with the greatest impact on improving performance. Buffer pools cache data in memory for reading and writing to disk, and data is handled much faster from memory than from disk. Generally, there are just a few buffer pool page sizes, matching the different table space page sizes. Special-purpose buffer pools may be created for specific data and processing methods. Likewise, there are many sizes of table spaces and special-purpose table spaces; for instance, temporary table spaces are created and assigned to specific buffer pools. UDB DB2 has options for partitioning large tables into multiple table spaces for data separation and faster I/O performance. Specific data that is used frequently can be set up in its own buffer pool and table space so it can stay in memory for fast access. In tuning queries you may come across often-used data that can be separated out and tuned in this fashion. Table space changes, and to a lesser extent buffer pool changes, may be needed to optimize a given query workload; these would be the responsibility of a DBA, not a developer.
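The kind of dedicated buffer pool and table space described above can be sketched as follows. This is only an illustration; the object names, page size, and pool size are hypothetical and would be chosen by the DBA for the actual workload:

```sql
-- Create a dedicated 8K buffer pool for frequently accessed data
-- (name and SIZE in pages are illustrative only):
CREATE BUFFERPOOL BP_HOT_8K SIZE 50000 PAGESIZE 8K;

-- Create a table space with a matching page size, bound to that pool,
-- and place the hot tables in it so their pages tend to stay in memory:
CREATE TABLESPACE TS_HOT_8K
    PAGESIZE 8K
    MANAGED BY AUTOMATIC STORAGE
    BUFFERPOOL BP_HOT_8K;
```

As the next paragraph cautions, growing one pool at the expense of another can hurt the rest of the workload, so changes like this belong to the DBA.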
Remember, database configuration changes like the ones mentioned above need to be made with CAUTION, as they could be counterproductive to other queries in the workload, especially if one buffer pool is reduced to create another. It is for this reason that workloads need to be tuned as a group and measured as a group, after individually looking at the slow performers and the most frequently run queries. (Do not underestimate the improvement that can be made to the overall runtime of a workload by tuning a small query that is run a million times.)
Five to seven indexes per table, with five to nine columns at most. Most if not all tables will have an index of some kind. Generally most have a unique index that serves as the Primary Key and is explicitly stated as the Primary Key. (Note that in UDB DB2 it can be created as a CONSTRAINT, and an index will be created for it.)
Rule to Remember:
Use the Primary Key on a table whenever possible, unless another index provides more columns and a faster access path.
Unique indexes can be created on columns other than the Primary Key (PK); these are referred to as Alternate Keys. For example, a sequence number (or identity column) may be added to the row to provide a sequential numeric column to use as the PK, while a group of other columns may form the natural key as a unique combination of columns. Unique indexes may INCLUDE other non-indexed columns that provide a direct data source for a few table columns. This becomes an extremely effective tool, especially for large rows with many columns: adding a few extra columns to the unique index (or AK) permits the I/O to be limited to the index only, saving large row reads. This I/O technique is known as Index-Only Read and is quite efficient compared to reading both the index and the data rows. In a Snowflake or Hub-and-Spoke data model, where a few Fact tables are linked to numerous Attribute tables, the Fact table should have single-column attribute key indexes that match the indexes of the Attribute tables. UDB DB2 has a special join operator called the STAR JOIN, which handles this type of join and index processing in a highly efficient way using RID processing and index ANDing. See the IBM UDB Information Center for complete details of the STAR JOIN.
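As a sketch of the INCLUDE technique, assuming a hypothetical ORDERS table (the table, column, and index names are illustrative, not from this document):

```sql
-- ORDER_NO is the surrogate Primary Key; (CUST_ID, ORDER_DATE) is the
-- natural Alternate Key. INCLUDE carries two extra data columns in the
-- index leaf pages without making them part of the unique key:
CREATE UNIQUE INDEX ORDERS_AK1
    ON ORDERS (CUST_ID, ORDER_DATE)
    INCLUDE (ORDER_STATUS, ORDER_TOTAL);

-- This query can now be answered from the index alone (an Index-Only
-- Read), with no reads of the wide data rows:
SELECT ORDER_DATE, ORDER_STATUS, ORDER_TOTAL
  FROM ORDERS
 WHERE CUST_ID = 12345;
```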
5.3.1 RUNSTATS COMMAND

The UDB DB2 database uses catalog statistics and column distribution counts to help the optimizer determine the optimal data access path. Because the optimizer uses these counts to estimate the costs of various steps, these statistics are critical to the decision-making process. The RUNSTATS command is used to generate fresh row counts and column distributions after a table has been modified in a significant way since the last time RUNSTATS was run.
Rule to Remember:

Run the RUNSTATS command after a table has been significantly modified by inserts, updates, or deletes.
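A typical RUNSTATS invocation, run from the DB2 command line processor, might look like the following (the schema and table name are hypothetical):

```sql
-- Refresh table, column-distribution, and detailed index statistics
-- after significant changes to the table:
RUNSTATS ON TABLE APPSCHEMA.ORDERS
    WITH DISTRIBUTION AND DETAILED INDEXES ALL;
```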
Rule to Remember:
Run the REORG command after significant deletions and additions to a table or index.

The REORGCHK command calculates statistics on the database to determine if tables or indexes, or both, need to be reorganized or cleaned up.
Rule to Remember:
Run the REORGCHK command to check whether a table or index needs to be cleaned up.
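From the DB2 command line processor, the check-then-reorganize sequence described above might look like this (the schema and table name are hypothetical):

```sql
-- Refresh statistics and report whether tables or indexes need a REORG
-- (look for asterisks in the REORG column of the output):
REORGCHK UPDATE STATISTICS ON TABLE APPSCHEMA.ORDERS;

-- If flagged, reorganize the table and its indexes:
REORG TABLE APPSCHEMA.ORDERS;
REORG INDEXES ALL FOR TABLE APPSCHEMA.ORDERS;
```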
selectivity estimates, and requires extra processing at query execution time. These expressions also prevent or hamper query rewrite optimization steps.

Match JOIN Column Types - Avoid mismatched JOIN values, as data type mismatches prevent the use of hash joins. Also note that if the JOIN column data type is CHAR, GRAPHIC, DECIMAL, or DECFLOAT, the lengths must be the same.

Avoid Non-Equality JOINs - JOIN predicates that use comparison operators other than equality should be avoided because the join method is limited to a nested loop join. Also, the optimizer might not be able to compute an accurate selectivity estimate for the JOIN predicate. When a non-equality JOIN cannot be avoided, be sure an appropriate index exists on either table, because the join predicates will be applied on the nested loop join inner.

Don't Use DISTINCT Aggregations - The DISTINCT function causes a sort of the final result set, making it one of the more expensive sorts. Note that as of DB2 V9 the optimizer will look to take advantage of an index to eliminate a sort for uniqueness, as it already does when optimizing a GROUP BY statement. Rewriting the SQL script using a GROUP BY or a sub-SELECT (or IN predicate) will usually be more efficient. Also, avoid multiple DISTINCT aggregations [e.g., SUM(DISTINCT colx), AVG(DISTINCT coly)] in the same SELECT; this becomes very expensive because the optimizer rewrites the original query into separate aggregations and SORTs for each DISTINCT keyword, and then combines the multiple aggregations using a UNION operation.

Avoid Outer Joins Unless Necessary - The left outer join can prevent a number of optimizations, including the use of specialized star-schema join access methods. However, in some cases the left outer join can be automatically rewritten to an inner join by the query optimizer, depending on the other predicates in the SQL script.
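The DISTINCT-to-GROUP BY rewrite mentioned above can be sketched as follows (table and column names hypothetical):

```sql
-- Original: DISTINCT forces a sort of the final result set.
SELECT DISTINCT CUST_ID
  FROM ORDERS;

-- Rewrite: GROUP BY gives the optimizer the chance to use an index
-- on CUST_ID to deliver unique values without a separate sort.
SELECT CUST_ID
  FROM ORDERS
 GROUP BY CUST_ID;
```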
Use of the inner equijoin is often more efficient, so use it where possible.

Tell the Optimizer How Many Rows to Expect - When the result set size is known or can be closely estimated, use the OPTIMIZE FOR N ROWS clause along with the FETCH FIRST N ROWS ONLY clause. OPTIMIZE FOR N ROWS indicates to the optimizer that the application intends to retrieve only N rows, but the query will still return the complete result set. FETCH FIRST N ROWS ONLY indicates that the query should return only N rows. Use OPTIMIZE FOR N ROWS along with FETCH FIRST N ROWS ONLY to encourage query access plans that return rows directly from the referenced tables, without first performing a buffering operation such as inserting into a temporary table, sorting, or inserting into a hash join hash table. NOTE: if you specify OPTIMIZE FOR N ROWS to encourage query access plans that avoid buffering operations, but then retrieve all rows of the result set, you could experience degraded performance. This is because the query access plan that returns the first N rows fastest might not be the best plan if the entire result set is being retrieved.

Avoid Redundant Predicates - Eliminate duplicate predicates, especially when they occur across different tables. In some cases, the optimizer cannot detect that the predicates are redundant. This might result in cardinality underestimation and the selection of a suboptimal access plan. Review the SQL script for columns with the same data but different column names where the same tests are being performed. Again, keep the predicates as simple as possible and remove the same test on similar columns wherever possible.

Select Only the Columns Needed - Avoid using SELECT *, as it returns all the columns for each row. This causes more I/O processing and slows down SORTs with needless data. Also, don't select columns whose value you already know in the SQL script, which causes more unneeded data handling. For example, SELECT A, B, C ... WHERE C = 1958 causes column C data to be processed needlessly. Also, don't select columns used only for sorting or grouping if those columns are not needed in the returned data set.

Select Only the Rows Needed - Reducing the set of rows returned in a result set makes the query handle less data and run faster. Use row filter predicates to limit the rows of data being returned. When writing a SQL script with multiple predicates, determine the predicate that will filter out the most data from the result set and place it at the start of the list. By sequencing your predicates in this manner, the subsequent predicates will have less data to filter and process.

Use an INDEX in Place of a SORT - Creating an index on commonly sorted data columns could save a SORT of the result set.
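Several of the techniques above can be combined in one statement: select only the needed columns, and pair FETCH FIRST N ROWS ONLY with OPTIMIZE FOR N ROWS when only the first rows are wanted (the table and column names are hypothetical):

```sql
-- The application shows only the 20 most recent orders, so name just
-- the needed columns and tell the optimizer the true fetch intent:
SELECT ORDER_NO, ORDER_DATE, ORDER_TOTAL
  FROM ORDERS
 WHERE CUST_ID = 12345
 ORDER BY ORDER_DATE DESC
 FETCH FIRST 20 ROWS ONLY
 OPTIMIZE FOR 20 ROWS;
```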
5.5.4 REVIEW JOINS AND INDEXES

Table joins should always use indexed columns whenever possible for best performance. Review the JOINs and the columns used; ideally, use the Primary Key for at least one of the tables. Using indexed columns in JOINs permits the optimizer to use the column statistics and the index to determine the best access path, and could reduce I/O by using the index rather than the data from the table. The use of indexed columns in filtering predicates reduces the processing and data handling required by utilizing the indexes and index processing methods.

5.5.5 REVIEW ALL SELECTED COLUMNS AND TABLE INDEXES

The selected columns should be reviewed as well as the JOIN columns. Columns needed to satisfy the query may be available in the index used for a table JOIN or an index used for accessing the table. If all of the selected columns are in an index, then I/O processing can be limited to just the index pages. This is known as an Index-Only Read, which is much more efficient than reading both the index and the data table. Note that UNIQUE indexes can have data columns INCLUDEd in the index pages. This is very useful when the majority of needed columns are already in the index and another column or two is needed from the data row. If the row contains many columns, having all of the needed columns in an index becomes significantly more efficient than the alternative.

5.5.6 RETEST THE ENTIRE WORKLOAD AFTER SQL PERFORMANCE TUNING

Making index changes while tuning individual SQL statements may have unplanned impacts on other parts of a given workload. It is important to retest the entire workload after tuning the SQL statements individually. Use the recorded baselines to compare performance improvements. Compare the ending explain plans and estimated TIMERONS (the unit of estimated run resource cost).
5.5.7 DB2 INDEX ADVISOR

DB2 has a tool to review and recommend indexes for a specified query workload. This tool reads a file of SQL statements and generates a list of used and recommended indexes for that workload (or statement), as well as a list of unused indexes. The output specifies the estimated percentage of performance improvement for each newly recommended index and its expected size. Note that this tool may recommend a list of indexes to add for a given workload or statement; adding indexes involves a tradeoff of storage space and processing time, so be very cautious when adding indexes. See the IBM DB2 Information Center for further details of this tool.
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.qb.dbconn.doc/doc/c0004770.html
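A typical Design Advisor invocation might look like the following; the database name, workload file, and time limit are illustrative:

```shell
# Ask the Design Advisor to analyze a workload file of SQL statements
# against database SAMPLE, spending at most 5 minutes, and write the
# recommended index DDL to a file for DBA review:
db2advis -d SAMPLE -i workload.sql -t 5 -o recommended_indexes.ddl
```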
5.6.1 VISUAL EXPLAIN
Visual Explain lets you view the access plan for explained SQL or XQuery statements as a graph. You can use the information available from the graph to tune your queries for better performance. Important: Access to Visual Explain through the Control Center tools has been deprecated in Version 9.7 and might be removed in a future release. (Accessing Visual Explain functionality through the Data Studio toolset has not been deprecated.) You can use Visual Explain to:
- View the statistics that were used at the time of optimization. You can then compare these statistics to the current catalog statistics to help you determine whether rebinding the package might improve performance.
- Determine whether or not an index was used to access a table. If an index was not used, Visual Explain can help you determine which columns might benefit from being indexed.
- View the effects of performing various tuning techniques by comparing the before and after versions of the access plan graph for a query.
- Obtain information about each operation in the access plan, including the total estimated cost and number of rows retrieved (cardinality).
The access plan graph displays objects such as:

- Tables (and their associated columns) and indexes
- Operators (such as table scans, sorts, and joins)
- Table spaces and functions
Note: Visual Explain cannot be invoked from the command line, but only from various database objects in the Control Center. To start Visual Explain:
- From the Control Center, right-click a database name and select either Show Explained Statements History or Explain Query.
- From the Command Editor, execute an explainable statement on the Interactive page or the Script page.
- From Query Patroller, click Show Access Plan from either the Managed Queries Properties notebook or the Historical Queries Properties notebook.
5.6.2 DB2EXPLN FACILITY

DB2 comes with an operating-system-level command to generate the explain plan for a given SQL statement. See the IBM DB2 Information Center for further details of this tool.
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.qb.dbconn.doc/doc/c0004770.html
Description of db2expln output Explain output from the db2expln command includes both package information and section information for each package.
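For example, to explain a single dynamic statement from the operating-system command line (the database name and statement are illustrative):

```shell
# Print the access plan, including the operator graph, to the terminal:
db2expln -database SAMPLE \
         -statement "SELECT CUST_ID FROM ORDERS WHERE ORDER_TOTAL > 100" \
         -terminal -graph
```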
6.0 Appendix
Worldwide CSC Headquarters The Americas 3170 Fairview Park Drive Falls Church, Virginia 22042 United States +1.703.876.1000 Europe, Middle East, Africa Royal Pavilion Wellesley Road Aldershot, Hampshire GU11 1PZ United Kingdom +44(0)1252.534000 Australia 26 Talavera Road Macquarie Park, NSW 2113 Australia +61(0)29034.3000 Asia 139 Cecil Street #06-00 Cecil House Singapore 069539 Republic of Singapore +65.6221.9095 About CSC The mission of CSC is to be a global leader in providing technology-enabled business solutions and services. With the broadest range of capabilities, CSC offers clients the solutions they need to manage complexity, focus on core businesses, collaborate with partners and clients, and improve operations. CSC makes a special point of understanding its clients and provides experts with real-world experience to work with them. CSC is vendor-independent, delivering solutions that best meet each client's unique requirements. For more than 45 years, clients in industries and governments worldwide have trusted CSC with their business process and information systems outsourcing, systems integration and consulting needs. The company trades on the New York Stock Exchange under the symbol CSC.