Teradata Performance Tuning
1. Introduction
Strong data warehouse performance is critical to keeping users satisfied, meeting service level
agreements (SLAs) and maximizing the return on investment (ROI) in the Teradata system.
Sometimes queries submitted to the data warehouse perform unnecessary full-table scans or other
operations that consume too many system resources. Application tuning is the process of
identifying and tuning target applications to improve performance and to proactively prevent
application performance problems.
Some data warehouses handle millions of queries a day, which makes it difficult for DBAs to identify
suspect queries. A suspect query is one that either consumes too many system resources or does not
take advantage of Teradata's parallelism. Identifying and documenting the frequency of problem
queries offers a more comprehensive view of the queries affecting data warehouse performance and
helps prioritize tuning efforts.
DBQL (Database Query Log) is a rich source of performance data: it provides the full SQL text, CPU
and I/O by query, the number of active AMPs in a query, spool use, the number of query steps and the
full explain text. It also offers the information needed to calculate suspect query indicators such
as large-table scans, skewing (when the Teradata system is not using all the AMPs in parallel) and
large-table-to-large-table product joins.
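The indicators described above can be derived from DBQL with a query along the following lines. This is a minimal sketch, assuming DBQL logging is enabled and the DBC.QryLogV view is readable; the CPU threshold and the one-day window are illustrative assumptions, and column names should be checked against your Teradata release.

```sql
-- Sketch: list yesterday's heaviest queries with a simple CPU skew indicator.
SELECT  QueryID,
        UserName,
        AMPCPUTime,
        TotalIOCount,
        NumOfActiveAMPs,
        SpoolUsage,
        -- HASHAMP() + 1 is the number of AMPs in the system.
        -- A ratio near 1 means even CPU use; larger values mean more skew.
        MaxAMPCPUTime * (HASHAMP() + 1) / NULLIFZERO(AMPCPUTime) AS CPU_Skew_Ratio
FROM    DBC.QryLogV
WHERE   CAST(StartTime AS DATE) = CURRENT_DATE - 1
  AND   AMPCPUTime > 100            -- illustrative threshold, in CPU seconds
ORDER BY AMPCPUTime DESC;
```

Queries that combine high AMPCPUTime with a high skew ratio are natural first candidates for tuning.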
2. Teradata Architecture
Evenly distributed tables result in evenly distributed workloads.
The uniformity of distribution of the rows of a table depends on the choice of the
Primary Index.
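How evenly an existing table is spread across the AMPs can be estimated from the per-AMP space figures in DBC.TableSizeV. The sketch below is illustrative only; the database name Sales_DB and the table name Order_Table are hypothetical.

```sql
-- Sketch: per-table skew from per-AMP permanent space.
-- 0 percent means perfectly even; values near 100 mean one AMP holds most rows.
SELECT  DatabaseName,
        TableName,
        SUM(CurrentPerm) AS Total_Perm,
        MAX(CurrentPerm) AS Max_AMP_Perm,
        AVG(CurrentPerm) AS Avg_AMP_Perm,
        100 * (1 - AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm))) AS Skew_Pct
FROM    DBC.TableSizeV
WHERE   DatabaseName = 'Sales_DB'       -- hypothetical database
  AND   TableName    = 'Order_Table'    -- hypothetical table
GROUP BY 1, 2;
```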
3. Data Distribution
3.1 Primary Index
The value of the Primary Index for a specific row determines the AMP assignment
for that row.
This is done using a hashing algorithm.
Accessing a row by its Primary Index value is always a one-AMP operation and is the
most efficient way to access a row.
There are two types of Primary Index: UPI (Unique Primary Index) and NUPI (Non-Unique Primary Index).
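The hash-based AMP assignment can be inspected with Teradata's hash functions HASHROW, HASHBUCKET and HASHAMP. The following sketch counts rows per AMP for a candidate Primary Index column; the table name Order_Table is an assumption used for illustration.

```sql
-- Sketch: how many rows each AMP would receive if Customer_number
-- were the Primary Index of the (hypothetical) Order_Table.
SELECT  HASHAMP(HASHBUCKET(HASHROW(Customer_number))) AS AMP_No,
        COUNT(*)                                      AS Row_Count
FROM    Order_Table
GROUP BY 1
ORDER BY 2 DESC;
```

Roughly equal counts across the AMPs suggest that the column distributes well.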
3.1.1 Accessing Via a Unique Primary Index
A UPI access is a one-AMP operation that accesses at most a single row.
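As an illustration, the sketch below defines a table with a UPI and retrieves one row through it. The table and column definitions are hypothetical.

```sql
-- Sketch: a table with a Unique Primary Index on customer_number.
CREATE TABLE customer
(
    customer_number INTEGER NOT NULL,
    customer_name   VARCHAR(50)
)
UNIQUE PRIMARY INDEX (customer_number);

-- An equality condition on the UPI column lets the system hash the value
-- and go directly to the one AMP that can hold the row.
SELECT  customer_name
FROM    customer
WHERE   customer_number = 1000;
```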
3.1.2 Row Distribution Using a UPI
Customer_Number may be the preferred access column for the ORDER table, and thus a good
index candidate.
Values for Customer_Number are not unique, since one customer can place several orders.
Choosing Customer_Number as the Primary Index therefore makes it a NUPI.
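Before settling on a NUPI, it can help to check how many rows share each index value, since all rows with the same value land on the same AMP. A minimal sketch, assuming the order table is named Order_Table:

```sql
-- Sketch: the most frequent Customer_number values in the (hypothetical) Order_Table.
-- A few values with very high counts would indicate a skew risk.
SELECT  TOP 20
        Customer_number,
        COUNT(*) AS Rows_Per_Value
FROM    Order_Table
GROUP BY Customer_number
ORDER BY Rows_Per_Value DESC;
```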
Partition by Range
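-- ORDER is a reserved word, so the table is named Order_Table here
CREATE TABLE Order_Table
(
    Ord_number      INTEGER NOT NULL,
    Customer_number INTEGER NOT NULL,
    Order_date      DATE,
    Order_total     INTEGER
)
PRIMARY INDEX (Customer_number)
PARTITION BY RANGE_N (
    -- one partition per month of 2013; the end date covers all of December
    Order_date BETWEEN DATE '2013-01-01' AND DATE '2013-12-31'
    EACH INTERVAL '1' MONTH,
    NO RANGE OR UNKNOWN   -- catch-all partition for out-of-range and NULL dates
);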
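A query that restricts on the partitioning column lets the optimizer read only the matching partitions. The sketch below assumes the partitioned table is named Order_Table; the exact wording of the EXPLAIN output varies by release, but a plan that reads a single partition rather than all rows indicates that partition elimination took place.

```sql
-- Sketch: only the March 2013 partition needs to be scanned.
EXPLAIN
SELECT  Customer_number,
        SUM(Order_total) AS Total_Amount
FROM    Order_Table
WHERE   Order_date BETWEEN DATE '2013-03-01' AND DATE '2013-03-31'
GROUP BY Customer_number;
```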
Whitepaper | TERADATA PERFORMANCE TUNING
Make sure stats are re-collected when at least 10% of the data changes.
Remove unwanted stats and stats that hardly improve the performance of queries.
Collect stats on columns rather than on indexes, since dropping an index drops its stats as well.
Collect stats on multi-column indexes; this can help when these columns are
used in join conditions.
Check that stats are re-created for tables whose structure has changed.
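The maintenance steps above can be sketched with the statements below. Emp_Table and Dept_no are taken from the examples in this paper; whether a given statistic is worth dropping is a judgment call for each system.

```sql
-- List the statistics defined on the table, with their collection date and time
HELP STATISTICS Emp_Table;

-- Re-collect every statistic that is already defined on the table
COLLECT STATISTICS ON Emp_Table;

-- Drop a statistic that no longer helps any query
DROP STATISTICS ON Emp_Table COLUMN Dept_no;
```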
Example 1:
Explain before collecting stats
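One way to see which statistics the optimizer would like to have before any are collected is the HELPSTATS diagnostic. This is a sketch only; the query is a hypothetical example against Emp_Table, and the recommendations appear at the end of the EXPLAIN text.

```sql
-- Ask the optimizer to append statistics recommendations to EXPLAIN output
DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT  Dept_no,
        COUNT(*)
FROM    Emp_Table
WHERE   Dept_no = 10
GROUP BY Dept_no;

-- Switch the diagnostic off again
DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;
```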
Examples:
COLLECT STATISTICS ON Emp_Table;
COLLECT STATISTICS ON Emp_Table COLUMN Dept_no;
COLLECT STATISTICS ON Emp_Table COLUMN (Emp_no, Dept_no);
COLLECT STATISTICS ON Emp_Table INDEX (Emp_no);
COLLECT STATISTICS ON Emp_Table INDEX (First_name, Last_name);
Table-level statistics known as "summary statistics" are collected whenever column or index
statistics are collected.
SHOW SUMMARY STATISTICS VALUES ON Employee_Table;
Make sure the columns you join on have the same data type, so that no data type conversion is needed.
Tip 3: Do not use functions like SUBSTR, COALESCE, CASE ... on the indices used as part of a join
The optimizer cannot use the statistics collected on columns that are wrapped in functions,
because the statistics describe the column values, not the function results.
This might result in product joins and out-of-spool issues, because the optimizer has no
statistics on which to base its decisions.
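The difference can be sketched as follows. Dept_Table and Dept_name are hypothetical; the two queries return the same rows only if no department is numbered 0, so the rewrite has to be checked against the actual data.

```sql
-- Avoid: the function on the join column hides the column's statistics
SELECT  e.Emp_no, d.Dept_name
FROM    Emp_Table  e
JOIN    Dept_Table d
  ON    COALESCE(e.Dept_no, 0) = d.Dept_no;

-- Prefer: join on the bare column so that its statistics can be used
SELECT  e.Emp_no, d.Dept_name
FROM    Emp_Table  e
JOIN    Dept_Table d
  ON    e.Dept_no = d.Dept_no;
```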
Tip 4: Not Null columns
Add an IS NOT NULL condition for join columns that are declared as NULLABLE in the table
definition. NULL values all hash to the same AMP, which can lead to the infamous "NO SPOOL
SPACE" error when that AMP cannot accommodate any more NULL values.
Using the IS NOT NULL condition while joining on the nullable columns of a table is recommended
so that skew can be avoided.
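A minimal sketch of the condition, assuming Dept_no is nullable in Emp_Table and that Dept_Table and Dept_name exist (both are hypothetical):

```sql
-- Rows with a NULL Dept_no can never match in this join, so they are
-- filtered out explicitly before the rows are redistributed for the join.
SELECT  e.Emp_no, d.Dept_name
FROM    Emp_Table  e
JOIN    Dept_Table d
  ON    e.Dept_no = d.Dept_no
WHERE   e.Dept_no IS NOT NULL;
```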
Tip 5: Usage of Like clause
Example:
LIKE '%SUBIN%' is processed differently from LIKE 'SUBIN%'.
In the former, the optimizer needs to do a full table scan, which reduces performance.
In the latter, the optimizer can make use of an index to run the query, thereby improving
performance.
If LIKE is used in a WHERE clause, it is better to use one or more leading characters in
the pattern, if at all possible.
Hence it is suggested to use '%SUBIN%' only if SUBIN can occur in the middle of a longer
value.
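The two pattern forms look like this in a query. The sketch uses Emp_Table and First_name from the statistics examples; whether an index is actually used depends on the indexes defined on the table.

```sql
-- Leading characters are known: the pattern is anchored at the start of the value
SELECT  Emp_no, First_name
FROM    Emp_Table
WHERE   First_name LIKE 'SUBIN%';

-- Leading wildcard: every value has to be examined
SELECT  Emp_no, First_name
FROM    Emp_Table
WHERE   First_name LIKE '%SUBIN%';
```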
Tip 6: Distinct Vs Group by
Both return the same number of rows, but with some difference in execution time.
GROUP BY sorts the data locally on each vproc, while DISTINCT redistributes the data and then
sorts it.
When the data in a table is nearly unique, GROUP BY spends extra time attempting to
eliminate duplicates that do not exist at all.
DISTINCT redistributes the rows immediately, so more data may move between the AMPs,
whereas GROUP BY sends only unique values between the AMPs.
Steps used in each case for elimination of duplicates:
GROUP BY
1) It reads all the rows that are part of the GROUP BY.
2) It removes all duplicates on each AMP for the given set of values using the
"BUCKETS" concept.
3) It hashes the unique values on each AMP.
4) It redistributes them to the appropriate AMPs.
5) Once redistribution is completed, it
a. sorts the data to group duplicates on each AMP
b. removes all the duplicates on each AMP and sends the
original/unique value
DISTINCT
Hence it is better to go for:
GROUP BY: when there are many duplicates
DISTINCT: when there are few or no duplicates
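Applied to the Emp_Table used earlier, the rule of thumb might look like the sketch below. It assumes Dept_no repeats many times and Emp_no is nearly unique; comparing the EXPLAIN output of both forms on your own data is the safer test.

```sql
-- Many duplicates expected: GROUP BY removes them on each AMP before redistribution
SELECT  Dept_no
FROM    Emp_Table
GROUP BY Dept_no;

-- Few or no duplicates expected: DISTINCT skips the local de-duplication pass
SELECT  DISTINCT Emp_no
FROM    Emp_Table;
```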
Tip 7: Which is faster? SELECT * FROM table or SELECT <all columns> FROM table?
With "SELECT * FROM table", an extra stage is added in which Teradata replaces * with the
column names before it fetches the data.
Using "SELECT <all columns> FROM table" eliminates this extra stage of resolving the
columns of the table.
Hence it is always recommended to use "SELECT <all columns> FROM table".
Tip 11: Strategic Semicolon
Every SQL statement ends with a semicolon.
In some cases, the strategic placement of this semicolon can improve the run time of a group
of SQL statements.
It will not improve the run time of an individual SQL statement.
Example:
1) The group's run time could be improved if the SQL statements in the group
share the same tables (or spool files).
2) The group's run time could be improved if several SQL statements use the
same Unix input file.
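A minimal sketch of such placement, assuming the statements are submitted through BTEQ: when the semicolon that ends one statement starts the line of the next statement, BTEQ keeps both in the same request (a multi-statement request). The queries themselves are hypothetical.

```sql
-- Sent by BTEQ as ONE multi-statement request: the request ends only at
-- the semicolon that is the last character on its line.
SELECT COUNT(*) FROM Emp_Table WHERE Dept_no = 10
;SELECT COUNT(*) FROM Emp_Table WHERE Dept_no = 20
;SELECT COUNT(*) FROM Emp_Table WHERE Dept_no = 30;
```

Written with each semicolon at the end of its own line, the same three statements would be sent as three separate requests.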
Tip 12: Unix split OR Unix concatenation
Split
A large Unix input file can be split into several smaller Unix files, which can then be
input in series, or in parallel, to create smaller SQL processing steps.
Concatenation
A large query can be broken up into smaller independent queries, whose output is
written to several smaller Unix files.
These smaller files are then concatenated to provide a single Unix file.
Tip 15: Top Vs SAMPLE
TOP 10 means "first 10 rows in sorted order".
The optimizer is free to select the cheapest plan it can find and to stop processing as soon as
it has found enough rows to return.
SAMPLE does extra processing to try to randomize the result.
At a very simple level, for example, it could pick a random point at which to start scanning
the table and a number of rows to skip between the rows that are returned.
TOP is most useful with larger tables and queries: rather than running the entire query and
then returning sample records, it picks the first (or 'top') 10 records returned from any
node as the query runs, and then stops the query.
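Both forms are shown below against the Emp_Table used in the earlier examples, as a minimal sketch.

```sql
-- TOP without ORDER BY: returns any 10 rows and can stop early
SELECT  TOP 10 *
FROM    Emp_Table;

-- SAMPLE: returns 10 randomly chosen rows, at the cost of extra processing
SELECT  *
FROM    Emp_Table
SAMPLE  10;
```

When the goal is only to look at a few rows of a large table, the TOP form is usually the cheaper choice.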
Tip 19: NO LOG for volatile tables
Create volatile tables with the NO LOG option, so that changes to them are not recorded in the
transient journal.
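A minimal sketch of such a definition; the table name and columns are hypothetical.

```sql
-- Volatile table whose changes are not written to the transient journal
CREATE VOLATILE TABLE vt_emp_dept, NO LOG
(
    Emp_no  INTEGER NOT NULL,
    Dept_no INTEGER
)
PRIMARY INDEX (Emp_no)
ON COMMIT PRESERVE ROWS;   -- keep the rows after each transaction ends
```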
Tip 20: UPDATE clause and replacing UPDATE with DELETE & INSERT
Do not write an UPDATE statement with only a SET clause and no WHERE clause.
Even if the target or source has just one row, add a WHERE clause on the PI column.
Sometimes replacing UPDATE with DELETE & INSERT can save a good amount of AMP CPU.
Check whether this holds for your query.
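Both variants are sketched below for a single row. The sketch assumes Emp_no is the Primary Index of Emp_Table and that the table has only the four columns shown; the literal values are placeholders.

```sql
-- UPDATE restricted through the PI column
UPDATE  Emp_Table
SET     Dept_no = 20
WHERE   Emp_no = 1001;

-- Alternative to compare: DELETE followed by INSERT of the new row image
DELETE FROM Emp_Table
WHERE   Emp_no = 1001;

INSERT INTO Emp_Table (Emp_no, Dept_no, First_name, Last_name)
VALUES (1001, 20, 'First', 'Last');
```

Comparing the AMP CPU of both variants in DBQL shows which one is cheaper for a given workload.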
A query that lists the same customer numbers in an IN clause is much less efficient than:
SELECT customer_number, customer_name
FROM customer
WHERE customer_number BETWEEN 1000 AND 1004;
Assuming there is a useful index on customer_number, the Query Optimizer can locate a range
of numbers much faster (using BETWEEN) than it can find a series of numbers using the IN clause.
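For comparison, the IN-list form that this passage contrasts with BETWEEN would presumably look like the following. The value list is an assumption derived from the BETWEEN range above.

```sql
-- Presumed IN form of the same query: each value is looked up individually
SELECT  customer_number, customer_name
FROM    customer
WHERE   customer_number IN (1000, 1001, 1002, 1003, 1004);
```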
Tip 25: MultiLoad delete or Delete command
MultiLoad delete is faster than the normal DELETE command, since the deletion happens in data
blocks of 64 KB, whereas the DELETE command deletes data row by row.
The transient journal maintains entries only for the DELETE command, since Teradata utilities do
not support transient journaling.
6. Glossary
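Acronym   Expansion
SLA       Service Level Agreement
ROI       Return On Investment
DBQL      Database Query Log
AMP       Access Module Processor
PE        Parsing Engine
UPI       Unique Primary Index
NUPI      Non Unique Primary Index
PPI       Partitioned Primary Index
FTS       Full Table Scan
PI        Primary Index
SI        Secondary Index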