The document describes linear regression as a statistical technique for forecasting the performance of Oracle-based systems. It explains that linear regression finds the linear relationship between two variables, such as the number of user calls and CPU utilization. It discusses exploring the relationship through scatter plots and correlation coefficients. A strong positive correlation is found between the number of order lines and CPU utilization using sample data. The regression coefficients of the linear model are then estimated using the least squares method in Excel, resulting in the best-fit line and equation to forecast CPU utilization based on order lines.
We may encounter several situations in our daily work when we are asked to forecast the performance of Oracle-based systems. Many questions come to mind: How much load from a particular business activity can our existing system support before running out of gas? What is the optimal volume of a business activity that the system can support without any performance problems? If the workload grows by x percent every quarter, when will we need to add more capacity to the system?
The questions you are asked to answer may vary according to your business requirements, but the bottom line remains the same. As a capacity planner, you are asked to do the capacity planning and inform management of additional capacity requirements in a proactive fashion. This paper is all about forecasting the performance of Oracle-based systems. Throughout the paper, I will explain a statistical method called linear regression, an industry-proven, easy and time-saving technique to answer all such complex questions. I hope that after reading the paper, you will feel more confident while forecasting Oracle performance. The approach followed here is based on a statistical technique, so if you are not from a statistics background, at times the text and terminology may feel a bit hard to digest. Don't worry! I have tried to explain the statistical terms in detail, even diverting from the main theme at times to do so.
What is Linear Regression?
In very simple words, regression analysis is a method for investigating relationships among variables. In the context of Oracle, examples of such relations are: number of sessions vs. memory utilization, physical I/O vs. disk subsystem utilization, and so on. Regression relations can be classified as linear and nonlinear, simple and multiple. For the sake of applicability, here we are only concerned with simple linear regression (or simply, linear regression).
Linear regression tries to find a linear relationship between two variables. The general form of such a relation is y = mx + c, which is also the equation of a straight line (don't get confused by the mathematical terminology). Here c represents the y-intercept of the line and m represents the slope. Seems a bit complex? Let me explain it in simple words. As performance analysts, we need to find a relationship between two technical metrics, such as user calls and CPU utilization. Such a relationship can be expressed as an equation. Here the variable user calls is used to predict the value of CPU utilization and is known as the explanatory variable, or simply the predictor. On the other hand, the variable whose value is to be predicted is known as the response variable or dependent variable. We generally denote the response variable by Y and the predictor variable by X. If we apply these notations to the above example, we can write the relation as:
CPU utilization = m * user calls + c
Here m and c are known as regression coefficients or parameters, whose values we need to determine. Once we have these values, we can predict the system's CPU utilization based on the value of user calls.
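Once m and c are known, applying the model is a single multiply-and-add. Here is a minimal sketch (Python is used purely for illustration; the paper's own calculations are done in Excel, and the coefficient values are the ones estimated later in this paper):

```python
def predict(m, c, x):
    """Evaluate the linear model y = m*x + c for a predictor value x."""
    return m * x + c

# Slope and y-intercept of the best-fit line estimated later in this paper
# for order lines/day vs. CPU utilization.
m, c = 0.001944, 4.631658

# Forecast CPU utilization for 13141 order lines/day.
cpu = predict(m, c, 13141)
print(round(cpu, 2))  # 30.18 (% CPU utilization)
```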
Exploring relationships with Scatters and Correlations
Before we dig into linear regression to predict a value of the response variable, there are some conditions that should be met. First, there should exist a linear relationship between the response and predictor variables; in other words, if we plot a scatter diagram of X and Y, the points should follow a straight line. Second, there should be a strong correlation between the response and predictor variables. These two conditions together indicate a strong linear relation between the response and predictor variables. Let's discuss them one by one using an example.
Throughout this paper, we will discuss a data warehouse support system. The major workload is warehouse orders, so the key business metric is identified as the number of order lines. We wish to find a relation where we can predict CPU utilization based on the number of order lines entered into the system. Here the response variable (Y) is CPU utilization and the predictor variable (X) is the number of order lines. We collected 31 samples of CPU utilization and order line entries throughout the month (the full data set is tabulated, together with residuals, in the Outlier Removal Process section).
Exploring relationships with Scatter plot
A scatter plot is a two-dimensional graph that displays pairs of data, one pair per sample, in (x, y) format. We can use Excel's Chart Wizard to plot a scatter diagram for the above example.
Figure 1: Scatter plot of Order lines vs. CPU Utilization
Here you can see that the relationship appears to be a straight line, except for some points where CPU utilization does not follow the trend. These points are known as outliers, and we will discuss them later in the paper. Now that we have confirmed the linearity condition is met, we need to check how strong the relation is.
Exploring relationships with Correlation Coefficient
The correlation coefficient measures the strength and direction of the linear relationship between the response and predictor variables. It is a number, usually denoted r, between -1 and +1. The higher the absolute value of r, the stronger the linear relationship between the variables. A negative value of the correlation coefficient represents a negative relation between the response and predictor variables, meaning that if X increases then Y decreases, and vice versa. We can calculate the correlation coefficient of a set of x and y values using the following formula:
Corr coeff r = Σ(xi − x̄)(yi − ȳ) / SQRT( Σ(xi − x̄)² * Σ(yi − ȳ)² )

where x̄ and ȳ are the means of the x-values and y-values respectively.
Again, there is no need to solve this complex mathematical equation; Excel's predefined function CORREL() is available for us (good news!).
The square of the correlation coefficient (i.e., r²) denotes how much of the response variable can be explained by the predictor variable. In our example, the correlation coefficient is calculated as r = 0.818 and r² = 0.669, so we can say that 66.9% of the CPU utilization output can be explained by the order lines/day data. The remaining 33.1% cannot be explained by the predictor under observation, but only by some other variables (other technical metrics). In order to forecast precisely, we need to find the variable that best explains the response variable, or in other words, whose correlation strength is maximum.
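As a sanity check on the formula (and on what Excel's CORREL() returns), here is a small Python sketch; the data set is made up for illustration:

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient, the same quantity Excel's CORREL() returns."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# A perfectly linear relation gives r = 1; flipping the sign gives r = -1.
xs = [0, 1000, 2000, 3000, 4000]
ys = [5.0, 7.0, 9.0, 11.0, 13.0]           # lies exactly on y = 0.002x + 5
print(correl(xs, ys))                       # 1.0
print(correl(xs, [-y for y in ys]))         # -1.0
```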
A correlation coefficient of ±1 indicates perfect correlation between the response and predictor variables (you will rarely find it in real environments), while zero (0) indicates an absence of correlation. Any other value between -1 and +1 indicates a limited degree of correlation. The following table* gives the practical meaning of various correlation coefficients:
Correlation Coefficient (r)    Practical Meaning
0.0 to 0.2                     Very weak
0.2 to 0.4                     Weak
0.4 to 0.7                     Moderate
0.7 to 0.9                     Strong
0.9 to 1.0                     Very strong

* Reference 1
Regression Coefficients Estimation
Now that we have established a strong linear relationship between the response variable and the predictor, we are ready to estimate the regression coefficients. This is equivalent to finding the equation of the straight line that best fits the points on the scatter diagram of the response variable versus the predictor variable. Various methods are available, one of which is known as the least squares method. It gives the equation of the straight line that minimizes the sum of squares of the vertical distances from each point to the line.
Using the least squares method, the slope m and y-intercept c are given as:

Slope m = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

y-intercept c = ȳ − m * x̄

where x̄ = mean of the x-values and ȳ = mean of the y-values.
If we don't want to solve these mathematical formulas, Microsoft Excel is there to help. It has the predefined statistical functions SLOPE() and INTERCEPT(), which take the known y- and x-values and return the slope and y-intercept of the regression line respectively. Excel can also plot the best-fit line along with its equation: select the graph's points, right-click, and select Add Trendline; choose a linear trendline and, on the Options tab, check the "Display equation on chart" box. Here is the scatter plot for our previous example along with the best-fit line and its equation:
Figure 2: Best-fitted line for Order lines vs. CPU Utilization plot
In this case, slope m = 0.001944 and y-intercept c = 4.631658 (about 0.0019 and 4.6317).
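For readers without Excel at hand, the least squares formulas translate directly into a few lines; this Python sketch uses a tiny illustrative data set rather than the paper's samples:

```python
def least_squares(xs, ys):
    """Slope m and y-intercept c of the least squares line,
    equivalent to Excel's SLOPE() and INTERCEPT()."""
    n = len(xs)
    mx = sum(xs) / n                       # mean of the x-values
    my = sum(ys) / n                       # mean of the y-values
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    c = my - m * mx                        # the line passes through (mx, my)
    return m, c

# Points lying exactly on y = 2x + 3 recover m = 2 and c = 3.
m, c = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
print(m, c)  # 2.0 3.0
```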
Forecasting through Linear Regression
Now we have all the data we need for forecasting: the equation of the straight line that best fits the response and predictor variables. Suppose we want to predict CPU utilization when there are 13141 order lines/day. We can calculate this as follows:
Y-estimated = 0.001944 * 13141 + 4.631658 = 30.18%
From the sample data, we know that at 13141 order lines/day the observed CPU utilization is 28.08%, so there is a difference of about 2.1 percentage points between the actual and estimated CPU utilization. This is known as the residual. In formal language, the residual is the difference between the observed value of y and the estimated value of y. In general, a positive residual means you have underestimated y at that point, and a negative residual means you have overestimated y.

Residuals are expected in any data set, but some data points lie far away from the main data trend; such points are known as outliers. In a real production environment, outliers may be caused by various things: backups, one-time report generation processes, or problems in the data collection itself. Any process that is not part of the routine workload will produce outliers. While outliers distort the regression coefficients, they sometimes alert us to problems in the data collection process. The bottom line is that outlier removal is a most important part of forecasting through regression analysis. When we detect outliers, we need to decide whether they are part of the normal workload; if they are, we should include them in our analysis, otherwise remove them. The important thing to note is that we should have proper documentation and justification for every data point that has been removed.
Outlier Removal Process
The outlier removal process can sometimes be frustrating and time-consuming; further, we should know when to stop. Discussion with a business/system specialist may help us identify outliers: points that are not part of the normal workload. Statistically, outliers can be detected using the following method. First, standardize the residuals: from each calculated residual, subtract the mean (average) of the residuals and divide by their standard deviation. Residuals have the property that their mean is zero; in other words, the error above the linear regression line (positive residuals) and below the regression line (negative residuals) is equal. The following table contains the standardized residuals for the case study we are discussing:
Sample   Order Lines (X)   CPU Utilization (Y)   Estimated Y   Residual   Residual Square   Stnd Residual
27       11330             4.31                  26.66         -22.35     499.5241          -2.87
8        12259             46.38                 29.42          16.95     287.3514           2.53
16       10901             40.02                 25.83          14.19     201.4710           1.84
15        5971             29.41                 16.24          13.17     173.4282           1.70
1        16483             27.01                 36.68          -9.67      93.4851          -1.25
23        5311             23.58                 14.96           8.62      74.3461           1.12
29       10679             33.44                 25.39           8.05      64.7328           1.04
26        7340             26.76                 18.90           7.86      61.7408           1.02
4        11986             20.56                 27.94          -7.38      54.3975          -0.95
22       11450             34.22                 26.89           7.33      53.6799           0.95
3        12015             21.74                 27.99          -6.25      39.0856          -0.81
19       12938             34.82                 29.79           5.03      25.3372           0.65
9         6531             21.95                 17.33           4.62      21.3484           0.60
24       17073             33.60                 37.83          -4.23      17.8580          -0.55
5         1119              2.85                  6.81          -3.96      15.6600          -0.51
20        1158              3.22                  6.88          -3.66      13.4183          -0.47
25       11336             23.36                 26.67          -3.31      10.9674          -0.43
6            0              1.41                  4.63          -3.22      10.3791          -0.42
21           0              1.43                  4.63          -3.20      10.2506          -0.41
7            0              1.45                  4.63          -3.18      10.1229          -0.41
14           1              1.62                  4.63          -3.01       9.0818          -0.39
18       13728             28.34                 31.32          -2.98       8.8944          -0.39
17       14271             29.86                 32.38          -2.52       6.3407          -0.33
10       14086             29.55                 32.02          -2.47       6.0930          -0.32
13         454              3.26                  5.51          -2.25       5.0821          -0.29
2        13142             32.43                 30.18           2.25       5.0489           0.29
12       13141             28.08                 30.18          -2.10       4.4145          -0.27
28           0              2.62                  4.63          -2.01       4.0468          -0.26
31       12827             28.11                 29.57          -1.46       2.1333          -0.19
11       12797             30.04                 29.51           0.53       0.2785           0.07
30       12803             29.19                 29.52          -0.33       0.1115          -0.04
Sum                                                              0.0                         0.0

After calculating the standardized residuals, plot them against the estimated values of the response variable. This is known as a standardized residual plot. The motive behind plotting standardized residuals is to check the normality* of the residuals. Most of the standardized residuals should be close to zero, and as we move away from zero, the frequency of the residuals should decrease. Further, a standardized residual at or beyond -3 or +3 is not acceptable (due to the empirical rule of standard scores). Such points should be considered outliers and investigated further. Secondly, residuals should occur at random; they shouldn't follow a pattern. A pattern in a residual plot implies that the regression line may not be fitting right.
* Note: For normality and the empirical rule, I have another document titled "Statistics Basics for Forecasting".
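The standardization and flagging steps above can be sketched in a few lines of Python (used here only to illustrate the arithmetic; the made-up residuals are illustrative, and the 2.5 cutoff matches the practical threshold applied later in this paper, with ±3 being the hard limit from the empirical rule):

```python
from statistics import mean, stdev

def standardize(residuals):
    """Subtract the mean of the residuals and divide by their standard deviation."""
    m, s = mean(residuals), stdev(residuals)
    return [(r - m) / s for r in residuals]

# Nine well-behaved residuals and one suspiciously large one (index 9).
residuals = [0.5, -0.5, 0.4, -0.4, 0.3, -0.3, 0.2, -0.2, 0.1, 15.0]
z = standardize(residuals)
suspects = [i for i, v in enumerate(z) if abs(v) > 2.5]
print(suspects)  # [9]
```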
An alternative way of detecting outliers is to square the residuals and sort the data set in descending order. By doing so, all the data points with large residuals (positive or negative) come to the top. Further investigation of these data points will tell us whether we should exclude them from our analysis or not.
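This ranking amounts to a simple sort; here is a minimal Python sketch with hypothetical sample numbers and residuals:

```python
# (sample number, residual) pairs -- hypothetical values for illustration.
samples = [(1, -1.2), (2, 0.5), (3, 16.9), (4, -22.3), (5, 2.1)]

# Square the residuals and sort descending, so the biggest offenders surface first.
ranked = sorted(samples, key=lambda s: s[1] ** 2, reverse=True)
print([s[0] for s in ranked])  # [4, 3, 5, 1, 2]
```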
Figure 3: Standardized residual plot, (a) before outlier removal and (b) after outlier removal
From the above table and figure 3(a), we can see that sample numbers 27 and 8 can be treated as outliers, as their standardized residuals are greater than 2.5 in absolute value and their residual squares are very high compared to the other data points. Now it's time to investigate the reason for these observations.
In this particular case, I found that sample number 8 was collected on a Sunday, when some statistics gathering processes were running, which resulted in high CPU utilization. That's why we have a positive residual (we underestimated CPU utilization). Since this is not part of the standard business workload, we can safely remove this data point from our data set after proper documentation.

Digging further into the details of sample number 27, I found that there was no appreciable CPU utilization during the entire day. The overall workload was light, so even though 11000+ order lines were served, the resulting average CPU utilization was only 4%. We need to discuss the reason for the low workload with the system specialist. This may be due to a skewness effect in the CPU utilization data (CPU utilization is not evenly distributed throughout the day). We can reduce the skewness effect by increasing the data collection frequency (in this particular case, to once per hour, for example). It should be noted that while increasing the data collection frequency will reduce the skewness problem, it may result in higher resource consumption on the production servers; however, an optimized data collection process can save precious resources (like memory and CPU cycles).

Proper outlier detection and removal is a very important step in regression analysis, since outliers can affect the regression formula and make the overall forecasting less precise. In our example, after removing sample numbers 27 and 8 from the analysis, the correlation coefficient (r) increases to 0.884 and r² to 0.782. This means that 78.2% of CPU utilization can now be explained by order lines/day. Compare these values with the previous ones, where r = 0.818 and r² = 0.669.
Figure 4 shows the scatter plot of the data set without the outliers. Clearly, the regression line fits better without the outliers, with more data points close to the line.
Figure 4: Scatter plot of CPU Utilization vs. Order Lines/day after Outliers removal
Avoiding Extrapolation
In linear regression, we end up with the equation of a straight line that best fits our data; we substitute a value for x and get a predicted value for y. Plugging in x values that fall outside a reasonable bound is known as extrapolation, and we should avoid it with linear regression; in fact, if we do extrapolate, we will get wrong results. This can be considered the main drawback of linear regression: it can forecast only within the linear region of the relation, because system performance is not linear. For example, in our case, when the number of order lines increases, host CPU utilization increases; but at around 70-75% CPU utilization, queueing behavior comes into the picture, and CPU utilization does not remain linear after that. Thus we can forecast through linear regression only within the linear range of the x and y values. In our case, the minimum and maximum observed values of order lines per day are 0 and 17073 respectively. Choosing an order lines/day value between 0 and 17073 to forecast is reasonable, but going beyond these limits (a value greater than 17073) isn't a good idea: we cannot be sure that the same linear relationship between order lines/day and CPU utilization will hold beyond 17073 order lines/day. So the bottom line is: never forecast in nonlinear areas. For the CPU subsystem this limit is around 75% utilization, and for the I/O subsystem it is around 65%.
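A practical safeguard is to let the forecasting code itself refuse to extrapolate. This Python sketch (illustrative, not part of the paper's Excel workflow) hard-codes the observed range from our example:

```python
def forecast(m, c, x, x_min, x_max):
    """Evaluate y = m*x + c only inside the observed range of the predictor;
    outside it, the linear relationship is unverified, so refuse to extrapolate."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"{x} is outside the observed range [{x_min}, {x_max}]; "
                         "refusing to extrapolate")
    return m * x + c

m, c = 0.001944, 4.631658      # best-fit coefficients from this paper's example
print(round(forecast(m, c, 13141, 0, 17073), 2))   # 30.18 -- inside the range, fine
# forecast(m, c, 25000, 0, 17073) would raise ValueError -- extrapolation
```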
Forecasting in the absence of a single correlated variable
In our example, we found that order lines/day is correlated with CPU utilization and, after outlier removal, explains 78.2% of its variation (the best-fit line after outlier removal is y = 0.001940x + 4.825653, with R² = 0.781843; a correlation coefficient of 0.884 is strong).

We need to collect various business workload metrics and check which correlates most strongly with the response variable. The identification process for such metrics is sometimes frustrating and may end in failure. A good approach is to discuss the system's behavior with all the concerned persons (business/system specialists, application engineers, etc.), as they know the business and its impact on the system workload best. It is always better to end the discussion with a good set of identified metrics; no one wants to re-arrange the meeting if the initial metrics turn out not to be correlated with the response variable.

There may be a case where no single metric correlated with the response variable can be identified. We can then select a set of variables that are moderately correlated with the response variable, even though none of them is highly correlated on its own. The statistical regression approach in which we predict the response variable on the basis of multiple predictor variables is known as multiple regression analysis. Discussion of that approach is beyond the scope of this paper; I have a separate document that explains the concept behind multiple regression.
Linear Regression Functions in Oracle
Oracle has built-in linear regression functions that support the least squares method for regression coefficient estimation.
The functions are as follows:
REGR_COUNT Function: It returns the number of non-null number pairs used to fit the regression line.
REGR_AVGY and REGR_AVGX Functions: REGR_AVGY and REGR_AVGX compute the averages of the dependent variable and the independent variable of the regression line, respectively. REGR_AVGY computes the average of its first argument (dependent variable Y) after eliminating number pairs where either of the variables is null. Similarly, REGR_AVGX computes the average of its second argument (independent variable X) after null elimination.
REGR_SLOPE and REGR_INTERCEPT Functions: The REGR_SLOPE function computes the slope of the regression line fitted to non-null number pairs. The REGR_INTERCEPT function computes the y-intercept of the regression line.
REGR_R2 Function: The REGR_R2 function computes the coefficient of determination (usually called "R-squared" or "goodness of fit") for the regression line.
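A detail worth noting across all these functions is the null handling: a pair is ignored if either value is null. The following Python fragment is not Oracle code, only an illustration of that pair-elimination semantics:

```python
def regr_pairs(ys, xs):
    """Keep only the (y, x) pairs where both values are present,
    mimicking how Oracle's REGR_* functions eliminate null pairs."""
    return [(y, x) for y, x in zip(ys, xs) if y is not None and x is not None]

ys = [10.0, None, 30.0, 40.0]
xs = [1.0, 2.0, None, 4.0]
pairs = regr_pairs(ys, xs)
print(len(pairs))                              # 2 -- what REGR_COUNT would report
print(sum(y for y, _ in pairs) / len(pairs))   # 25.0 -- what REGR_AVGY would report
```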
Sample code- Automate Linear Regression through Oracle functions
Following is sample code that can be used for linear regression analysis in Oracle performance forecasting; here I have tried to give an example of Oracle's linear regression functions. This sample code chooses the database workload metric logical reads per sec as the predictor variable. You can customize the code to include other metrics as well, like user calls per sec, executions per sec, etc.

Description:
Table regression_data: used for storing database technical metrics and regression data.
Table outlier_data: stores outlier data. Note that outliers are deleted from regression_data and stored in this table for further investigation and documentation purposes.
View dba_hist_sysmetric_summary: this database view is used to fetch historical AWR data.
insert into regression_data (snap_id, begin_time, end_time, logical_reads)
select snap_id, begin_time, end_time, average
  from dba_hist_sysmetric_summary v, gv$instance i
 where metric_name = 'Logical Reads Per Sec'
   and begin_time >= to_date('1-oct-08 00:00:00', 'dd-mon-yy hh24:mi:ss')
   and end_time   <= to_date('31-oct-08 00:00:00', 'dd-mon-yy hh24:mi:ss')
   and v.instance_number = i.instance_number;
declare
  cursor c2 is
    select snap_id, metric_name, begin_time, end_time, maxval, average
      from dba_hist_sysmetric_summary
     where metric_name = 'Host CPU Utilization (%)';
begin
  for table_scan in c2 loop
    update regression_data
       set host_cpu = table_scan.average
     where snap_id = table_scan.snap_id;
  end loop;
end;
/
declare
  -- The original listing omits the declarations; cursor c1 and the local
  -- variables below are reconstructed here to make the block self-contained.
  cursor c1 is
    select snap_id, begin_time, end_time, logical_reads, host_cpu, stnd_residual
      from regression_data;
  outlier_count number;
  intercept     number;
  slope         number;
  stnd_dev      number;
  avg_res       number;
begin
  -- Seed the standardized residuals so the loop body executes at least once.
  update regression_data set stnd_residual = 4;

  select count(*) into outlier_count
    from regression_data
   where abs(stnd_residual) > 3;

  while outlier_count > 0 loop
    -- Re-estimate the regression coefficients on the current data set.
    select round(regr_intercept(host_cpu, logical_reads), 8) into intercept from regression_data;
    select round(regr_slope(host_cpu, logical_reads), 8)     into slope     from regression_data;

    for table_scan in c1 loop
      update regression_data
         set proj_cpu = slope * table_scan.logical_reads + intercept
       where snap_id = table_scan.snap_id;
      -- Residual = observed minus estimated, as defined earlier in the paper
      -- (the original listing had the subtraction reversed, which flips the
      -- sign of the residual but not its magnitude).
      update regression_data
         set residual = host_cpu - proj_cpu
       where snap_id = table_scan.snap_id;
      update regression_data
         set residual_sqr = residual * residual
       where snap_id = table_scan.snap_id;
    end loop;

    select round(stddev(residual), 8) into stnd_dev from regression_data;
    select round(avg(residual), 8)    into avg_res  from regression_data;

    for table_scan2 in c1 loop
      update regression_data
         set stnd_residual = (residual - avg_res) / stnd_dev
       where snap_id = table_scan2.snap_id;
    end loop;

    select count(*) into outlier_count
      from regression_data
     where abs(stnd_residual) > 3;

    if outlier_count > 0 then
      for table_scan3 in c1 loop
        if abs(table_scan3.stnd_residual) > 3 then
          -- Move the outlier to outlier_data for documentation, then drop it.
          insert into outlier_data (snap_id, begin_time, end_time, logical_reads, host_cpu)
          values (table_scan3.snap_id, table_scan3.begin_time, table_scan3.end_time,
                  table_scan3.logical_reads, table_scan3.host_cpu);
          delete from regression_data where snap_id = table_scan3.snap_id;
        end if;
      end loop;
    end if;
  end loop;

  commit;
end;
/
References
1. Statistics Without Tears: A Primer for Non-Mathematicians by Derek Rowntree (Macmillan Publishing Company, 1981)
2. Forecasting Oracle Performance by Craig Shallahamer (Apress, 2007)
3. Intermediate Statistics For Dummies by Deborah Rumsey (Wiley Publishing, 2007)
4. Regression Analysis by Example, Fourth Edition by Samprit Chatterjee and Ali S. Hadi (John Wiley & Sons, 2006)
5. Oracle9i Data Warehousing Guide, Release 2 (9.2), Part Number A96520-01
About the Author
Neeraj Bhatia is a Senior Technical Analyst at Oracle India, based in Noida, India. He has been working with Oracle databases for five years. Currently he is responsible for capacity planning for Oracle-based systems, including Oracle Database, Oracle Application Server, PeopleSoft and Oracle Applications. Prior to this, he worked as a performance analyst for VLDBs. When not working with Oracle, he likes to listen to music, watch movies and spend time with family. He can be reached at: neeraj.dba@gmail.com