Correlation and Regression

Discussion
When comparing two different variables, there are two questions to be considered: “Is
there a relationship between the two variables?” and “How strong is that relationship?” These
questions can be answered using regression and correlation. Regression answers whether there is a
relationship, and correlation answers how strong the linear relationship is. A scatter plot is a
graphical representation of the relation between two or more variables. In the scatter plot of two
variables x and y, each point on the plot is an x-y pair. An inspection of a scatter plot can give an
impression of whether two variables are related and the direction of their relationship.
Correlation is a statistical measure that indicates the linear relationship or association
between variables, but no causal relationship is implied. The descriptive statistic that expresses
the degree of relation between two variables is called the correlation coefficient. The population
correlation coefficient is denoted by ρ, and the sample correlation coefficient is denoted by r.
Correlation also tells the strength and the direction of the linear relationship. A positive
correlation indicates the extent to which the variables increase or decrease in parallel; a negative
correlation indicates the extent to which one variable increases as the other decreases. The range
of values of the correlation coefficient is from -1 to +1. If there is a perfect positive linear
correlation between variables, the value of r is equal to +1. On the other hand, a perfect negative
linear correlation between variables has a value of r equal to -1. If there is no linear correlation
between variables, r equals 0.
Since correlation is an effect size, we can describe the strength of the correlation using
the guide that Evans (1996) suggested for the absolute value of r:
• .00-.19 very weak or negligible correlation
• .20-.39 weak or low correlation
• .40-.59 moderate correlation
• .60-.79 strong or moderately high correlation
• .80-1.0 very strong or high correlation
For example, a correlation value of 0.51 would be a moderate positive correlation.
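The guide above can be expressed as a small lookup. The following is a minimal sketch only: the function name `describe_strength` is hypothetical, and it assumes that values falling between the printed bands (for example 0.195) belong to the lower band.

```python
def describe_strength(r):
    """Return (strength, direction) for r using the Evans (1996) guide.

    Assumption: each band is half-open, so 0.20 starts the "weak" band,
    0.40 starts "moderate", and so on.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")

    size = abs(r)  # the guide applies to the absolute value of r
    if size < 0.20:
        strength = "very weak"
    elif size < 0.40:
        strength = "weak"
    elif size < 0.60:
        strength = "moderate"
    elif size < 0.80:
        strength = "strong"
    else:
        strength = "very strong"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


print(describe_strength(0.51))
```

With r = 0.51 the function reports a moderate positive correlation, matching the example in the text.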
SCATTER PLOT OF CORRELATION COEFFICIENT
[Figure: scatter plot panels labeled Positive Linear, Negative Linear, No Relationship, None or
Weak, Weak Linear, Strong Linear, Nonlinear or Curvilinear, and Linear or Exponential?]
One way to determine the value of the correlation coefficient is by using the Pearson
Product-Moment Coefficient, or Pearson r, whose formula is:
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Where n = number of data pairs
x = observed data for the independent variable
y = observed data for the dependent variable
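As an illustration, the Pearson r formula can be written directly from the five sums it uses. This is a sketch under stated assumptions: the function name `pearson_r` and the two small data sets are made up for demonstration and are not from the source.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from the computational (sum-based) formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when x or y does not vary")
    return numerator / denominator


# Hypothetical data: y rises exactly with x, then falls exactly with x.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The two calls show the extremes of the range: a perfectly increasing line gives +1 and a perfectly decreasing line gives -1.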
A hypothesis test on Pearson r is used to determine whether the correlation coefficient is
statistically significant. A nonzero sample r can arise by chance even when the two variables are
unrelated in the population, so a hypothesis test for r is done to help decide whether the observed
r could have emerged by chance or not. In hypothesis testing, the sample correlation coefficient r
is used to estimate the true correlation coefficient that would be observed if all population
values were obtained. This true correlation coefficient is usually given the Greek symbol ρ, or rho.
The null hypothesis states that there is no significant correlation between the two
variables X and Y. That is, if ρ is the true correlation coefficient for the two variables X and Y
when all population values are observed, then the null hypothesis is Ho: ρ = 0.
The alternative hypothesis can take any one of three forms: Ha: ρ ≠ 0, Ha: ρ < 0, or Ha: ρ > 0.
Each form states that there is a significant correlation between the two variables X and Y.
The hypothesis test starts from the sample or observed correlation coefficient r, which is
converted to a t statistic that takes the sample size n into account. The formula of the t-test is:
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
Where the degrees of freedom (df) = n – 2.
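A minimal sketch of this test statistic follows. The function name `t_statistic` and the sample values r = 0.5, n = 27 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def t_statistic(r, n):
    """t statistic for testing Ho: rho = 0, with df = n - 2."""
    if n <= 2:
        raise ValueError("need more than 2 data pairs so that df = n - 2 > 0")
    if not -1.0 < r < 1.0:
        raise ValueError("the formula is undefined when r is exactly -1 or +1")
    return r * sqrt((n - 2) / (1 - r ** 2))


# Hypothetical values: r = 0.5 from n = 27 pairs, so df = 25.
t = t_statistic(0.5, 27)
print(f"df = {27 - 2}, t = {t:.4f}")
```

For the same r, a larger n produces a larger t, which is why a moderate correlation can be significant in a large sample and not in a small one.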
The null hypothesis is rejected at a specific level of significance if there is a significant
difference between the value of r and 0. Otherwise, if the value of r has no significant difference
from 0, the null hypothesis is not rejected at that level.
STEPS IN HYPOTHESIS TESTING
1. State the null hypothesis (Ho) and the alternative hypothesis (Ha).
2. Choose the level of significance α.
3. Select the appropriate test statistic and establish a critical region. Formulate decision rule.
4. Collect the data and compute the value of the test statistic from the sample data.
5. Make a decision. Reject Ho if the value of the test statistic is in the critical region; otherwise,
do not reject Ho.
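The decision in step 5 depends on which alternative hypothesis was stated in step 1. The sketch below shows one way to encode the critical region for the three forms of Ha. The function name `decide` and the labels "two-sided", "greater", and "less" are assumptions for illustration; the critical value is taken as an input (t at α/2 for the two-sided form, t at α for the one-sided forms).

```python
def decide(t_computed, t_critical, alternative="two-sided"):
    """Apply the decision rule for Ho: rho = 0.

    alternative: "two-sided" (Ha: rho != 0), "greater" (Ha: rho > 0),
                 or "less" (Ha: rho < 0).
    t_critical:  positive critical value read from a t table.
    """
    if t_critical <= 0:
        raise ValueError("t_critical must be a positive table value")

    if alternative == "two-sided":
        in_critical_region = abs(t_computed) > t_critical
    elif alternative == "greater":
        in_critical_region = t_computed > t_critical
    elif alternative == "less":
        in_critical_region = t_computed < -t_critical
    else:
        raise ValueError("unknown alternative: " + repr(alternative))

    return "reject Ho" if in_critical_region else "do not reject Ho"


# Hypothetical test statistics compared with table values for df = 8.
print(decide(3.0, 2.306))              # two-sided, alpha = 0.05
print(decide(-3.0, 1.860, "greater"))  # wrong tail for Ha: rho > 0
```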
EXAMPLE 4.21
A coach investigated the efficiency of physical training on the weight of his athletes. The
pre-training and post-training weights are listed below. Determine if (a) there is a relationship
between the two variables; (b) physical training is effective.
Pre-training weight (X)    67 84 65 69 72 74 66 69 73 80
Post-training weight (Y)   64 80 65 67 70 73 66 67 70 77
Solution (a) Arrange the data in Example 4.21 as shown in the table below.
Trainee     X    Y     XY     X²     Y²
A          67   64   4288   4489   4096
B          84   80   6720   7056   6400
C          65   65   4225   4225   4225
D          69   67   4623   4761   4489
E          72   70   5040   5184   4900
F          74   73   5402   5476   5329
G          66   66   4356   4356   4356
H          69   67   4623   4761   4489
I          73   70   5110   5329   4900
J          80   77   6160   6400   5929
Total     719  699  50547  52037  49113
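The table can be rebuilt from the raw weights, which is a quick way to check the column totals. This is a sketch; the variable names are chosen here for illustration.

```python
# Pre-training (X) and post-training (Y) weights from Example 4.21.
pre = [67, 84, 65, 69, 72, 74, 66, 69, 73, 80]
post = [64, 80, 65, 67, 70, 73, 66, 67, 70, 77]
trainees = "ABCDEFGHIJ"

# One row per trainee: name, X, Y, XY, X squared, Y squared.
rows = [(name, x, y, x * y, x * x, y * y)
        for name, x, y in zip(trainees, pre, post)]

# Column totals for X, Y, XY, X squared, Y squared.
totals = tuple(sum(row[i] for row in rows) for i in range(1, 6))

for row in rows:
    print(*row)
print("Total", *totals)
```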
Substitute the values into the formula of r.
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
$$r = \frac{10(50{,}547) - (719)(699)}{\sqrt{\left[10(52{,}037) - (719)^2\right]\left[10(49{,}113) - (699)^2\right]}}$$
r = 0.98392
The correlation coefficient 0.98392 tells that there is a very strong positive correlation
between the pre-training and post-training weights of the athletes.
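The substitution can be checked step by step from the table totals. A minimal sketch, with variable names chosen here for illustration:

```python
from math import sqrt

# Totals from the table in Example 4.21.
n = 10
sum_x, sum_y = 719, 699
sum_xy, sum_x2, sum_y2 = 50547, 52037, 49113

numerator = n * sum_xy - sum_x * sum_y   # 505470 - 502581 = 2889
x_term = n * sum_x2 - sum_x ** 2         # 520370 - 516961 = 3409
y_term = n * sum_y2 - sum_y ** 2         # 491130 - 488601 = 2529

r = numerator / sqrt(x_term * y_term)
print(round(r, 5))
```

Keeping the three intermediate quantities separate makes it easier to spot an arithmetic slip than evaluating the whole fraction at once.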
(b) Test the significance of the correlation. Use α = 0.05, n = 10, and r = 0.98392.
1. Ho: ρ = 0
Ha: ρ ≠ 0
2. Level of significance: α = 0.05
3. Critical value of the t-test (two-tailed):
df = n - 2 = 8; tα/2 = 2.306 (refer to the t distribution table)
Decision Rule: Reject Ho if |tc| > tα/2; otherwise, fail to reject Ho.
4. Computation:
$$t_c = r\sqrt{\frac{n-2}{1-r^2}} = 0.98392\sqrt{\frac{10-2}{1-(0.98392)^2}} \approx 15.58$$
5. Since tc is greater than tα/2 (15.58 > 2.306), reject Ho. There is a significant correlation
between the pre-training and post-training weights of the athletes. Moreover, the correlation is
very strong and positive. Therefore, the physical training is effective.
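The computation and decision in steps 4 and 5 can be reproduced in a few lines. This sketch takes r, n, and the critical value 2.306 from the example as given; the variable names are illustrative.

```python
from math import sqrt

r = 0.98392          # sample correlation from part (a)
n = 10               # number of data pairs
t_critical = 2.306   # two-tailed critical value for df = 8, alpha = 0.05

df = n - 2
t_computed = r * sqrt(df / (1 - r ** 2))

# Two-tailed decision rule: reject Ho when |t| exceeds the critical value.
reject = abs(t_computed) > t_critical
print(f"df = {df}, t = {t_computed:.2f}, reject Ho: {reject}")
```

The computed statistic is far above the critical value, so the decision to reject Ho does not depend on small rounding differences in r.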
CORRELATION AND CAUSATION
When two variables have a large positive or negative correlation with each other, there is often a
tendency to regard the two variables as causally related. However, any of the following
relationships may occur:
1. There is a direct cause-and-effect relationship between the variables; that is, x causes y.
2. There is a reverse cause-and-effect relationship between the variables; that is, y causes x.
3. The relationship between the variables may be caused by a third variable.
4. There may be a complexity of interrelationships among many variables.
5. The relationship may be coincidental.
A simple relationship (x, y) is composed of an independent or predictor variable and a
dependent or response variable. Regression analysis determines whether the independent
variable can be used to predict the dependent variable. It explores relationships that can be readily
described by straight lines or their generalization to many dimensions. When there is a single
continuous dependent variable and a single independent variable, the analysis is called a simple
linear regression analysis and can be expressed in this form:
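Y = a + bx
Where Y = predicted value of the dependent variable
a = the y-intercept:
$$a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right) = \bar{y} - b\bar{x}$$
b = slope of the line:
$$b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$$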
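The slope and intercept formulas translate directly into code. The following is a sketch only: the function name `fit_line` and the four sample points are hypothetical. Note that b must be computed first, because the formula for a uses it.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line Y = a + bx using the sum-based formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("slope is undefined when all x values are equal")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = sum_y / n - b * (sum_x / n)                 # intercept: y-bar - b * x-bar
    return a, b


# Hypothetical points lying exactly on Y = 1 + 2x.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```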
Given data points (xi, yi), a and b are chosen so that the corresponding line has the best fit
for the given data. The criterion for best fit used in regression analysis is that the sum of the
squared differences between the data points and the line itself, that is, the y deviations, is as
small as possible; thus, the closer the points are to the line, the better the fit and the
prediction will be.
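The formulas for a and b follow from this least-squares criterion. A brief derivation is sketched here, where S(a, b) is a label introduced only for this illustration to denote the sum of squared y deviations:

```latex
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

% Set both partial derivatives to zero (the normal equations):
\frac{\partial S}{\partial a} = -2 \sum \left( y_i - a - b x_i \right) = 0
  \;\Rightarrow\; \sum y = n a + b \sum x

\frac{\partial S}{\partial b} = -2 \sum x_i \left( y_i - a - b x_i \right) = 0
  \;\Rightarrow\; \sum xy = a \sum x + b \sum x^2

% Solve the first equation for a, substitute into the second, and multiply by n:
a = \frac{\sum y}{n} - b \, \frac{\sum x}{n},
\qquad
b = \frac{n \sum xy - \left( \sum x \right)\left( \sum y \right)}
         {n \sum x^2 - \left( \sum x \right)^2}
```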
The coefficient of determination (r²) is the number that expresses the proportion of the
total variation in the values of the dependent variable that is accounted for by the independent
variable. The coefficient of determination can be obtained by squaring the correlation coefficient r.
Coefficient of Determination = r² × 100%
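For a least-squares line, the share of variation in y explained by the line equals the square of r. The sketch below illustrates this with the weights from Example 4.21. It uses the deviation-from-the-mean form of the slope and of r, which is algebraically equivalent to the sum-based formulas in the text; the variable names are chosen here for illustration.

```python
from math import sqrt

# Weights from Example 4.21.
xs = [67, 84, 65, 69, 72, 74, 66, 69, 73, 80]
ys = [64, 80, 65, 67, 70, 73, 66, 67, 70, 77]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products of deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y

b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / sqrt(sxx * syy)

# Variation in y left over after fitting the line.
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained_share = 1 - unexplained / syy

print(f"r squared       = {r ** 2:.4f}")
print(f"explained share = {explained_share:.4f}")
```

Both printed values agree, which is why squaring r is enough to obtain the coefficient of determination in simple linear regression.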
EXAMPLE 4.22
From Example 4.21, use the equation of the regression line to predict the outcome of the
physical training (the post-training weight) for an athlete whose pre-training weight is 65, and
determine how much of the variation in the athletes' post-training weight is accounted for by the
variation in pre-training weight.
Trainee     X    Y     XY     X²
A          67   64   4288   4489
B          84   80   6720   7056
C          65   65   4225   4225
D          69   67   4623   4761
E          72   70   5040   5184
F          74   73   5402   5476
G          66   66   4356   4356
H          69   67   4623   4761
I          73   70   5110   5329
J          80   77   6160   6400
Total     719  699  50547  52037
Solution:
Slope of the regression line
$$b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(50{,}547) - (719)(699)}{10(52{,}037) - (719)^2} = 0.8475$$
y-intercept
$$a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right) = \frac{699}{10} - (0.8475)\left(\frac{719}{10}\right) = 8.965$$
Regression equation:
Y = a + bx = 8.965 + 0.8475x
Solve for the predicted post-training weight if the athlete's pre-training weight is 65.
Y = 8.965 + 0.8475(65) = 64.0525
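The slope, intercept, and prediction can be recomputed from the table totals as a check. This is a sketch with illustrative variable names; unlike the hand computation, it keeps b unrounded when computing a.

```python
# Totals from the table in Example 4.22.
n = 10
sum_x, sum_y = 719, 699
sum_xy, sum_x2 = 50547, 52037

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * (sum_x / n)                               # y-intercept

x_new = 65                  # pre-training weight to predict from
predicted = a + b * x_new   # predicted post-training weight

print(f"b = {b:.4f}, a = {a:.3f}, Y({x_new}) = {predicted:.2f}")
```

Because b is not rounded to 0.8475 before it is used, the intercept comes out near 8.967 instead of 8.965. The prediction at x = 65 is still about 64.05, so the rounding does not change the conclusion.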
Calculating the coefficient of determination:
r² = (0.98392)² × 100% = 96.81%
∴ From the regression line, an athlete with a pre-training weight of 65 is predicted to have a
post-training weight of about 64, and 96.81% of the variation in post-training weight is accounted
for by the variation in pre-training weight. The remaining 3.19% is due to other variables; this
proportion is called the coefficient of nondetermination.
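The two percentages are complements of each other, as the arithmetic below shows:

```latex
r^2 = (0.98392)^2 \approx 0.96810 \;\Rightarrow\; 96.81\%

1 - r^2 \approx 1 - 0.96810 = 0.03190 \;\Rightarrow\; 3.19\%
```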
CRITICAL VALUES OF t
The table shows the upper-tail area: each entry is the value t* for which P(t > t*) = p.
For example, with 10 degrees of freedom, P(t > 1.81246) = .05.
Tail Probability p
df 0.25 0.2 0.15 0.1 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
1 1.000 1.376 1.963 3.078 6.314 12.706 15.895 31.821 63.657 127.321 318.309 636.619
2 0.816 1.061 1.386 1.886 2.920 4.303 4.849 6.965 9.925 14.089 22.327 31.599
3 0.765 0.978 1.250 1.638 2.353 3.182 3.482 4.541 5.841 7.453 10.215 12.924
4 0.741 0.941 1.190 1.533 2.132 2.776 2.999 3.747 4.604 5.598 7.173 8.610
5 0.727 0.920 1.156 1.476 2.015 2.571 2.757 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.440 1.943 2.447 2.612 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.517 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.860 2.306 2.449 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.100 1.383 1.833 2.262 2.398 2.821 3.250 3.690 4.297 4.781
10 0.700 0.879 1.093 1.372 1.812 2.228 2.359 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.328 2.718 3.106 3.497 4.025 4.437
12 0.695 0.873 1.083 1.356 1.782 2.179 2.303 2.681 3.055 3.428 3.930 4.318
13 0.694 0.870 1.079 1.350 1.771 2.160 2.282 2.650 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.264 2.624 2.977 3.326 3.787 4.140
15 0.691 0.866 1.074 1.341 1.753 2.131 2.249 2.602 2.947 3.286 3.733 4.073
16 0.690 0.865 1.071 1.337 1.746 2.120 2.235 2.583 2.921 3.252 3.686 4.015
17 0.689 0.863 1.069 1.333 1.740 2.110 2.224 2.567 2.898 3.222 3.646 3.965
18 0.688 0.862 1.067 1.330 1.734 2.101 2.214 2.552 2.878 3.197 3.610 3.922
19 0.688 0.861 1.066 1.328 1.729 2.093 2.205 2.539 2.861 3.174 3.579 3.883
20 0.687 0.860 1.064 1.325 1.725 2.086 2.197 2.528 2.845 3.153 3.552 3.850
21 0.686 0.859 1.063 1.323 1.721 2.080 2.189 2.518 2.831 3.135 3.527 3.819
22 0.686 0.858 1.061 1.321 1.717 2.074 2.183 2.508 2.819 3.119 3.505 3.792
23 0.685 0.858 1.060 1.319 1.714 2.069 2.177 2.500 2.807 3.104 3.485 3.768
24 0.685 0.857 1.059 1.318 1.711 2.064 2.172 2.492 2.797 3.091 3.467 3.745
25 0.684 0.856 1.058 1.316 1.708 2.060 2.167 2.485 2.787 3.078 3.450 3.725
26 0.684 0.856 1.058 1.315 1.706 2.056 2.162 2.479 2.779 3.067 3.435 3.707
27 0.684 0.855 1.057 1.314 1.703 2.052 2.158 2.473 2.771 3.057 3.421 3.690
28 0.683 0.855 1.056 1.313 1.701 2.048 2.154 2.467 2.763 3.047 3.408 3.674
29 0.683 0.854 1.055 1.311 1.699 2.045 2.150 2.462 2.756 3.038 3.396 3.659
30 0.683 0.854 1.055 1.310 1.697 2.042 2.147 2.457 2.750 3.030 3.385 3.646
40 0.681 0.851 1.050 1.303 1.684 2.021 2.123 2.423 2.704 2.971 3.307 3.551
50 0.679 0.849 1.047 1.299 1.676 2.009 2.109 2.403 2.678 2.937 3.261 3.496
60 0.679 0.848 1.045 1.296 1.671 2.000 2.099 2.390 2.660 2.915 3.232 3.460
80 0.678 0.846 1.043 1.292 1.664 1.990 2.088 2.374 2.639 2.887 3.195 3.416
100 0.677 0.845 1.042 1.290 1.660 1.984 2.081 2.364 2.626 2.871 3.174 3.390
1000 0.675 0.842 1.037 1.282 1.646 1.962 2.056 2.330 2.581 2.813 3.098 3.300
10000 0.675 0.842 1.036 1.282 1.645 1.960 2.054 2.327 2.576 2.808 3.091 3.291
100000 0.674 0.842 1.036 1.282 1.645 1.960 2.054 2.326 2.576 2.807 3.090 3.291
50% 60% 70% 80% 90% 95% 96% 98% 99.0% 99.5% 99.8% 99.9%
Confidence level C
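Reading the table for a two-tailed test means looking up the column for α/2, while a one-tailed test uses the column for α. The sketch below encodes only a small excerpt of the table (df = 8, 10, and 30, four columns). The names `T_TABLE` and `critical_t` are hypothetical.

```python
# Excerpt of the table above: {df: {upper-tail probability p: critical t}}.
T_TABLE = {
    8:  {0.05: 1.860, 0.025: 2.306, 0.01: 2.896, 0.005: 3.355},
    10: {0.05: 1.812, 0.025: 2.228, 0.01: 2.764, 0.005: 3.169},
    30: {0.05: 1.697, 0.025: 2.042, 0.01: 2.457, 0.005: 2.750},
}


def critical_t(df, alpha, two_tailed=True):
    """Look up the critical t value for a given df and significance level."""
    if df not in T_TABLE:
        raise ValueError(f"df = {df} is not in this excerpt of the table")

    # A two-tailed test splits alpha evenly between the two tails.
    tail = alpha / 2 if two_tailed else alpha

    # Compare with a tolerance so float round-off cannot cause a missed match.
    for p, value in T_TABLE[df].items():
        if abs(p - tail) < 1e-9:
            return value
    raise ValueError(f"tail probability {tail} is not in this excerpt")


print(critical_t(8, 0.05))                     # two-tailed, as in Example 4.21
print(critical_t(10, 0.05, two_tailed=False))  # one-tailed
```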
