Correlation and Regression
Discussion
When comparing two different variables, there are two questions to be considered: “Is
there a relationship between the two variables?” and “How strong is that relationship?” These
questions can be answered using regression and correlation. Regression answers whether there is a
relationship, and correlation answers how strong the linear relationship is. A scatter plot is a
graphical representation of the relation between two or more variables. In the scatter plot of two
variables x and y, each point on the plot is an x-y pair. An inspection of a scatter plot can give an
impression of whether two variables are related and the direction of their relationship.
Since correlation is an effect size, we can describe the strength of the correlation using
the guide that Evans (1996) suggested for the absolute value of r:
• .00-.19 very weak or negligible correlation
• .20-.39 weak or low correlation
• .40-.59 moderate correlation
• .60-.79 strong or moderately high correlation
• .80-1.0 very strong or high correlation
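The guide above can be written as a small lookup. This is a minimal sketch, not part of the original text: the function name `evans_strength` is hypothetical, and because the printed bands leave gaps (for example between .19 and .20), the code assumes each band runs up to the next band's lower bound.

```python
def evans_strength(r):
    """Return the verbal label for |r| under the guide attributed to Evans (1996)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # the guide applies to the absolute value of r
    # Assumption: a value such as 0.195 falls in the lower band ("very weak").
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


print(evans_strength(-0.45))  # moderate
print(evans_strength(0.85))   # very strong
```

The sign of r is ignored here on purpose: the label describes strength only, while the sign gives the direction of the relationship.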
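The sample correlation coefficient r is computed from n paired observations as:

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$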
The null hypothesis states that there is no significant correlation between the two
variables X and Y. That is, if ρ is the true correlation coefficient for the two variables X and Y
when all population values are observed, then the null hypothesis is Ho: ρ = 0.
The alternative hypothesis can take any one of three forms: Ha: ρ ≠ 0, ρ < 0, or ρ > 0.
It states that there is a significant correlation between the two variables X and Y.
The test statistic for the hypothesis test is based on the sample (observed) correlation
coefficient r. A t-test is used to assess whether r, computed from a sample of size n, differs
significantly from zero. The t statistic, with n − 2 degrees of freedom, is:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

The test proceeds in five steps:
1. State the null hypothesis (Ho) and the alternative hypothesis (Ha).
2. Choose the level of significance α.
3. Select the appropriate test statistic and establish a critical region. Formulate the decision rule.
4. Collect the data and compute the value of the test statistic from the sample data.
5. Make a decision. Reject Ho if the value of the test statistic is in the critical region; otherwise,
do not reject Ho.
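The five steps above can be sketched as one function. This is an illustrative outline under stated assumptions: the name `correlation_t_test`, its parameters, and the sample inputs (r = 0.60, n = 12) are hypothetical, and the critical value is passed in by the caller after reading it from a t table at n − 2 degrees of freedom.

```python
import math


def correlation_t_test(r, n, t_critical, alternative="two-sided"):
    """Test Ho: rho = 0 and return (t, reject_ho).

    t_critical is the positive table value for df = n - 2: look it up at
    alpha / 2 for a two-sided test, or at alpha for a one-sided test.
    """
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")

    # Step 4: compute the test statistic t = r * sqrt((n - 2) / (1 - r^2)).
    t = r * math.sqrt((n - 2) / (1.0 - r * r))

    # Step 5: compare t with the critical region implied by Ha.
    if alternative == "two-sided":      # Ha: rho != 0
        reject_ho = abs(t) > t_critical
    elif alternative == "greater":      # Ha: rho > 0
        reject_ho = t > t_critical
    elif alternative == "less":         # Ha: rho < 0
        reject_ho = t < -t_critical
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return t, reject_ho


# Hypothetical inputs: r = 0.60 from n = 12 pairs, alpha = 0.05, two-sided.
# df = 10, so the critical value at tail probability 0.025 is 2.228.
t, reject_ho = correlation_t_test(0.60, 12, 2.228)
print(round(t, 3), reject_ho)  # 2.372 True
```

Steps 1 to 3 (stating the hypotheses, choosing α, and fixing the critical region) happen before any data are examined, which is why they appear here only as arguments.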
EXAMPLE 4.21
A coach investigated the efficiency of physical training on the weight of his athletes. The
pre-training (X) and post-training (Y) weights are listed below. Determine whether (a) there is a
relationship between the two variables and (b) the physical training is effective.
Trainee X Y XY X² Y²
A 67 64 4288 4489 4096
B 84 80 6720 7056 6400
C 65 65 4225 4225 4225
D 69 67 4623 4761 4489
E 72 70 5040 5184 4900
F 74 73 5402 5476 5329
G 66 66 4356 4356 4356
H 69 67 4623 4761 4489
I 73 70 5110 5329 4900
J 80 77 6160 6400 5929
Total 719 699 50547 52037 49113
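(a) Correlation coefficient:

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{10(50547) - (719)(699)}{\sqrt{\left[10(52037) - (719)^2\right]\left[10(49113) - (699)^2\right]}} = \frac{2889}{\sqrt{(3409)(2529)}} $$

r = 0.98392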
The correlation coefficient 0.98392 indicates that there is a very strong positive correlation
between the pre-training and post-training weights of the athletes.
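As a check on the arithmetic, the coefficient can be recomputed from the raw weights in the table. This is a minimal sketch: the function name `pearson_r` and the list names `pre` and `post` are illustrative choices, not part of the original example.

```python
import math

# Pre-training (X) and post-training (Y) weights of trainees A to J.
pre = [67, 84, 65, 69, 72, 74, 66, 69, 73, 80]
post = [64, 80, 65, 67, 70, 73, 66, 67, 70, 77]


def pearson_r(xs, ys):
    """Correlation coefficient from the column sums, as in the formula for r."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(pre, post)
print(round(r, 5))  # 0.98392
```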
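(b) Test the significance of the correlation. Use α = 0.05, n = 10, and r = 0.98392.
1. Ho: ρ = 0
Ha: ρ ≠ 0
2. Level of significance: α = 0.05
3. Test statistic: t with df = n − 2 = 8. For a two-tailed test the critical value is tα/2 = 2.306.
Reject Ho if |tc| > 2.306.
4. Computation:

$$ t_c = r\sqrt{\frac{n-2}{1-r^2}} = 0.98392\sqrt{\frac{10-2}{1-(0.98392)^2}} \approx 15.581 $$

5. Since tc is greater than tα/2 (15.581 > 2.306), reject Ho. There is a significant correlation
between the pre-training and post-training weights of the athletes, and the correlation is very
strong and positive. Taking this as the criterion, the physical training is judged to be effective.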
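The computation in step 4 can be reproduced directly from r and n. The sketch below uses only the values given in the example, with variable names chosen for illustration. Evaluated in floating point, the statistic comes out near 15.58, far above the critical value.

```python
import math

r, n = 0.98392, 10   # values from the example
t_critical = 2.306   # t table, df = n - 2 = 8, tail probability 0.025

# t = r * sqrt((n - 2) / (1 - r^2))
t_computed = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-tailed decision rule: reject Ho when |t| exceeds the critical value.
reject_ho = abs(t_computed) > t_critical
print(round(t_computed, 2), reject_ho)  # 15.58 True
```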
When two variables have a large positive or negative correlation with each other, there is often a
tendency to regard the two variables as causally related. However, any of the following relationships may hold:
1. There is a direct cause-and-effect relationship between the variables; that is, x causes y.
2. There is a reverse cause-and-effect relationship between the variables; that is, y causes x.
3. The relationship between the variables may be caused by a third variable.
4. There may be a complex interrelationship among many variables.
5. The relationship may be coincidental.
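The equation of the regression line is

$$ Y = a + bx $$

where a is the y-intercept and b is the slope of the line:

$$ a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right) = \bar{y} - b\bar{x} $$

$$ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} $$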
Given data points (xi, yi), a and b are chosen so that the corresponding line gives the best
fit to the data. The criterion for best fit used in regression analysis is the sum of the squared
differences between the data points and the line, that is, the squared y deviations, which is made
as small as possible; the closer the points are to the line, the better the fit and the prediction will be.
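The least-squares criterion can be illustrated with a short sketch. The function names `fit_line` and `sum_squared_error` and the five data points are hypothetical, chosen only so the arithmetic is easy to follow. The fitted line is compared with a slightly shifted line to show that the fit has the smaller sum of squared y deviations.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line Y = a + b*x using the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * (sum_x / n)                               # y-intercept
    return a, b


def sum_squared_error(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not taken from the examples in this section.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
best = sum_squared_error(xs, ys, a, b)
shifted = sum_squared_error(xs, ys, a - 0.2, b)  # same slope, lower intercept

print(round(a, 2), round(b, 2))          # 2.2 0.6
print(round(best, 2), round(shifted, 2))  # 2.4 2.6
```

Any other choice of a and b gives a sum of squared errors at least as large as the fitted line's, which is what "best fit" means under this criterion.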
The coefficient of determination (r²) is the number that expresses the proportion of the
total variation in the values of the dependent variable that is explained by the independent
variable. It is obtained by squaring the correlation coefficient r.
EXAMPLE 4.22
Using the data from Example 4.21, use the equation of the regression line to predict the result of
the physical training for an athlete whose pre-training weight is 65, and determine how much of the
variation in the athletes' post-training weights is explained by the variation in their pre-training weights.
Trainee X Y XY X²
A 67 64 4288 4489
B 84 80 6720 7056
C 65 65 4225 4225
D 69 67 4623 4761
E 72 70 5040 5184
F 74 73 5402 5476
G 66 66 4356 4356
H 69 67 4623 4761
I 73 70 5110 5329
J 80 77 6160 6400
Total 719 699 50547 52037
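Solution:
Slope:

$$ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(50547) - (719)(699)}{10(52037) - (719)^2} = \frac{2889}{3409} \approx 0.8475 $$

y-intercept:

$$ a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right) = \frac{699}{10} - (0.8475)\left(\frac{719}{10}\right) = 8.965 $$

Regression equation:

$$ Y = 8.965 + 0.8475x $$

Solve for the efficiency of the physical training if the standard athlete’s weight is 65:

$$ Y = 8.965 + 0.8475(65) \approx 64.05 $$

Coefficient of determination:

$$ r^2 = (0.98392)^2 \approx 0.9681 $$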
∴ From the physical training given, an athlete whose weight is 65 can be expected to reach a weight
of about 64, and 96.81% of the variation in the post-training weights is accounted for by the
variation in the pre-training weights. The remaining 3.19% is caused by other variables; this
proportion is called the coefficient of alienation.
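The figures in this example can be recomputed from the column totals. The sketch below is illustrative, with variable names of my own choosing. It takes ΣY² = 49113 from the table in Example 4.21, since the table here omits that column. It carries full precision throughout, so the intercept comes out as about 8.967 rather than 8.965; the worked solution rounds b to 0.8475 before computing a.

```python
# Column totals for the ten trainees.
n = 10
sum_x, sum_y = 719, 699
sum_xy, sum_x2 = 50547, 52037
sum_y2 = 49113  # from the Y^2 column of Example 4.21

s_xy = n * sum_xy - sum_x * sum_y   # 2889
s_xx = n * sum_x2 - sum_x ** 2      # 3409
s_yy = n * sum_y2 - sum_y ** 2      # 2529

b = s_xy / s_xx                      # slope
a = sum_y / n - b * (sum_x / n)      # y-intercept
predicted = a + b * 65               # predicted Y when x = 65
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(round(b, 4), round(a, 3))      # 0.8475 8.967
print(round(predicted, 2))           # 64.05
print(round(100 * r_squared, 2))     # 96.81
print(round(100 * (1 - r_squared), 2))  # 3.19
```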
Critical Values of t
Tail Probability p
df 0.25 0.2 0.15 0.1 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
1 1.000 1.376 1.963 3.078 6.314 12.706 15.895 31.821 63.657 127.321 318.309 636.619
2 0.816 1.061 1.386 1.886 2.920 4.303 4.849 6.965 9.925 14.089 22.327 31.599
3 0.765 0.978 1.250 1.638 2.353 3.182 3.482 4.541 5.841 7.453 10.215 12.924
4 0.741 0.941 1.190 1.533 2.132 2.776 2.999 3.747 4.604 5.598 7.173 8.610
5 0.727 0.920 1.156 1.476 2.015 2.571 2.757 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.440 1.943 2.447 2.612 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.517 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.860 2.306 2.449 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.100 1.383 1.833 2.262 2.398 2.821 3.250 3.690 4.297 4.781
10 0.700 0.879 1.093 1.372 1.812 2.228 2.359 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.328 2.718 3.106 3.497 4.025 4.437
12 0.695 0.873 1.083 1.356 1.782 2.179 2.303 2.681 3.055 3.428 3.930 4.318
13 0.694 0.870 1.079 1.350 1.771 2.160 2.282 2.650 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.264 2.624 2.977 3.326 3.787 4.140
15 0.691 0.866 1.074 1.341 1.753 2.131 2.249 2.602 2.947 3.286 3.733 4.073
16 0.690 0.865 1.071 1.337 1.746 2.120 2.235 2.583 2.921 3.252 3.686 4.015
17 0.689 0.863 1.069 1.333 1.740 2.110 2.224 2.567 2.898 3.222 3.646 3.965
18 0.688 0.862 1.067 1.330 1.734 2.101 2.214 2.552 2.878 3.197 3.610 3.922
19 0.688 0.861 1.066 1.328 1.729 2.093 2.205 2.539 2.861 3.174 3.579 3.883
20 0.687 0.860 1.064 1.325 1.725 2.086 2.197 2.528 2.845 3.153 3.552 3.850
21 0.686 0.859 1.063 1.323 1.721 2.080 2.189 2.518 2.831 3.135 3.527 3.819
22 0.686 0.858 1.061 1.321 1.717 2.074 2.183 2.508 2.819 3.119 3.505 3.792
23 0.685 0.858 1.060 1.319 1.714 2.069 2.177 2.500 2.807 3.104 3.485 3.768
24 0.685 0.857 1.059 1.318 1.711 2.064 2.172 2.492 2.797 3.091 3.467 3.745
25 0.684 0.856 1.058 1.316 1.708 2.060 2.167 2.485 2.787 3.078 3.450 3.725
26 0.684 0.856 1.058 1.315 1.706 2.056 2.162 2.479 2.779 3.067 3.435 3.707
27 0.684 0.855 1.057 1.314 1.703 2.052 2.158 2.473 2.771 3.057 3.421 3.690
28 0.683 0.855 1.056 1.313 1.701 2.048 2.154 2.467 2.763 3.047 3.408 3.674
29 0.683 0.854 1.055 1.311 1.699 2.045 2.150 2.462 2.756 3.038 3.396 3.659
30 0.683 0.854 1.055 1.310 1.697 2.042 2.147 2.457 2.750 3.030 3.385 3.646
40 0.681 0.851 1.050 1.303 1.684 2.021 2.123 2.423 2.704 2.971 3.307 3.551
50 0.679 0.849 1.047 1.299 1.676 2.009 2.109 2.403 2.678 2.937 3.261 3.496
60 0.679 0.848 1.045 1.296 1.671 2.000 2.099 2.390 2.660 2.915 3.232 3.460
80 0.678 0.846 1.043 1.292 1.664 1.990 2.088 2.374 2.639 2.887 3.195 3.416
100 0.677 0.845 1.042 1.290 1.660 1.984 2.081 2.364 2.626 2.871 3.174 3.390
1000 0.675 0.842 1.037 1.282 1.646 1.962 2.056 2.330 2.581 2.813 3.098 3.300
10000 0.675 0.842 1.036 1.282 1.645 1.960 2.054 2.327 2.576 2.808 3.091 3.291
100000 0.674 0.842 1.036 1.282 1.645 1.960 2.054 2.326 2.576 2.807 3.090 3.291
50% 60% 70% 80% 90% 95% 96% 98% 99.0% 99.5% 99.8% 99.9%
Confidence level C
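Reading the table can also be sketched in code. The dictionary below copies only a few rows and columns from the table above; the names `T_TABLE` and `critical_t` are hypothetical. It reflects how the columns are used: a two-tailed test at level α reads the column for tail probability p = α/2, a one-tailed test reads p = α, and the confidence level in the bottom row is C = 1 − 2p.

```python
import math

# Excerpt of the table: df -> {tail probability p: critical value of t}.
T_TABLE = {
    8:  {0.05: 1.860, 0.025: 2.306, 0.01: 2.896, 0.005: 3.355},
    10: {0.05: 1.812, 0.025: 2.228, 0.01: 2.764, 0.005: 3.169},
    20: {0.05: 1.725, 0.025: 2.086, 0.01: 2.528, 0.005: 2.845},
}


def critical_t(df, alpha, two_tailed=True):
    """Look up the critical t for the given df and significance level."""
    tail_p = alpha / 2 if two_tailed else alpha
    for p, value in T_TABLE[df].items():
        if math.isclose(p, tail_p):
            return value
    raise KeyError(f"tail probability {tail_p} is not in this excerpt")


print(critical_t(8, 0.05))                    # 2.306 (two-tailed, p = 0.025)
print(critical_t(8, 0.05, two_tailed=False))  # 1.86  (one-tailed, p = 0.05)
```

Rows or columns missing from the excerpt raise a `KeyError` rather than interpolating, because the printed table does not state how intermediate values should be obtained.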