Introduction to
Data Science
Module 2
Week 2
Review of Descriptive and Inferential Statistics
Data Processing and Visualization with R
Module Objectives
At the end of this module, students must be able to:
1. Differentiate the two areas of statistics: descriptive and
inferential;
2. Perform simple linear regression in Excel and in R along with
pertinent visual output;
3. Perform multiple linear regression in Excel and in R along with
pertinent visual output.
Statistics Refresher
STATISTICS
DESCRIPTIVE
• Collection
• Organization
• Presentation
INFERENTIAL
• Draw conclusions for a larger group/data
• Determine relationships
• Make predictions
Statistics Refresher
STATISTICS
DESCRIPTIVE
INFERENTIAL
• Probability
• Estimation: Point, Interval
• Hypothesis Testing
The Process of Statistics
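POPULATION (described by a PARAMETER)
SAMPLE (described by a STATISTIC)
Sampling Theory: how a sample is drawn from the population
Descriptive Statistics: summarize the sample
Inferential Statistics: use the sample statistic to draw conclusions about the population parameter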
Stat Refresher: Regression Analysis
Regression Analysis:
A statistical technique most frequently used to analyze the
relationship between two or more variables.
At least two variables need to be continuous.
It deals with the way one variable tends to change as one or
more other variables change.
Example
• Input the data
• Create a scatter plot
• Add trend line
When to use regression?
Regression analysis is used to describe the relationship between:
A single response variable Y; and
One or more predictor variables: 𝑋1,𝑋2,…,𝑋𝑝
p = 1 : Simple regression
p > 1 : Multiple regression
Examples:
how sales (Y) vary with advertising expenditures (X)
how quantity demanded (Y) varies with prices (X)
relationship between corporate profit (Y) and R&D spending (X)
The Variables
Response Variable
- The response variable Y must be a continuous variable.
Predictor Variables
- The predictors 𝑋1,𝑋2,…,𝑋𝑝 can be continuous, discrete or
categorical variables.
Initial EDA
Prior to any regression modelling, the data should always be
inspected for:
Data-entry errors
Missing values
Outliers
Unusual (e.g., asymmetric) distributions
Changes in variability
Clustering
Non-linear bivariate relationships
Unexpected patterns
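Some of these checks can be automated before any model is fitted. The Python sketch below (standard library only) counts missing values and flags outlier candidates with the common 1.5 × IQR rule. The function name `summarize_column` and the height values are hypothetical and made up for illustration; the IQR rule is one convention among several, not a definitive test.

```python
import math
import statistics


def summarize_column(values):
    """Report missing values and IQR-based outlier candidates for one numeric column."""
    # Treat None and NaN as missing.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n_missing = len(values) - len(present)

    # Quartiles of the non-missing values.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return {
        "n": len(values),
        "missing": n_missing,
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical heights in cm (illustrative only).
# 1720.0 mimics a data-entry error (172.0 with a misplaced decimal point).
heights = [160.0, 165.5, None, 172.0, 158.0, 1720.0, 169.0, 175.5]
report = summarize_column(heights)
print(report)
```

A large gap between the mean and the median, as in this made-up column, is a quick hint that an outlier or a skewed distribution is present.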
Simple Linear Regression
The Variables
X : explanatory variable (horizontal axis)
Y : response variable (vertical axis)
After data collection, we have pairs of observations:
(𝑋1,𝑌1),…,(𝑋𝑛,𝑌𝑛)
Sample Data 1
Variables: X (Height), Y (Weight)
We want to be able to describe the weight as a linear function
of height
Sample Data 1
$\text{Weight} \approx \beta_0 + \beta_1 \cdot \text{Height}$
The regression of variable Y on variable X is given by:
$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1, \dots, n$
where:
Random Error : $\epsilon_i \sim N(0, \sigma^2)$, independent and identically distributed
Linear Function : $\beta_0 + \beta_1 X_i = E(Y \mid X = X_i)$
Unknown Parameters:
$\beta_0$ (intercept) : point at which the line intercepts the y-axis;
$\beta_1$ (slope) : change in Y per unit increase in X.
Estimation of Unknown Parameters I
We want to find the equation of the line that “best” fits the data.
This means finding $b_0$ and $b_1$ such that the fitted values of $y_i$, given by:
$\hat{y}_i = b_0 + b_1 x_i$
are as “close” as possible to the observed values $y_i$.
Residuals
The difference between the observed value $y_i$ and the fitted value
$\hat{y}_i$ is called the residual and is given by:
$e_i = y_i - \hat{y}_i$
Estimation of Unknown Parameters II
Least Squares Method
A usual way of calculating $b_0$ and $b_1$ is to minimize the sum
of the squared residuals, or residual sum of squares (RSS):
$RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$
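Minimizing the RSS, the sum of $(y_i - b_0 - b_1 x_i)^2$ over all observations, has a well-known closed-form solution: $b_1 = S_{xy} / S_{xx}$ and $b_0 = \bar{y} - b_1 \bar{x}$, where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$. The Python sketch below applies these formulas using only the standard library. The function names and the height/weight values are hypothetical, made up for illustration, and are not the module's sample data.

```python
def fit_simple_linear_regression(x, y):
    """Return the least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def residual_sum_of_squares(x, y, b0, b1):
    """Sum of squared residuals e_i = y_i - (b0 + b1 * x_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: height in cm, weight in kg (illustrative only).
height = [150.0, 160.0, 170.0, 180.0, 190.0]
weight = [52.0, 58.0, 67.0, 74.0, 84.0]

b0, b1 = fit_simple_linear_regression(height, weight)
rss = residual_sum_of_squares(height, weight, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, RSS = {rss:.3f}")
```

Any other choice of intercept and slope gives a larger RSS on the same data, which is what "best fit" means here. In R, the same estimates would typically come from `lm(weight ~ height)`, assuming vectors with those names.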
Multiple Regression
From SLR to MLR
A dependent variable is rarely explained by exactly one variable.
We use multiple regression to predict the dependent variable
using more than one independent variable.
Multiple regression can be linear or nonlinear. We use
Multiple Linear Regression (MLR) for explanation, prediction, and
inference.
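With p predictors, the linear model extends to $Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \epsilon_i$, and the least-squares estimates solve the normal equations $(X^T X) b = X^T y$, where $X$ is the design matrix with a leading column of ones. The Python sketch below is one minimal way to do this with the standard library only. The function names are hypothetical, and the data are constructed without noise from known coefficients (2, 0.5, 1.5) so that the fit can be checked by eye; they are not real observations.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple_linear_regression(rows, y):
    """Return [b0, b1, ..., bp] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Constructed, noise-free data: y = 2 + 0.5 * x1 + 1.5 * x2 (illustrative only).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
y = [2.0 + 0.5 * x1 + 1.5 * x2 for x1, x2 in rows]

coefficients = fit_multiple_linear_regression(rows, y)
print([round(c, 6) for c in coefficients])
```

Solving the normal equations directly is fine for small, well-conditioned problems like this one; statistical software generally relies on more numerically stable decompositions.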
Sample Data 1
Example: Advertising
1. Perform SLR on each predictor variable.
2. Interpret the results.
3. Perform MLR.
4. Interpret the results.
Predicting Values in MLR
1. What are the predicted sales when TV = 115, Radio = 45,
Newspaper = 41?
2. What are the predicted sales when TV = 195, Radio = 62,
Newspaper = 155?
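Once the MLR coefficients have been estimated, a prediction is the fitted equation evaluated at the new predictor values: predicted sales = $b_0 + b_1 \cdot TV + b_2 \cdot Radio + b_3 \cdot Newspaper$. The Python sketch below shows only this arithmetic. The function name `predict_sales` is hypothetical, and the coefficients are placeholders, not estimates from the Advertising data, so the printed numbers are not the answers to the two questions; substitute the coefficients from your own Excel or R output.

```python
def predict_sales(coefficients, tv, radio, newspaper):
    """Evaluate b0 + b1*TV + b2*Radio + b3*Newspaper for one observation."""
    b0, b_tv, b_radio, b_newspaper = coefficients
    return b0 + b_tv * tv + b_radio * radio + b_newspaper * newspaper


# PLACEHOLDER coefficients (intercept, TV, Radio, Newspaper).
# These are NOT fitted values; replace them with your own regression output.
placeholder_coefficients = (1.0, 0.5, 0.25, 0.125)

# Predictor values taken from the two questions.
queries = [
    {"tv": 115, "radio": 45, "newspaper": 41},
    {"tv": 195, "radio": 62, "newspaper": 155},
]

predictions = [predict_sales(placeholder_coefficients, **q) for q in queries]
for q, p in zip(queries, predictions):
    print(q, "->", p)
```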