Unit-III (Data Analytics)
Regression – Concepts, Blue property assumptions, Least Square Estimation,
Variable Rationalization, and Model Building etc. Logistic Regression: Model
Theory, Model fit Statistics, Model Construction, Analytics applications to
various Business Domains etc.
Covariance:
Covariance is a measure of how much two random variables vary together. It’s
similar to variance, but where variance tells you how a single variable varies,
covariance tells you how two variables vary together.
A positive covariance would indicate a positive linear relationship between the
variables, and a negative covariance would indicate the opposite.
The sample covariance is calculated as:

Cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)

Where:
Xi – the values of the X-variable
Yi – the values of the Y-variable
X̄ – the mean (average) of the X-variable
Ȳ – the mean (average) of the Y-variable
n – the number of data points
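The formula above can be sketched in plain Python; this is a minimal illustration, assuming the sample divisor (n − 1) as in the formula:

```python
# Sample covariance: how two variables vary together.
def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum the products of each pair's deviations from the two means
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(covariance(x, y))  # 1.5 -> positive: the variables tend to rise together
```

A positive result indicates a positive linear relationship, a negative result the opposite.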
The simple linear regression model is:

y = β0 + β1x + ε

Where:
Variables:
x = Independent Variable (we provide this)
y = Dependent Variable (we observe this)
Parameters:
β0 = Intercept
The y-intercept of a line is the point at which the line crosses the y-axis (i.e.
where the x value equals 0).
β1 = Slope
The change in the mean of Y for a unit change in X.
ε = Residuals
A residual is the discrepancy between an actual and a predicted value. The
distance of the plotted points from the line gives the residual value.
Consider examples: a positive relation (y tends to increase as x increases) and a
negative relation (y tends to decrease as x increases).
Constructing a regression model for the following samples:

x: 1 2 3 4 5
y: 2 4 5 4 5
The procedure for finding the best-fitting line is called the least squares
method: it chooses the parameters that minimize the sum of squared residuals
(the distances of the plotted points from the line).
The RMSE is the square root of the variance of the residuals. Lower values
of RMSE indicate better fit. RMSE is a good measure of how accurately the model
predicts the response.
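A minimal least-squares fit of the samples above, computing the slope and intercept from the standard normal-equation formulas and then the RMSE:

```python
import math

# Least-squares fit for the sample data above (no libraries needed).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: covariance of x and y over variance of x; intercept from the means
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

pred = [b0 + b1 * xi for xi in x]
rmse = math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / n)
print(round(b0, 2), b1, round(rmse, 4))  # 2.2 0.6 0.6928
```

The fitted line for this data is y = 2.2 + 0.6x; the RMSE summarizes how far the observed points sit from that line.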
R Square Method – Goodness of Fit
R-Squared (R² or the coefficient of determination) is a statistical measure in a
regression model that determines the proportion of variance in the dependent
variable that can be explained by the independent variable. In other words,
R-squared shows how well the data fit the regression model (the goodness of fit).
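R-squared can be computed as 1 minus the ratio of unexplained to total variation; a sketch using the line y = 2.2 + 0.6x, the least-squares fit of the earlier samples:

```python
# R-squared (coefficient of determination) for a fitted line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
mean_y = sum(y) / len(y)

pred = [2.2 + 0.6 * xi for xi in x]                        # fitted values
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))    # unexplained variation
ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.6 -> the line explains 60% of the variance in y
```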
Unbiasedness:
Formally, an estimator is an unbiased estimator if the expected value of its
sampling distribution equals the true value of the population parameter.
Minimum Variance:
Just as we want the mean of the sampling distribution to be centered on the
true population value, so too it is desirable for the sampling distribution to be
as narrow (i.e. as precise) as possible.
Efficiency:
An estimator is efficient when it possesses both of the previous properties: it is
unbiased and has minimum variance compared with any other unbiased
estimator.
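Unbiasedness can be illustrated by simulation: averaging an unbiased estimator over many random samples approaches the true population value. A sketch with an assumed population mean of 10.0:

```python
import random

# The sample mean is an unbiased estimator of the population mean:
# its average over many repeated samples shows no systematic bias.
random.seed(0)
population_mean = 10.0

estimates = []
for _ in range(20000):
    sample = [random.gauss(population_mean, 2.0) for _ in range(5)]
    estimates.append(sum(sample) / len(sample))   # the estimator: sample mean

avg_estimate = sum(estimates) / len(estimates)
print(round(avg_estimate, 1))  # very close to 10.0
```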
Variable Rationalization:
Rationalizing is a way of describing, interpreting, or explaining something so
that it seems proper, more attractive, etc. Data Rationalization is a
managed-metadata practice that creates or extends an ontology for a domain into
the structured-data world, based on model objects stored in various models.
Logistic regression:
Logistic regression, or Logit regression, or Logit model is a regression model where
the dependent variable (DV) is categorical.
Types of Logistic Regression:
Binary Logistic Regression.
Multinomial Logistic Regression.
Binary Logistic Regression:
Logistic Regression is a classification algorithm. It is used to predict a binary
outcome (1 / 0, Yes / No, True / False) given a set of independent variables.
Logistic Regression is used to solve classification problems, so it is called a
classification algorithm: it models the probability of the output class.
It is a classification problem where your target element is categorical.
Unlike in Linear Regression, in Logistic Regression the required output is
represented in discrete values such as binary 0 and 1.
It estimates the relationship between a dependent variable (target) and one or
more independent variables (predictors) where the dependent variable is
categorical/nominal.
It resembles an S-shaped curve.
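The S-shaped curve is the sigmoid (logistic) function, which maps any real-valued score to a probability in (0, 1). A minimal sketch of the curve and a binary decision rule, assuming hypothetical fitted coefficients b0 = −4 and b1 = 1:

```python
import math

# Sigmoid (logistic) function: produces the S-shaped curve described above.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

b0, b1 = -4.0, 1.0            # hypothetical fitted intercept and slope

def predict_proba(x):
    return sigmoid(b0 + b1 * x)

def predict(x, threshold=0.5):
    # Class 1 if the modelled probability reaches the cut-off, else class 0
    return 1 if predict_proba(x) >= threshold else 0

print(predict_proba(4))        # 0.5 at the decision boundary b0 + b1*x = 0
print(predict(6), predict(2))  # 1 0
```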
Logit Odds:
The logit, or log odds, of a probability π is log(π / (1 − π)). We want to find
the relationship between this probability and the p explanatory variables
X1, X2, ..., Xp. The multiple logistic regression model is then

log(πj / πk) = β0j + β1j X1 + β2j X2 + ... + βpj Xp

for j = 1, 2, ..., (k − 1), where πk is the probability of the baseline category.
The model parameters are estimated by the method of maximum likelihood;
statistical software is available to do this fitting.
Hosmer Lemeshow Test:
The Hosmer–Lemeshow test is a statistical test for goodness of fit for logistic
regression models.
It is used frequently in risk prediction models.
The test assesses whether or not the observed event rates match expected
event rates in subgroups of the model population.
The Hosmer–Lemeshow test specifically identifies subgroups as the deciles of
fitted risk values.
Models for which expected and observed event rates in subgroups are
similar are called well calibrated.
The Hosmer–Lemeshow test statistic is given by:

H = Σ (from g = 1 to G) (Og − Eg)² / (Ng πg (1 − πg))

Here Og, Eg, Ng, and πg denote the observed events, expected events, number of
observations, and mean predicted risk for the g-th risk decile group,
respectively, and G is the number of groups.
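The statistic above sums one term per risk group; a sketch from per-group summaries, using made-up decile numbers for illustration:

```python
# Hosmer–Lemeshow statistic from per-group summaries.
def hosmer_lemeshow(observed, expected, sizes, risks):
    # Sum over groups: (O_g - E_g)^2 / (N_g * pi_g * (1 - pi_g))
    return sum(
        (o - e) ** 2 / (n * p * (1 - p))
        for o, e, n, p in zip(observed, expected, sizes, risks)
    )

# Hypothetical group summaries: observed events, expected events,
# group sizes, and mean predicted risk per group.
O = [2, 3, 5, 8, 12]
E = [2.5, 3.2, 4.8, 7.9, 11.6]
N = [50, 50, 50, 50, 50]
pi = [0.05, 0.064, 0.096, 0.158, 0.232]
H = hosmer_lemeshow(O, E, N, pi)
print(round(H, 3))  # small H -> observed and expected rates agree well
```

In practice H is compared against a chi-squared distribution to judge calibration; here expected and observed counts are close, so H is small.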
Error Matrix: A confusion matrix, also known as a contingency table or an error
matrix, is a specific table layout that allows visualization of the performance of an
algorithm, typically a supervised learning one (in unsupervised learning it is
usually called a matching matrix). Each column of the matrix represents the
instances in a predicted class while each row represents the instances in an actual
class (or vice-versa).
A table of confusion (sometimes also called a confusion matrix), is a table with two
rows and two columns that reports the number of false positives, false negatives,
true positives, and true negatives.
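The four cells of the table of confusion can be counted directly from actual and predicted labels; a minimal sketch with made-up binary labels:

```python
# Count the four cells of a 2x2 table of confusion.
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # illustrative labels
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)  # 3 1 1 3
```

Metrics such as accuracy ((tp + tn) / total) follow directly from these four counts.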