Unit-III (Data Analytics)
Regression – Concepts, Blue property assumptions, Least Square Estimation,
Variable Rationalization, and Model Building etc. Logistic Regression: Model
Theory, Model fit Statistics, Model Construction, Analytics applications to
various Business Domains etc.
Covariance:
Covariance is a measure of how much two random variables vary together. It’s
similar to variance, but where variance tells you how a single variable varies,
covariance tells you how two variables vary together.
A positive covariance would indicate a positive linear relationship between the
variables, and a negative covariance would indicate the opposite.
The sample covariance is calculated as:

Cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)

Where:
Xi – the values of the X-variable
Yi – the values of the Y-variable
X̄ – the mean (average) of the X-variable
Ȳ – the mean (average) of the Y-variable
n – the number of data points
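The formula above can be sketched in plain Python; this is a minimal illustration, assuming the sample divisor (n − 1) as in the formula:

```python
# Sample covariance: how two variables vary together.
def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum the products of each pair's deviations from the two means
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(covariance(x, y))  # 1.5 -> positive: the variables tend to rise together
```

A positive result indicates a positive linear relationship, a negative result the opposite.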
The simple linear regression model is:

y = β0 + β1x + ε

Where:
Variables:
x = Independent Variable (we provide this)
y = Dependent Variable (we observe this)
Parameters:
β0 = Intercept
The y-intercept of a line is the point at which the line crosses the y-axis (i.e.
where the x value equals 0).
β1 = Slope
The change in the mean of Y for a unit change in X.
ε = Residuals
A residual is the discrepancy between an actual and a predicted value. The
distance of the plotted points from the line gives the residual value.
Consider examples: a positive relation (y tends to increase as x increases) and a
negative relation (y tends to decrease as x increases).
Constructing a regression model for the following samples:

x: 1 2 3 4 5
y: 2 4 5 4 5
The procedure for finding the best-fitting line is called the least squares
method: it chooses the parameters that minimize the sum of squared residuals
(the distances of the plotted points from the line).
The RMSE is the square root of the variance of the residuals. Lower values
of RMSE indicate better fit. RMSE is a good measure of how accurately the model
predicts the response.
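A minimal least-squares fit of the samples above, computing the slope and intercept from the standard normal-equation formulas and then the RMSE:

```python
import math

# Least-squares fit for the sample data above (no libraries needed).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: covariance of x and y over variance of x; intercept from the means
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

pred = [b0 + b1 * xi for xi in x]
rmse = math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / n)
print(round(b0, 2), b1, round(rmse, 4))  # 2.2 0.6 0.6928
```

The fitted line for this data is y = 2.2 + 0.6x; the RMSE summarizes how far the observed points sit from that line.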
R Square Method – Goodness of Fit
R-Squared (R² or the coefficient of determination) is a statistical measure in a
regression model that determines the proportion of variance in the dependent
variable that can be explained by the independent variable. In other words,
R-squared shows how well the data fit the regression model (the goodness of fit).
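R-squared can be computed as 1 minus the ratio of unexplained to total variation; a sketch using the line y = 2.2 + 0.6x, the least-squares fit of the earlier samples:

```python
# R-squared (coefficient of determination) for a fitted line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
mean_y = sum(y) / len(y)

pred = [2.2 + 0.6 * xi for xi in x]                        # fitted values
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))    # unexplained variation
ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.6 -> the line explains 60% of the variance in y
```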
Unbiasedness:
Formally, an estimator is an unbiased estimator if the expected value of its
sampling distribution equals the true value of the population parameter.
Minimum Variance:
Just as we want the mean of the sampling distribution to be centered on the
true population value, so too it is desirable for the sampling distribution to be
as narrow (i.e. as precise) as possible.
Efficiency:
An estimator is efficient when it possesses both of the previous properties: it is
unbiased and has minimum variance compared with any other unbiased
estimator.
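Unbiasedness can be illustrated by simulation: averaging an unbiased estimator over many random samples approaches the true population value. A sketch with an assumed population mean of 10.0:

```python
import random

# The sample mean is an unbiased estimator of the population mean:
# its average over many repeated samples shows no systematic bias.
random.seed(0)
population_mean = 10.0

estimates = []
for _ in range(20000):
    sample = [random.gauss(population_mean, 2.0) for _ in range(5)]
    estimates.append(sum(sample) / len(sample))   # the estimator: sample mean

avg_estimate = sum(estimates) / len(estimates)
print(round(avg_estimate, 1))  # very close to 10.0
```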
Variable Rationalization:
Rationalizing is a way of describing, interpreting, or explaining something so
that it seems proper, more attractive, etc. Data Rationalization is a
managed-metadata practice that creates or extends an ontology for a domain into
the structured-data world, based on model objects stored in various models.
Logistic regression:
Logistic regression, or Logit regression, or Logit model is a regression model where
the dependent variable (DV) is categorical.
Types of Logistic Regression:
Binary Logistic Regression.
Multinomial Logistic Regression.
Binary Logistic Regression:
Logistic Regression is a classification algorithm. It is used to predict a binary
outcome (1 / 0, Yes / No, True / False) given a set of independent variables.
Logistic Regression is used to solve classification problems, so it is called a
classification algorithm: it models the probability of the output class.
It is a classification problem where your target element is categorical.
Unlike in Linear Regression, in Logistic Regression the required output is
represented in discrete values such as binary 0 and 1.
It estimates the relationship between a dependent variable (target) and one or
more independent variables (predictors) where the dependent variable is
categorical/nominal.
It resembles an S-shaped curve.
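The S-shaped curve is the sigmoid (logistic) function, which maps any real-valued score to a probability in (0, 1). A minimal sketch of the curve and a binary decision rule, assuming hypothetical fitted coefficients b0 = −4 and b1 = 1:

```python
import math

# Sigmoid (logistic) function: produces the S-shaped curve described above.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

b0, b1 = -4.0, 1.0            # hypothetical fitted intercept and slope

def predict_proba(x):
    return sigmoid(b0 + b1 * x)

def predict(x, threshold=0.5):
    # Class 1 if the modelled probability reaches the cut-off, else class 0
    return 1 if predict_proba(x) >= threshold else 0

print(predict_proba(4))        # 0.5 at the decision boundary b0 + b1*x = 0
print(predict(6), predict(2))  # 1 0
```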
Logit Odds:
The logit, or log odds, of a probability π is log(π / (1 − π)). We want to find
the relationship between this probability and the p explanatory variables
X1, X2, ..., Xp. The multiple logistic regression model is then

log(πj / πk) = β0j + β1j X1 + β2j X2 + ... + βpj Xp

for j = 1, 2, ..., (k − 1), where πk is the probability of the baseline category.
The model parameters are estimated by the method of maximum likelihood;
statistical software is available to do this fitting.
Hosmer Lemeshow Test:
The Hosmer–Lemeshow test is a statistical test for goodness of fit for logistic
regression models.
It is used frequently in risk prediction models.
The test assesses whether or not the observed event rates match expected
event rates in subgroups of the model population.
The Hosmer–Lemeshow test specifically identifies subgroups as the deciles of
fitted risk values.
Models for which expected and observed event rates in subgroups are
similar are called well calibrated.
The Hosmer–Lemeshow test statistic is given by:

H = Σ (from g = 1 to G) (Og − Eg)² / (Ng πg (1 − πg))

Here Og, Eg, Ng, and πg denote the observed events, expected events, number of
observations, and mean predicted risk for the g-th risk decile group,
respectively, and G is the number of groups.
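The statistic above sums one term per risk group; a sketch from per-group summaries, using made-up decile numbers for illustration:

```python
# Hosmer–Lemeshow statistic from per-group summaries.
def hosmer_lemeshow(observed, expected, sizes, risks):
    # Sum over groups: (O_g - E_g)^2 / (N_g * pi_g * (1 - pi_g))
    return sum(
        (o - e) ** 2 / (n * p * (1 - p))
        for o, e, n, p in zip(observed, expected, sizes, risks)
    )

# Hypothetical group summaries: observed events, expected events,
# group sizes, and mean predicted risk per group.
O = [2, 3, 5, 8, 12]
E = [2.5, 3.2, 4.8, 7.9, 11.6]
N = [50, 50, 50, 50, 50]
pi = [0.05, 0.064, 0.096, 0.158, 0.232]
H = hosmer_lemeshow(O, E, N, pi)
print(round(H, 3))  # small H -> observed and expected rates agree well
```

In practice H is compared against a chi-squared distribution to judge calibration; here expected and observed counts are close, so H is small.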
Error Matrix: A confusion matrix, also known as a contingency table or an error
matrix, is a specific table layout that allows visualization of the performance of an
algorithm, typically a supervised learning one (in unsupervised learning it is
usually called a matching matrix). Each column of the matrix represents the
instances in a predicted class while each row represents the instances in an actual
class (or vice-versa).
A table of confusion (sometimes also called a confusion matrix), is a table with two
rows and two columns that reports the number of false positives, false negatives,
true positives, and true negatives.
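The four cells of the table of confusion can be counted directly from actual and predicted labels; a minimal sketch with made-up binary labels:

```python
# Count the four cells of a 2x2 table of confusion.
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # illustrative labels
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)  # 3 1 1 3
```

Metrics such as accuracy ((tp + tn) / total) follow directly from these four counts.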