Assignment Answers
A1)
Raw datasets often contain attributes measured on very different scales.
For example, the age of employees in a company may be between 21 and 70 years,
the size of the houses they live in may be 500-5000 sq. feet, and their salaries
may range from $30,000 to $80,000.
In a distance-based computation, the age feature would contribute almost nothing,
because its values are up to several orders of magnitude smaller than those of the
other features. However, it may contain important information that is useful for
the task.
Hence, we need to rescale the features independently to the same range, say
[0,1], so that they contribute equally when computing the distance.
We calculate (min-max) normalization by:
x_new = (x - x_min)/(x_max - x_min)
After applying the formula, the largest value becomes 1 and the smallest value
becomes 0, so every value lies between 0 and 1.
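The normalization formula above can be sketched in a few lines of Python. This is only a minimal illustration: the function name min_max_normalize and the sample ages and salaries are made up for the example, and mapping a constant feature to 0.0 is an assumption, not something the formula prescribes.

```python
def min_max_normalize(values):
    """Rescale numbers to [0, 1] using (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature has no spread, so map it to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


# Hypothetical sample values, chosen inside the ranges mentioned in the text.
ages = [21, 35, 70]
salaries = [30000, 55000, 80000]

normalized_ages = min_max_normalize(ages)
normalized_salaries = min_max_normalize(salaries)
```

After rescaling, both features span the same [0, 1] range, so neither dominates a distance computation purely because of its units.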
Data standardization is the process of rescaling one or more attributes so that
they have a mean value of 0 and a standard deviation of 1.
We calculate standardization by:
x_new = (x-μ)/σ
As a rule of thumb, when we don't know the distribution of our data, or the
distribution is not Gaussian (a bell curve), we go for normalization. If our data
has a Gaussian (a bell curve) distribution, we go for standardization.
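The standardization formula x_new = (x-μ)/σ can be sketched as follows. The function name standardize and the sample ages are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption made for this illustration.

```python
import statistics


def standardize(values):
    """Rescale numbers to mean 0 and standard deviation 1 via (x - mu) / sigma."""
    mu = statistics.fmean(values)
    # Assumption: population standard deviation; statistics.stdev would give
    # the sample version instead.
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


ages = [21, 35, 70]  # hypothetical sample values
z_scores = standardize(ages)
```

Unlike min-max normalization, the result is not confined to a fixed interval; values far from the mean simply get large positive or negative scores.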
A2)
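VIF = 1/(1 - r^2)
VIF = 1: not correlated.
VIF between 1 and 5: moderately correlated.
VIF greater than 5: highly correlated.
Given VIF = 2:
2 = 1/(1 - r^2)
1 - r^2 = 1/2
r^2 = 1 - 1/2
r^2 = 0.5
|r| = 0.707 (approximately)
As the r^2 value is 50% and the VIF of 2 lies between 1 and 5, we can say that
there is a moderate relationship between the two variables. The sign of r cannot
be determined from the VIF alone.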
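The arithmetic above can be checked with a short Python sketch. The helper names vif_from_r2 and r2_from_vif are hypothetical, introduced only for this illustration.

```python
import math


def vif_from_r2(r2):
    """VIF = 1 / (1 - r^2); defined for 0 <= r^2 < 1."""
    if not 0 <= r2 < 1:
        raise ValueError("r^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Invert the formula: r^2 = 1 - 1/VIF; defined for VIF >= 1."""
    if vif < 1:
        raise ValueError("VIF must be at least 1")
    return 1.0 - 1.0 / vif


r2 = r2_from_vif(2)      # r^2 for a VIF of 2
abs_r = math.sqrt(r2)    # magnitude of r only; the sign is not recoverable
```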
A3)
The chi-square test for independence is applied when you have two categorical
variables from a single population. It is used to determine whether there is a
significant association between the two variables.
For example, in a learning preference survey, students might be classified by
gender (male or female) and studying preference (online/books/classes).
We could use a chi-square test for independence to determine whether gender is
related to studying preference.
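The test procedure is appropriate when the following conditions are met: the
sample is drawn at random, both variables are categorical, and the expected
frequency count in each cell of the contingency table is at least 5.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results.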
State the hypotheses
Variable gender has 2 levels, and variable studying preference has 3 levels.
The null hypothesis states that knowing the level of variable gender does not help
you predict the level of variable studying preference. That is, the variables are
independent.
The alternative hypothesis is that knowing the level of variable gender can help
you predict the level of variable studying preference.
Formulate an analysis plan
For the analysis, we will use a significance level of 0.05. Using sample data, we
will conduct a chi-square test for independence.
Analyze sample data
Applying the chi-square test for independence to sample data, we compute the
degrees of freedom, the expected frequency counts, and the chi-square test
statistic. Based on the chi-square statistic and the degrees of freedom, we
determine the P-value.
Interpret results
If the P-value is less than the significance level (0.05), we reject the null
hypothesis and conclude that there is a relationship between gender and studying
preference. Otherwise, we fail to reject the null hypothesis.
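The four steps above can be sketched in plain Python. The counts in the table are invented purely for illustration and are not survey results, and the function name chi_square_independence is hypothetical. The P-value line relies on the fact that, for exactly 2 degrees of freedom (as in this 2 x 3 table), the chi-square upper-tail probability is exp(-x/2); for other table sizes a statistics library such as SciPy (scipy.stats.chi2_contingency) would be the usual choice.

```python
import math


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count for a cell = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts. Columns: online, books, classes.
observed = [
    [30, 10, 20],  # male
    [20, 25, 15],  # female
]

statistic, dof, expected = chi_square_independence(observed)

# Valid only because dof == 2 here: the survival function is exp(-x / 2).
p_value = math.exp(-statistic / 2)
reject_null = p_value < 0.05  # significance level from the analysis plan
```

With these made-up counts the null hypothesis of independence would be rejected at the 0.05 level; real survey data may of course lead to the opposite conclusion.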
A4)
Boxplots summarize data by visualizing the five-number summary, which consists of
the minimum value, first quartile, median, third quartile, and maximum value of a
data set.
With the help of box plots, we can easily identify the values that lie beyond the
upper and lower limits (commonly Q1 - 1.5*IQR and Q3 + 1.5*IQR). These values are
considered outliers, and we can discard them from our dataset before making any
further observations, for more accurate results.
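The outlier check that a boxplot performs visually can be sketched numerically. This example assumes the common 1.5 * IQR rule for the upper and lower limits; the function name iqr_outliers, the sample data, and the choice of the "inclusive" quartile method are all assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return ((lower limit, upper limit), outliers) using the k * IQR rule."""
    # Quartiles computed with the "inclusive" method; other methods can give
    # slightly different cut points on small samples.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return (lower, upper), outliers


data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical sample values
limits, outliers = iqr_outliers(data)

# Keep only the values inside the limits.
cleaned = [x for x in data if x not in outliers]
```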
A5)
First, we perform a missing value analysis and check the percentage of values that
are missing for each variable of our dataset. If it is less than 30%, we go for
imputation; otherwise, we remove all rows with null values from our dataset.
When going for imputation, we randomly choose one known value from a particular
column, set it to NA, and save the original value so that we can later check
whether the predicted value is close to the actual value.
After trying the candidate imputation methods, the method that gives the value
closest to the actual value is used, and all the null values are imputed using
that method.
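The procedure above can be sketched as follows. This is a minimal illustration under several assumptions: the column values are made up, mean and median stand in for whatever imputation methods are actually being compared, the helper names are hypothetical, and the held-out position is fixed rather than random so that the example is reproducible.

```python
import statistics


def missing_percentage(column):
    """Percentage of entries in the column that are missing (None)."""
    return 100.0 * sum(v is None for v in column) / len(column)


def best_imputer(column, holdout_index, imputers):
    """Hide one known value and return the method whose fill value is closest to it."""
    actual = column[holdout_index]
    observed = [
        v for i, v in enumerate(column) if v is not None and i != holdout_index
    ]
    errors = {name: abs(fn(observed) - actual) for name, fn in imputers.items()}
    return min(errors, key=errors.get), errors


# Hypothetical column with two missing values and one extreme value.
column = [25, 27, None, 30, 120, 28, None, 26, 29, 31]
# Assumption: mean and median are the candidate methods.
imputers = {"mean": statistics.fmean, "median": statistics.median}

pct_missing = missing_percentage(column)
if pct_missing < 30:
    # Fixed holdout position (value 30) instead of a random one, for reproducibility.
    best_name, errors = best_imputer(column, holdout_index=3, imputers=imputers)
    fill_value = imputers[best_name]([v for v in column if v is not None])
    imputed = [fill_value if v is None else v for v in column]
else:
    # Following the text: drop the rows (here, entries) with null values.
    imputed = [v for v in column if v is not None]
```

In practice the comparison is more trustworthy when several values are held out and the errors are averaged, since a single hidden value can favor one method by chance.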