Statistical Data Science
i) Statistical data analysis is the collection and interpretation of data in order to uncover patterns or trends.
ii) It is a component of data analysis.
iii) It is the science of collecting, exploring, and presenting large amounts of data to discover underlying patterns
and trends.
What is the role of Statistics in Data Science?
i. Data Exploration :-
Basic statistical descriptions can be used to learn more about each feature, such as the measures of central
tendency (mean, median, and mode), which give us an idea of the "middle" or center of a distribution.
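These basic descriptions are easy to compute; below is a minimal sketch using Python's built-in statistics module on a made-up list of quiz marks (the same marks reused in the median example later):

    import statistics

    marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]  # illustrative quiz marks

    print(statistics.mean(marks))       # arithmetic mean -> 6.8
    print(statistics.median(marks))     # middle value -> 7.0
    print(statistics.multimode(marks))  # all most-frequent values -> [8, 7, 5]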
ii. Data Cleaning :-
Knowing basic statistics about each feature makes it easier to fill in missing values, smooth noisy values, and find
outliers; it can also help in fixing inconsistencies incurred during data integration.
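As an illustration, one common cleaning step is filling gaps in a feature with its median; this sketch uses pandas, and the age values are hypothetical:

    import pandas as pd

    # Hypothetical feature with missing entries (None marks a gap).
    age = pd.Series([23, 25, None, 30, 22, None, 28])

    # Fill missing values with the feature's median (25.0 here).
    age_filled = age.fillna(age.median())
    print(age_filled.tolist())  # [23.0, 25.0, 25.0, 30.0, 22.0, 25.0, 28.0]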
iii. Data Transformation :-
This requires data sampling and feature selection methods, data transforms, scaling, normalization, and encoding.
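For example, min-max scaling is one simple normalization transform; a minimal sketch (scikit-learn's MinMaxScaler performs the same rescaling on arrays):

    def min_max_scale(values):
        # Rescale values linearly so the minimum maps to 0 and the maximum to 1.
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    print(min_max_scale([3, 5, 7, 10]))  # [0.0, 0.2857..., 0.5714..., 1.0]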
iv. Machine Learning :-
Probability distributions and estimation methods are required by machine learning algorithms.
v. Data Visualization :-
Plotting the measures of central tendency shows whether the data are symmetric or skewed.
Plots such as quantile plots, histograms, scatter plots, treemaps, correlation heat maps,
and other data visualization types give much more powerful insight than plain data,
and they also make the data more readable and interesting.
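As a small example, a histogram of the quiz marks used earlier can reveal skewness at a glance; this sketch assumes matplotlib is available:

    import matplotlib.pyplot as plt

    marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]

    plt.hist(marks, bins=5, edgecolor="black")  # frequency of marks per bin
    plt.xlabel("Marks")
    plt.ylabel("Frequency")
    plt.title("Distribution of quiz marks")
    plt.show()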
Define Descriptive Statistics. List its categories.
Descriptive statistics provides ways for describing, presenting, summarizing, and organizing the data, either through
numerical calculations or graphs or tables.
It helps us to organize, represent and describe data using tables, graphs, and summary measures.
Descriptive Statistics
i. Measures of Central Tendency: Mean, Median, Mode
ii. Measures of Dispersion: Range, Interquartile Range, Standard Deviation, Variance
Explain the measures of central tendency in brief.
i. Mean :-
The most common and effective numeric measure of the "center" of a set of data is the mean.
It is the sum of all the observations divided by the sample size.
i) Arithmetic mean :-
The arithmetic mean is simply obtained by adding all the values and then dividing the sum by the total number of
values.
ii) Harmonic mean :-
The harmonic mean is calculated as the number of values N divided by the sum of the reciprocal of the values (1 over
each value).
iii) Geometric Mean :-
A geometric mean is an average which shows the central tendency of a set of numbers by using the product of
their values: it is the Nth root of the product of N values.
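The three means can be compared directly with Python's statistics module (geometric_mean requires Python 3.8+); the values here are illustrative:

    import statistics

    values = [2, 4, 8]

    print(statistics.mean(values))            # arithmetic: (2 + 4 + 8) / 3 = 4.666...
    print(statistics.harmonic_mean(values))   # 3 / (1/2 + 1/4 + 1/8) = 3.428...
    print(statistics.geometric_mean(values))  # (2 * 4 * 8) ** (1/3) = 4.0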
ii. Median :-
It is the middle value of data.
It is the value that separates the higher half of a data set from the lower half.
It splits the data in half and is also called the 50th percentile.
If the number of elements in the dataset is odd, the middle element is the median.
If the number of elements in the dataset is even, the median would be the average of two central elements.
Let us calculate the median of marks obtained by 10 students in a quiz: 8, 3, 7, 6, 9, 10, 5, 7, 8, 5.
We first arrange them in increasing order: 3, 5, 5, 6, 7, 7, 8, 8, 9, 10. Since there is an even number of elements, we
take the average of the middle two values, i.e., (7+7)/2 = 7.
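The same result can be verified in Python (statistics.median sorts the data and averages the two middle values for even-sized input):

    import statistics

    marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
    print(statistics.median(marks))  # -> 7.0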
Advantages :-
For skewed (asymmetric) data, a better measure of the center of data is the median.
Disadvantages :-
The median is expensive to compute when we have a large number of observations.
iii. Mode :-
The mode is another measure of central tendency.
It is the value that occurs most frequently in a dataset.
It is possible for several different values to have the maximum frequency, which results in more than one mode.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
Advantages :-
1. It can be determined for qualitative and quantitative attributes.
2. It is not affected by extreme values.
Disadvantages :-
1. The mode is not suitable for further statistical analysis or algebraic calculation.
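Because several values can share the maximum frequency, statistics.multimode (Python 3.8+) is a convenient way to list all modes; a minimal sketch on the sorted quiz marks:

    import statistics

    # 5, 7, and 8 each occur twice, so the data are trimodal.
    print(statistics.multimode([3, 5, 5, 6, 7, 7, 8, 8, 9, 10]))  # -> [5, 7, 8]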
Define Inferential Statistics. List its categories.
Statistical inference is a method of making decisions about the parameters of a population, based on random
sampling.
Statistical inference mainly deals with two different kinds of problem-hypothesis testing and estimation of
parameter values
Hypothesis testing is used to check whether a stated hypothesis is accepted or rejected.
Hypothesis testing can be classified as parametric tests and non-parametric tests.
There can be two hypotheses: the null hypothesis (H₀) and the alternative hypothesis (H₁).
A Type I error occurs in hypothesis testing when a true null hypothesis is rejected.
A Type II error occurs in hypothesis testing when we accept a false null hypothesis.
Parametric tests are those tests for which we have prior knowledge of the population distribution.
Important parametric tests are z-test, t-test, ANOVA and chi-square test.
Non-parametric tests are used when information about the population is unknown and hence no assumptions can
be made regarding the population.
The task of estimation of parameter values involves making inferences from a given sample about an unknown
population parameter.
Parameter estimation methods are point estimates or interval estimates.
Inferential Statistics
i. Hypothesis Testing
ii. Estimation of Parameter Values
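As an illustration of a parametric test, the sketch below runs a one-sample t-test with SciPy; the marks sample and the hypothesized population mean of 6 are made up for the example:

    from scipy import stats

    marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]

    # H0: the population mean of marks is 6; H1: it is not.
    t_stat, p_value = stats.ttest_1samp(marks, popmean=6.0)
    print(t_stat, p_value)

    # Reject H0 at the 5% significance level if p_value < 0.05;
    # otherwise we fail to reject it.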
What is Outlier?
An outlier is a data point that differs significantly from other observations.
Explain the measures of Dispersion in brief.
i. Range :-
The range of the set is the difference between the largest ( max() ) and the smallest ( min() ) values.
In simple terms, it is the difference between the largest and smallest value in the set.
Range = Max – Min
iii. Variance :-
Variance is calculated by finding the square of the standard deviation of the given data distribution.
Variance measures how far a data set is spread out.
It is mathematically defined as the average of the squared differences from the mean.
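Range, variance, and standard deviation can all be computed with the standard library; a minimal sketch on the quiz marks used earlier (pvariance and pstdev treat the data as the whole population):

    import statistics

    values = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]

    print(max(values) - min(values))     # range: 10 - 3 = 7
    print(statistics.pvariance(values))  # mean of squared deviations from the mean
    print(statistics.pstdev(values))     # standard deviation = sqrt(variance)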
Explain the methods of outlier detection in brief.
Unsupervised :-
In some application scenarios, objects labeled as “normal” or "outlier" are not available.
Thus, an unsupervised learning method has to be used.
Unsupervised outlier detection methods make an implicit assumption that the normal objects are somewhat
"clustered."
Semi-supervised :- Semi-supervised outlier detection methods use the available labeled normal objects, together
with unlabeled objects that are close by, to train a model for normal objects. The model of normal objects can then
be used to detect outliers (those objects not fitting the model of normal objects).
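A very small unsupervised detector can be built directly on the "clustered" assumption above: treat values far from the cluster around the mean as outliers. The z-score threshold and the sample data below are illustrative choices, not a standard prescribed by these notes:

    import statistics

    def z_score_outliers(values, threshold=2.0):
        # Flag values more than `threshold` standard deviations from the mean,
        # assuming "normal" values cluster around it.
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        return [v for v in values if abs(v - mu) > threshold * sigma]

    print(z_score_outliers([10, 12, 11, 13, 12, 95]))  # -> [95]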