Statistical Data Science

 Define Statistical Data Analysis.

i) Statistical data analysis is the collection and interpretation of data in order to uncover patterns and trends.
ii) It is a component of data analysis.
iii) It is the science of collecting, exploring, and presenting large amounts of data to discover underlying patterns and
trends.
 What is the role of Statistics in Data Science?

i. Data Exploration :-

Basic statistical descriptions can be used to learn more about each feature, such as the measures of central tendency:
the mean (average value), the median, and the mode, which give us an idea of the "middle" or center of a distribution.

ii. Data Cleaning :-

Knowing basic statistics about each feature makes it easier to fill in missing values, smooth noisy values, and find
outliers, which also helps in fixing inconsistencies incurred during data integration.

iii. Data Transformation :-

This involves data sampling and feature selection methods, as well as data transforms such as scaling, normalization, and encoding.

iv. Probability Distribution and Estimation :-

Probability distributions and estimation methods are required by machine learning algorithms.

v. Data Visualization :-

Plotting the measures of central tendency shows whether the data are symmetric or skewed.
Plots such as quantile plots, histograms, scatter plots, treemaps, correlation heat maps,
and other data visualization types give much more powerful insights than plain data
and also make the data more readable and interesting.
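As a minimal, illustrative sketch of the exploration, cleaning, and visualization roles above, the following snippet uses pandas and matplotlib; the dataset and the column name "scores" are assumptions made only for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; the "scores" column is assumed for illustration only.
df = pd.DataFrame({"scores": [8, 3, 7, 6, 9, 10, 5, 7, None, 5]})

# Data exploration: basic statistical descriptions of the feature.
print(df["scores"].describe())               # count, mean, std, quartiles
print("median:", df["scores"].median())
print("mode:", df["scores"].mode().tolist())

# Data cleaning: fill the missing value with a measure of center.
df["scores"] = df["scores"].fillna(df["scores"].median())

# Data visualization: a histogram shows whether data are symmetric or skewed.
df["scores"].plot.hist(bins=5, title="Distribution of scores")
plt.show()
```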
 Define Descriptive Statistics. List its categories.

Descriptive statistics provides ways of describing, presenting, summarizing, and organizing data, either through
numerical calculations or through graphs and tables.
It helps us to organize, represent, and describe data using tables, graphs, and summary measures.

Descriptive statistics is divided into three categories:
i) Measures of Frequency
ii) Measures of Central Tendency: Mean, Median, Mode
iii) Measures of Dispersion: Range, Interquartile Range, Standard Deviation, Variance
 Explain the measures of central tendency in brief.

i. Mean :-
The most common and effective numeric measure of the "center" of a set of data is the mean.
It is the sum of all the observations divided by the sample size.
i) Arithmetic mean :-
The arithmetic mean is simply obtained by adding all the values and then dividing the sum by the total number of
values.
ii) Harmonic mean :-
The harmonic mean is calculated as the number of values N divided by the sum of the reciprocal of the values (1 over
each value).
iii) Geometric Mean :-
A geometric mean is a mean or average which shows the central tendency of a set of numbers by using the product
of their values.
ii. Median :-
It is the middle value of data.
It is the value that separates the higher half of a data set from the lower half.
It splits the data in half and is also called the 50th percentile.
If the number of elements in the dataset is odd, the middle element is the median.
If the number of elements in the dataset is even, the median would be the average of two central elements.
Let us calculate the median of marks obtained by 10 students in a quiz: 8, 3, 7, 6, 9, 10, 5, 7, 8, 5.
We first arrange them in increasing order: 3, 5, 5, 6, 7, 7, 8, 8, 9, 10. Since there is an even number of elements, we take the
average of the middle two values, i.e., (7 + 7)/2 = 7.
Advantages :-
For skewed (asymmetric) data, a better measure of the center of data is the median.
Disadvantages :-
The median is expensive to compute when we have a large number of observations.

iii. Mode :-
The mode is another measure of central tendency.
It is the value that occurs most frequently in a dataset.
It is possible for several different values to have the maximum frequency, which results in more than one mode.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
Advantages :-
1. It can be determined for qualitative and quantitative attributes.
2. It is not affected by extreme values.
Disadvantages :-
1. Mode is not applicable for further statistical analysis and algebraic calculation.
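A minimal sketch of these measures using Python's built-in statistics module; the data are the quiz marks from the median example above.

```python
import statistics

data = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]   # quiz marks from the example above

# Arithmetic, harmonic, and geometric means.
print(statistics.mean(data))              # sum / count = 6.8
print(statistics.harmonic_mean(data))     # n / sum of reciprocals
print(statistics.geometric_mean(data))    # nth root of the product

# Median: averages the two central elements when n is even.
print(statistics.median(data))            # (7 + 7) / 2 = 7.0

# Mode(s): multimode returns every value with the maximum frequency.
print(statistics.multimode(data))         # [8, 7, 5] -> trimodal
```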
 Define Inferential Statistics. List its categories.
Statistical inference is a method of making decisions about the parameters of a population, based on random
sampling.
Statistical inference mainly deals with two different kinds of problems: hypothesis testing and estimation of
parameter values.
Hypothesis testing is used to check whether a stated hypothesis is accepted or rejected.
Hypothesis testing can be classified as parametric tests and non-parametric tests.
There can be two hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha).
A Type I error occurs in hypothesis testing when a true null hypothesis is rejected.
A Type II error occurs in hypothesis testing when we accept a false null hypothesis.
Parametric tests are those tests for which we have prior knowledge of the population distribution.
Important parametric tests are z-test, t-test, ANOVA and chi-square test.
Non-parametric tests are used when information about the population is unknown and hence no assumptions can
be made regarding the population.
The task of estimation of parameter values involves making inferences from a given sample about an unknown
population parameter.
Parameter estimation methods are point estimates or interval estimates.

Inferential statistics is divided into two categories:
i) Hypothesis Testing: Parametric tests, Non-Parametric tests
ii) Parameter Estimation: Point Estimate, Interval Estimate

 What is Hypothesis Testing?


Hypothesis testing is an important inferential statistics technique that is widely used in data science.
A hypothesis is a testable claim, and only one of H0 and Ha can be supported.
The process of testing which of them holds is called the "Hypothesis Testing Process".
If we accept H0, Ha is automatically rejected, and vice versa.
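As an illustrative sketch, a one-sample t-test with SciPy; the sample data, the hypothesized mean of 7, and the 0.05 significance level are all assumptions for this example.

```python
from scipy import stats

# Hypothetical sample; H0: population mean = 7, Ha: population mean != 7.
sample = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]

t_stat, p_value = stats.ttest_1samp(sample, popmean=7)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Reject H0 only if p falls below the chosen significance level.
alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```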

 What is Outlier?
An outlier is a data point that differs significantly from other observations.
 Explain the measures of Dispersion in brief.

i. Range :-
The range of a data set is the difference between the largest (max) and the smallest (min) values in the set.
Range = Max – Min

ii. Standard Deviation :-


Standard deviation is found by taking the square root of the sum of squared deviations from the mean divided by the
number of observations in a given dataset: σ = √( Σ(xᵢ − μ)² / N ).

iii. Variance :-
Variance is calculated as the square of the standard deviation of a given data distribution.
Variance measures how far a data set is spread out.
It is mathematically defined as the average of the squared differences from the mean: σ² = Σ(xᵢ − μ)² / N.

iv. Interquartile Range :-


The interquartile range is calculated by finding the difference between the third quartile and the first quartile.
Interquartile Range = Q3 – Q1
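A minimal sketch of the four measures with NumPy (the data are illustrative; note that np.std and np.var divide by N, the population form, by default).

```python
import numpy as np

data = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5])

# Range: largest value minus smallest value.
print("range:", data.max() - data.min())    # 10 - 3 = 7

# Standard deviation and variance (variance = std dev squared).
print("std dev:", data.std())
print("variance:", data.var())

# Interquartile range: Q3 - Q1.
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)
```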

 Explain the methods of parameter estimation.


i) Point Estimate :-
Point estimators are functions that are used to find an approximate value of a population parameter from random
samples of the population.
They use the sample data of a population to calculate a point estimate or a statistic that serves as the best estimate
of an unknown parameter of a population.
ii) Interval Estimate :-
The interval estimation of a population parameter gives two values between which the population parameter is
likely to lie.
These two values define an interval within which the parameter value of the population falls with a given probability.
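As a hedged sketch, the sample mean serves as a point estimate and a 95% confidence interval serves as an interval estimate; the data and the 95% level are assumptions for this example.

```python
import numpy as np
from scipy import stats

sample = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5])

# Point estimate: the sample mean as the single best guess of the population mean.
point_estimate = sample.mean()
print("point estimate:", point_estimate)

# Interval estimate: a 95% confidence interval based on the t-distribution.
low, high = stats.t.interval(0.95, df=len(sample) - 1,
                             loc=point_estimate, scale=stats.sem(sample))
print(f"95% interval estimate: ({low:.2f}, {high:.2f})")
```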
 Describe Data Matrix vs Dissimilarity Matrix.
Data Matrix :-
Data matrix (or object-by-attribute structure):
This structure stores the n data objects in the form of a relational table, or an n × p matrix (n objects, p attributes).
Each row corresponds to an object.
Dissimilarity Matrix :-
It is also called object-by-object structure.
This structure stores a collection of proximities that are available for all pairs of n objects.
It is often represented by an n × n table, where d(i, j) is the measured dissimilarity or "difference" between objects i and j.
d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar to each other,
and grows larger the more they differ.
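A minimal sketch contrasting the two structures with SciPy; the small data matrix and the Euclidean metric are chosen only for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects (rows) by p = 2 attributes (columns).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [5.0, 6.0],
              [5.5, 6.5]])

# Dissimilarity matrix: an n x n table of pairwise distances, where
# D[i, j] = d(i, j), the diagonal is 0, and the matrix is symmetric.
D = squareform(pdist(X, metric="euclidean"))
print(D)
```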

 Explain the Types of Outliers.


Outliers can be of three types: global, contextual, and collective.
Global :-
A global outlier is a data point that strongly deviates from all the rest of the data points in the dataset.
Contextual :-
A data point is considered a contextual outlier if its value deviates significantly from the rest of the data points in the
same context.
Collective :-
A collection of data points that is anomalous with respect to the entire data set is a collective outlier.

 Explain the Outlier Detection Methods.


Outlier detection can be done using supervised, unsupervised, and semi-supervised methods, as well as statistical
methods.
Supervised methods :-
Supervised methods model data normality and abnormality.
Domain experts examine and label a sample of the underlying data by identifying normal data and outlier data.
Outlier detection can then be modeled as a classification problem.
The model is then trained to identify outliers on the labeled data.

Unsupervised :-
In some application scenarios, objects labeled as “normal” or "outlier" are not available.
Thus, an unsupervised learning method has to be used.
Unsupervised outlier detection methods make an implicit assumption that the normal objects are somewhat
"clustered."
Semi-supervised :-
Semi-supervised outlier detection methods use the available labeled normal objects, together with unlabeled objects
that are close by, to train a model for normal objects. The model of normal objects can then be used to detect
outliers: those objects not fitting the model are classified as outliers.
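As a hedged sketch of the unsupervised, statistics-based idea, the snippet below flags points far from the center of the data; the z-score threshold of 3 is a common convention assumed here, not a rule from these notes.

```python
import numpy as np

# Hypothetical univariate data containing one obvious global outlier (42).
data = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5, 42])

# Flag points whose z-score (distance from the mean, measured in
# standard deviations) exceeds the threshold.
z_scores = (data - data.mean()) / data.std()
print("outliers:", data[np.abs(z_scores) > 3])   # -> [42]
```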
