Statistical Data Science

 Define Statistical Data Analysis.

i) Statistical data analysis is the collection and interpretation of data in order to uncover patterns and trends.
ii) It is a component of data analysis.
iii) It is the science of collecting, exploring, and presenting large amounts of data to discover underlying patterns and
trends.
 What is the role of Statistics in Data Science?

i. Data Exploration :-

Basic statistical descriptions can be used to learn more about each feature, such as the measures of central tendency:
the mean (average value), the median, and the mode, which give us an idea of the "middle" or center of a distribution.

ii. Data Cleaning :-

Knowing basic statistics about each feature makes it easier to fill in missing values, smooth noisy values, and find
outliers, which also helps in fixing inconsistencies incurred during data integration.

iii. Data Transformation :-

This involves data sampling and feature selection methods, as well as data transforms such as scaling, normalization, and encoding.

iv. Probability Distribution and Estimation :-

Probability distributions and estimation methods are required by machine learning algorithms.

v. Data Visualization :-

Plotting the measures of central tendency shows whether the data are symmetric or skewed.
Plots such as quantile plots, histograms, scatter plots, treemaps, correlation heat maps,
and other data visualization types give much more powerful insights than plain data
and also make the data more readable and interesting.
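As a minimal, illustrative sketch of the exploration, cleaning, and visualization roles above, the following snippet uses pandas and matplotlib; the dataset and the column name "scores" are assumptions made only for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; the "scores" column is assumed for illustration only.
df = pd.DataFrame({"scores": [8, 3, 7, 6, 9, 10, 5, 7, None, 5]})

# Data exploration: basic statistical descriptions of the feature.
print(df["scores"].describe())               # count, mean, std, quartiles
print("median:", df["scores"].median())
print("mode:", df["scores"].mode().tolist())

# Data cleaning: fill the missing value with a measure of center.
df["scores"] = df["scores"].fillna(df["scores"].median())

# Data visualization: a histogram shows whether data are symmetric or skewed.
df["scores"].plot.hist(bins=5, title="Distribution of scores")
plt.show()
```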
 Define Descriptive Statistics. List its categories.

Descriptive statistics provides ways of describing, presenting, summarizing, and organizing data, either through
numerical calculations or through graphs and tables.
It helps us to organize, represent, and describe data using tables, graphs, and summary measures.

Descriptive statistics is divided into three categories:
i) Measures of Frequency
ii) Measures of Central Tendency: Mean, Median, Mode
iii) Measures of Dispersion: Range, Interquartile Range, Standard Deviation, Variance
 Explain the measures of central tendency in brief.

i. Mean :-
The most common and effective numeric measure of the "center" of a set of data is the mean.
It is the sum of all the observations divided by the sample size.
i) Arithmetic mean :-
The arithmetic mean is simply obtained by adding all the values and then dividing the sum by the total number of
values.
ii) Harmonic mean :-
The harmonic mean is calculated as the number of values N divided by the sum of the reciprocal of the values (1 over
each value).
iii) Geometric Mean :-
A geometric mean is a mean or average which shows the central tendency of a set of numbers by using the product
of their values.
ii. Median :-
It is the middle value of data.
It is the value that separates the higher half of a data set from the lower half.
It splits the data in half and is also called the 50th percentile.
If the number of elements in the dataset is odd, the middle element is the median.
If the number of elements in the dataset is even, the median would be the average of two central elements.
Let us calculate the median of marks obtained by 10 students in a quiz: 8, 3, 7, 6, 9, 10, 5, 7, 8, 5.
We first arrange them in increasing order: 3, 5, 5, 6, 7, 7, 8, 8, 9, 10. Since there is an even number of elements, we take the
average of the middle two values, i.e., (7 + 7)/2 = 7.
Advantages :-
For skewed (asymmetric) data, a better measure of the center of data is the median.
Disadvantages :-
The median is expensive to compute when we have a large number of observations.

iii. Mode :-
The mode is another measure of central tendency.
It is the value that occurs most frequently in a dataset.
It is possible for several different values to have the maximum frequency, which results in more than one mode.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
Advantages :-
1. It can be determined for qualitative and quantitative attributes.
2. It is not affected by extreme values.
Disadvantages :-
1. Mode is not applicable for further statistical analysis and algebraic calculation.
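A minimal sketch of these measures using Python's built-in statistics module; the data are the quiz marks from the median example above.

```python
import statistics

data = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]   # quiz marks from the example above

# Arithmetic, harmonic, and geometric means.
print(statistics.mean(data))              # sum / count = 6.8
print(statistics.harmonic_mean(data))     # n / sum of reciprocals
print(statistics.geometric_mean(data))    # nth root of the product

# Median: averages the two central elements when n is even.
print(statistics.median(data))            # (7 + 7) / 2 = 7.0

# Mode(s): multimode returns every value with the maximum frequency.
print(statistics.multimode(data))         # [8, 7, 5] -> trimodal
```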
 Define Inferential Statistics. List its categories.
Statistical inference is a method of making decisions about the parameters of a population, based on random
sampling.
Statistical inference mainly deals with two different kinds of problems: hypothesis testing and estimation of
parameter values.
Hypothesis testing is used to check whether a stated hypothesis is accepted or rejected.
Hypothesis testing can be classified as parametric tests and non-parametric tests.
There can be two hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha).
A Type I error occurs in hypothesis testing when a true null hypothesis is rejected.
A Type II error occurs in hypothesis testing when we accept a false null hypothesis.
Parametric tests are those tests for which we have prior knowledge of the population distribution.
Important parametric tests are z-test, t-test, ANOVA and chi-square test.
Non-parametric tests are used when information about the population is unknown and hence no assumptions can
be made regarding the population.
The task of estimation of parameter values involves making inferences from a given sample about an unknown
population parameter.
Parameter estimation methods are point estimates or interval estimates.

Inferential statistics is divided into two categories:
i) Hypothesis Testing: Parametric tests, Non-Parametric tests
ii) Parameter Estimation: Point Estimate, Interval Estimate

 What is Hypothesis Testing?


Hypothesis testing is an important inferential statistics technique that is widely used in data science.
A hypothesis is a testable claim, and only one of H0 and Ha can be supported.
The process of testing which of them holds is called the "Hypothesis Testing Process".
If we accept H0, Ha is automatically rejected, and vice versa.
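As an illustrative sketch, a one-sample t-test with SciPy; the sample data, the hypothesized mean of 7, and the 0.05 significance level are all assumptions for this example.

```python
from scipy import stats

# Hypothetical sample; H0: population mean = 7, Ha: population mean != 7.
sample = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]

t_stat, p_value = stats.ttest_1samp(sample, popmean=7)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Reject H0 only if p falls below the chosen significance level.
alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```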

 What is Outlier?
An outlier is a data point that differs significantly from other observations.
 Explain the measures of Dispersion in brief.

i. Range :-
The range of a data set is the difference between the largest (max) and the smallest (min) values in the set.
Range = Max – Min

ii. Standard Deviation :-


Standard deviation is found by taking the square root of the sum of squared deviations from the mean divided by the
number of observations in a given dataset: σ = √( Σ(xᵢ − μ)² / N ).

iii. Variance :-
Variance is calculated as the square of the standard deviation of a given data distribution.
Variance measures how far a data set is spread out.
It is mathematically defined as the average of the squared differences from the mean: σ² = Σ(xᵢ − μ)² / N.

iv. Interquartile Range :-


The interquartile range is calculated by finding the difference between the third quartile and the first quartile.
Interquartile Range = Q3 – Q1
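A minimal sketch of the four measures with NumPy (the data are illustrative; note that np.std and np.var divide by N, the population form, by default).

```python
import numpy as np

data = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5])

# Range: largest value minus smallest value.
print("range:", data.max() - data.min())    # 10 - 3 = 7

# Standard deviation and variance (variance = std dev squared).
print("std dev:", data.std())
print("variance:", data.var())

# Interquartile range: Q3 - Q1.
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)
```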

 Explain the methods of parameter estimation.


i) Point Estimate :-
Point estimators are functions that are used to find an approximate value of a population parameter from random
samples of the population.
They use the sample data of a population to calculate a point estimate or a statistic that serves as the best estimate
of an unknown parameter of a population.
ii) Interval Estimate :-
The interval estimation of a population parameter gives two values between which the population parameter is
likely to lie.
These two values define an interval within which the parameter value of the population falls with a given probability.
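As a hedged sketch, the sample mean serves as a point estimate and a 95% confidence interval serves as an interval estimate; the data and the 95% level are assumptions for this example.

```python
import numpy as np
from scipy import stats

sample = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5])

# Point estimate: the sample mean as the single best guess of the population mean.
point_estimate = sample.mean()
print("point estimate:", point_estimate)

# Interval estimate: a 95% confidence interval based on the t-distribution.
low, high = stats.t.interval(0.95, df=len(sample) - 1,
                             loc=point_estimate, scale=stats.sem(sample))
print(f"95% interval estimate: ({low:.2f}, {high:.2f})")
```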
 Describe Data Matrix vs Dissimilarity Matrix.
Data Matrix :-
Data matrix (or object-by-attribute structure):
This structure stores the n data objects in the form of a relational table, or an n × p matrix (n objects, p attributes).
Each row corresponds to an object.
Dissimilarity Matrix :-
It is also called object-by-object structure.
This structure stores a collection of proximities that are available for all pairs of n objects.
It is often represented by an n × n table, where d(i, j) is the measured dissimilarity or "difference" between objects i and j.
d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar to each other,
and grows larger the more they differ.
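A minimal sketch contrasting the two structures with SciPy; the small data matrix and the Euclidean metric are chosen only for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects (rows) by p = 2 attributes (columns).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [5.0, 6.0],
              [5.5, 6.5]])

# Dissimilarity matrix: an n x n table of pairwise distances, where
# D[i, j] = d(i, j), the diagonal is 0, and the matrix is symmetric.
D = squareform(pdist(X, metric="euclidean"))
print(D)
```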

 Explain the Types of Outliers.


Outliers can be of three types: global, contextual, and collective.
Global :-
A global outlier is a data point that strongly deviates from all the rest of the data points in the dataset.
Contextual :-
A data point is considered a contextual outlier if its value deviates significantly from the rest of the data points in the
same context.
Collective :-
A collection of data points that is anomalous with respect to the entire data set is a collective outlier.

 Explain the Outlier Detection Methods.


Outlier detection can be done using supervised, unsupervised, and semi-supervised methods, as well as statistical
methods.
Supervised methods :-
Supervised methods model data normality and abnormality.
Domain experts examine and label a sample of the underlying data by identifying normal data and outlier data.
Outlier detection can then be modeled as a classification problem.
The model is then trained to identify outliers on the labeled data.

Unsupervised :-
In some application scenarios, objects labeled as “normal” or "outlier" are not available.
Thus, an unsupervised learning method has to be used.
Unsupervised outlier detection methods make an implicit assumption that the normal objects are somewhat
"clustered."
Semi-supervised :-
Semi-supervised outlier detection methods use the available labeled normal objects, together with unlabeled objects
that are close by, to train a model for normal objects. The model of normal objects can then be used to detect
outliers: those objects not fitting the model are classified as outliers.
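As a hedged sketch of the unsupervised, statistics-based idea, the snippet below flags points far from the center of the data; the z-score threshold of 3 is a common convention assumed here, not a rule from these notes.

```python
import numpy as np

# Hypothetical univariate data containing one obvious global outlier (42).
data = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5, 42])

# Flag points whose z-score (distance from the mean, measured in
# standard deviations) exceeds the threshold.
z_scores = (data - data.mean()) / data.std()
print("outliers:", data[np.abs(z_scores) > 3])   # -> [42]
```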
