Introduction
Data Science
Swati Chopade
VJTI, Mumbai
September 14, 2022
Swati Chopade Data Science
Introduction Modelling Data
Outline
1 Introduction
Modelling Data
Modelling Data
Modelling data
Let us consider patient data containing different readings for
each patient, such as blood sugar level, cholesterol level and
so on.
Let’s say the blood sugar level is stored in the column named
‘x1’ and the data is sorted in ascending order.
This data is clustered around a value close to 117, with some
smaller values and some very large values. So, this data looks
like a normal distribution.
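As a rough illustration of the check described above, the sketch below summarises a sorted column of readings. The data is synthetic (drawn around 117 with an assumed spread of 15), and the variable name x1 simply mirrors the column name used in the slide; none of this is real patient data.

```python
import random
import statistics

# Synthetic stand-in for the blood sugar column 'x1' (not real patient data).
# The centre 117 comes from the slide; the spread of 15 is an assumption.
rng = random.Random(0)
x1 = sorted(rng.gauss(117, 15) for _ in range(1000))

mean = statistics.mean(x1)
median = statistics.median(x1)
stdev = statistics.stdev(x1)

# For a normal distribution, roughly 68% of the values lie within one
# standard deviation of the mean, and the mean and median nearly coincide.
within_one_sd = sum(abs(v - mean) <= stdev for v in x1) / len(x1)

print(f"mean={mean:.1f} median={median:.1f} stdev={stdev:.1f}")
print(f"share within one standard deviation: {within_one_sd:.2f}")
```

A mean close to the median and a share near 0.68 are consistent with a normal shape, although they do not prove it.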
Statistical Modelling: Underlying data distribution
For example, suppose we are dealing with patient data containing
different readings for each patient, such as blood sugar level,
pulse rate and cholesterol level, and there is a new drug on the
market for certain medical conditions related to sugar level or
cholesterol level.
A doctor is interested in knowing how effective this new drug
is. To answer this question thoroughly, a randomized controlled
experiment is required.
You take a sample of patients (the subjects) to whom the drug
will be administered. In statistical modelling, we assume simple
statistical models, which allow robust statistical analysis and
give statistical guarantees (p-values, goodness-of-fit tests).
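The random assignment step of such an experiment might look like the minimal sketch below. The function name randomize, the patient identifiers and the even split into a treatment group and a control group are assumptions made for illustration; the slides do not specify a particular design.

```python
import random

def randomize(patient_ids, seed=0):
    """Randomly split patients into a treatment group and a control group."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # fixed seed only for reproducibility
    half = len(ids) // 2
    return ids[:half], ids[half:]

# Hypothetical sample of 20 patients, identified by the numbers 1 to 20.
treatment, control = randomize(range(1, 21))
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because the split is random, differences observed later between the two groups are less likely to be caused by how the patients were chosen.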
Figure: Data Modelling
Figure: Data Modelling
We administer this new drug for a duration of, say, 3 months
and note down the readings again. Suppose the readings again
follow a normal distribution, as depicted in the image below; we
can now see that the mean has shifted towards the lower side:
Figure: Data Modelling
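One way to check whether the mean has really shifted is to compare the readings of the same patients before and after the treatment. The sketch below uses synthetic data with an assumed average drop of 10 points, and it uses a normal approximation in place of the exact t distribution, which is reasonable only because the assumed sample is fairly large.

```python
import math
import random
import statistics

rng = random.Random(1)
n = 200  # assumed sample size

# Synthetic readings: before treatment, centred near 117.
before = [rng.gauss(117, 15) for _ in range(n)]
# Assumed effect: an average drop of 10 points with patient-level noise.
after = [b - rng.gauss(10, 5) for b in before]

# Paired differences: positive values mean the reading went down.
diffs = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(n)
z = mean_diff / std_err

# One-sided p-value under the null hypothesis "no drop", using the
# normal approximation to the t distribution.
p_value = 1 - statistics.NormalDist().cdf(z)

print(f"mean drop = {mean_diff:.2f}, z = {z:.1f}, p-value = {p_value:.3g}")
```

A very small p-value says that a drop of this size would be very unlikely if the drug had no effect.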
The data scientist says that the data follows a normal
distribution, so the mean and variance are sufficient to describe
this data. We are looking for a robust statement such as "I am
99% sure that the drug is effective."
You also have to find the underlying relationships in the data,
such as the relationship between the blood sugar level and the
number of days of treatment. The data scientist can say that
there is a linear relationship between the blood sugar level
and the number of days: the level decreases as the number of
days increases.
The data scientist should be able to say "I am 99% sure that the
sugar level drops by 3 points for each day of the treatment."
Such a robust statement is used to argue whether or not this
drug should be used.
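A statement such as "the sugar level drops by 3 points per day" corresponds to the slope of a fitted line, and the "99% sure" part corresponds to a confidence interval around that slope. The sketch below fits a least-squares line to synthetic data; the starting level of 250, the 30-day window and the noise level are assumptions, and the interval uses a normal approximation.

```python
import math
import random
import statistics

rng = random.Random(2)
days = list(range(30))
# Assumed ground truth: 3 points lost per day, plus measurement noise.
sugar = [250 - 3 * d + rng.gauss(0, 5) for d in days]

n = len(days)
mean_x = statistics.mean(days)
mean_y = statistics.mean(sugar)
sxx = sum((x - mean_x) ** 2 for x in days)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, sugar))

# Ordinary least-squares estimates of slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Standard error of the slope from the residual variance.
residuals = [y - (intercept + slope * x) for x, y in zip(days, sugar)]
sigma2 = sum(r * r for r in residuals) / (n - 2)
slope_se = math.sqrt(sigma2 / sxx)

# Approximate 99% confidence interval for the slope.
z = statistics.NormalDist().inv_cdf(0.995)
ci_low, ci_high = slope - z * slope_se, slope + z * slope_se

print(f"slope = {slope:.2f} points/day, 99% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

If the whole interval lies below zero, the data supports the claim that the level decreases with each day of treatment.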
Figure: Data Modelling
Figure: Data Modelling
Figure: Data Modelling
Figure: Data Modelling
Algorithmic Modelling
An alternative approach to modelling is algorithmic modelling,
which is, loosely speaking, machine learning modelling. In
statistical modelling, we use very simplified models of the data,
because the statistical guarantees limit the models you can use.
You cannot use a very complex model; for example, we cannot say
that the relationship between input and output is the log of the
cube of the sine of e raised to x, because statistical analysis
of such a model is not feasible.
But in the real world, the relationship is much more complex
and depends on many factors that we are not considering. In
such cases, we need the alternative approach: build complex
models.
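To see why a simple model can fall short, the sketch below fits a straight line to data generated by a nonlinear function and measures how much of the variation the line explains. The function sin(2x) is only a convenient stand-in for a complex relationship; it is not taken from the slides.

```python
import math
import statistics

# Noise-free data from an assumed nonlinear relationship y = sin(2x).
xs = [2 * math.pi * i / 199 for i in range(200)]
ys = [math.sin(2 * x) for x in xs]

# Least-squares straight line through the data.
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R squared: the share of the variation in y explained by the line.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"R^2 of the straight-line fit: {r_squared:.2f}")
```

A low R squared here shows that the line misses most of the structure, even though the data contains no noise at all.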
Figure: Algorithm Modelling
Machine learning allows a large family of very complex
functions, so relationships can be modelled with very complex
functions.
What is the goal of machine learning? The goal is to estimate
the function f using data and optimization techniques. ML
lets you choose very complex functions to represent the
relationships between the variables of the data. For a new
patient, plug in the value of x (age, weight, blood pressure) to
get y. The focus is on the prediction, not on the underlying
phenomena: we are not interested in knowing how much blood
sugar depends on age, weight or blood pressure. What matters is
that the predicted answer is very close to the true answer,
i.e. the prediction should be very accurate.
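The idea of estimating f from data with an optimization technique can be sketched with plain gradient descent. In the example below, the "true" relationship sin(3x), the cubic polynomial model, the learning rate and the number of steps are all assumptions chosen for illustration; a real ML model would usually be far more flexible.

```python
import math
import random

rng = random.Random(3)
# Synthetic training data from a hypothetical "true" f, plus noise.
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [math.sin(3 * x) + rng.gauss(0, 0.1) for x in xs]

def features(x):
    # Cubic polynomial features; the model is a weighted sum of these.
    return [1.0, x, x ** 2, x ** 3]

def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))

def mse(w):
    # Mean squared error of the model on the training data.
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = [0.0] * 4
initial_loss = mse(w)
lr = 0.1  # learning rate (assumed)
for _ in range(2000):
    grad = [0.0] * 4
    for x, y in zip(xs, ys):
        err = predict(w, x) - y
        for j, fj in enumerate(features(x)):
            grad[j] += 2 * err * fj / len(xs)
    # Gradient descent step: move the weights against the gradient.
    w = [wi - lr * gi for wi, gi in zip(w, grad)]
final_loss = mse(w)

print(f"loss before: {initial_loss:.3f}, loss after: {final_loss:.3f}")
print(f"prediction for a new x = 0.5: {predict(w, 0.5):.2f}")
```

The fitted weights are not interpreted here; only the quality of the prediction for a new x matters, which matches the focus on prediction described above.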
Figure: Algorithmic Modelling
Figure: Algorithmic Modelling
Figure: Algorithmic Modelling
Difference between Statistical Modelling and Algorithmic
Modelling
Figure: Algorithmic Modelling
Statistical Modelling
Figure: Statistical Modelling
Algorithmic Modelling: DL
When you have large amounts of high-dimensional data and
you want to learn very complex relationships between the input
and the output, use a specific class of complex ML models and
algorithms collectively referred to as deep learning (DL).
For example, consider a retina image of size 256*256 from which
you want to predict whether the patient is suffering from
diabetic retinopathy: use deep learning. Why is DL popular? A
large amount of data is available in many scenarios in the form
of text, speech and images, and the relationships between the
variables are very complex.
In addition, there are good software frameworks such as PyTorch
and much better computers.
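To get a feel for what "high dimensional" means for the 256*256 retina image, the short calculation below counts the inputs and the parameters of a very small fully connected network. The single-channel (grayscale) image, the hidden layer of 128 units and the single output are assumptions for illustration; practical deep learning models for images are typically structured differently.

```python
# One input per pixel, assuming a single-channel (grayscale) image.
height, width = 256, 256
n_inputs = height * width

# Hypothetical network: one hidden layer, one output (disease or not).
hidden_units = 128
hidden_params = n_inputs * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                    # weights + bias
total_params = hidden_params + output_params

print(f"inputs per image: {n_inputs}")
print(f"trainable parameters: {total_params}")
```

Even this tiny network has millions of parameters to estimate, which is one reason large datasets and fast computers matter for deep learning.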
Figure: Algorithmic Modelling
Figure: Algorithmic Modelling
Figure: Algorithmic Modelling
Thank You