DataScience Unit1 (+notes)
2004 Facebook
2005 YouTube
2010 Instagram
2011 Snapchat
Data Science Lifecycle
1 Business Problem
2 Data Acquisition
3 Data Preparation
4 Exploratory Data Analysis
5 Data Modeling
6 Visualization and Communication
7 Deployment and Maintenance
Business Problem
Data Acquisition
Web Servers
Logs
Databases
APIs
Online repository
Data Preparation
Data Cleaning
Transformation
Exploratory Data Analysis
Tableau
Power BI
QlikView
The Art of Data Science
"How Big Data is Changing the Whole Equation for Business," Wall Street Journal, March 8, 2013
Several "V"s of big data
Big Data
Big Data (Big Deal!)
Apache
Bigspark
Hadoop
Machine Learning
Each of these features can be combined by a classifier to provide evidence about whether an email is spam.
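As a concrete illustration, here is a minimal sketch (not from the original notes) of combining binary features in a classifier. The feature set and toy data are hypothetical, and scikit-learn's BernoulliNB is just one reasonable choice of classifier.

```python
# A minimal sketch of combining simple binary features in a spam classifier.
# Feature names and toy data are hypothetical illustrations.
from sklearn.naive_bayes import BernoulliNB

# Each row: [contains "free", contains "winner", unknown sender, has attachment]
X = [
    [1, 1, 1, 0],  # spam-like
    [1, 0, 1, 1],  # spam-like
    [0, 0, 0, 0],  # ham-like
    [0, 1, 0, 0],  # ham-like
]
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 1, 0, 0]]))        # predicted label for a new email
print(clf.predict_proba([[1, 1, 0, 0]]))  # combined evidence as probabilities
```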
Machine Learning Methods
❖ Supervised
❖ Unsupervised
❖ Semi-supervised
❖ Reinforcement
Supervised Learning
Supervised machine learning algorithms can apply what has been learned in
the past to new data using labeled examples to predict future events. Starting from
the analysis of a known training dataset, the learning algorithm produces an
inferred function to make predictions about the output values.
❖ The system can provide targets for any new input after sufficient training.
❖ The learning algorithm can also compare its output with the correct, intended
output and find errors in order to modify the model accordingly.
An Example: Supervised Learning
Semi-supervised Learning
❖ Semi-supervised algorithms fall between supervised and unsupervised learning: they train on a small amount of labeled data combined with a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy.
❖ Semi-supervised learning is usually chosen when labeling the acquired data requires skilled and relevant resources to train from it, whereas acquiring unlabeled data generally requires no additional resources.
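A hedged sketch of the idea: a handful of labeled points plus many unlabeled ones. scikit-learn's LabelPropagation is one of several semi-supervised methods, and the data here is synthetic.

```python
# Semi-supervised sketch: a few labeled points, many unlabeled (marked -1).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y_true = (X[:, 0] > 0).astype(int)

y = np.full(100, -1)   # -1 marks "unlabeled" for scikit-learn
y[:5] = y_true[:5]     # only 5 labeled examples

model = LabelPropagation().fit(X, y)
print("accuracy on all points:", (model.transduction_ == y_true).mean())
```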
Reinforcement Learning
A reinforcement learning algorithm is a method that interacts with its environment by producing actions and discovering errors or rewards.
Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in
an interactive environment by trial and error using feedback from its own actions and experiences.
❖ Trial and error search and delayed reward are the most relevant characteristics of reinforcement
learning.
❖ This method allows machines and software agents to automatically determine the ideal behavior within
a specific context in order to maximize its performance.
❖ Reinforcement learning uses rewards and punishment as signals for positive and negative behavior.
❖ Simple reward feedback is required for the agent to learn which action is best; this is known as the
reinforcement signal.
❖ Richard Sutton is widely regarded as the father of reinforcement learning.
Reinforcement Learning (Contd.)
It differs from supervised learning in that the machine is not trained with a sample data set; instead, it learns by trial and error. A series of right decisions strengthens the method as it better solves the problem.
As compared to unsupervised learning, reinforcement learning is different in terms of goals. While the
goal in unsupervised learning is to find similarities and differences between data points, in reinforcement
learning the goal is to find a suitable action model that would maximize the total cumulative reward of the
agent.
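To make trial-and-error learning concrete, below is a minimal sketch (not from the original notes) of an epsilon-greedy agent on a 3-armed bandit; the reward probabilities are invented for illustration.

```python
# Reinforcement-learning sketch: an epsilon-greedy agent learns by trial
# and error which of 3 actions yields the most reward.
import numpy as np

rng = np.random.default_rng(42)
true_reward_prob = np.array([0.2, 0.5, 0.8])  # hidden from the agent
q = np.zeros(3)        # estimated value of each action
counts = np.zeros(3)
epsilon = 0.1          # exploration rate

for step in range(1000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))   # explore: try a random action
    else:
        action = int(np.argmax(q))      # exploit: best action so far
    # Simple reward feedback: the reinforcement signal.
    reward = float(rng.random() < true_reward_prob[action])
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]  # incremental mean

print("estimated action values:", q.round(2))  # approach the true probabilities
```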
The Data Science Inference Stack
Data science is a field in which theories are implemented using data, some of it big data. This is embodied in an inference stack comprising (in sequence): theories, models, intuition, causality, prediction, and correlation.
Theories:
Theories are statements of how the world should be or is, and are derived from axioms (assumptions about the world) or from precedent theories.
Models:
• Models are implementations of theory, and in data science are often algorithms
based on theories that are run on data.
• The academic Thomas Davenport writes that models are key, and should not be eschewed as data volumes increase.
The Bell-Shaped Curve: Gaussian (Normal) Distribution
This curve is symmetric around the mean; the mean, median, and mode are all the same. The normal distribution describes how the values of a variable are distributed. It is typically a symmetric distribution where most observations cluster around the central peak, and values further away from the mean taper off equally in both directions.
Applications of Data Science
E-commerce
Recommend products to customers
Education
Explore current trends and find the latest courses as per industry needs
Collect student feedback
Understand student requirements
Internet Search
Take the user's query
Provide results
Show relevant recommendations
Advertising
Post ads on websites
Identify targeted customers and recommend products to them
Recommendation
Products, Entertainment (video streaming, music)
Predictive Modeling
Airline companies
Intuition
The results of running a model lead to intuition, i.e., a deeper understanding of the world based on theory, model, and data.
Causality
Once we have established intuition for the results of a model, it remains to be seen whether the relationships we observe are causal, predictive, or merely correlational. Theory may be causal and tested as such. Granger (1969) causality is often stated in mathematical form for two stationary time series of data as follows. X is said to Granger-cause Y if, in the equation system

$$Y_t = a_1 + b_1 Y_{t-1} + c_1 X_{t-1} + e_{1t}$$
$$X_t = a_2 + b_2 Y_{t-1} + c_2 X_{t-1} + e_{2t}$$

the coefficient $c_1$ is significant and $b_2$ is not significant. Then X causes Y, but not vice versa. Causality is a hard property to establish, even with theoretical foundation, as the causal effect has to be well entrenched in the data.
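A hedged sketch of running this test in Python: statsmodels provides grangercausalitytests, and the synthetic series below (where x leads y by one period) is invented for illustration.

```python
# Granger-causality sketch using statsmodels on synthetic data.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()  # x leads y

# Column order matters: the test asks whether the 2nd column
# Granger-causes the 1st.
data = np.column_stack([y, x])
grangercausalitytests(data, maxlag=1)  # prints F-tests and p-values
```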
Correlation
Finally, there is correlation, at the end of the data science inference chain. Contemporaneous movement between two variables is quantified using correlation. In many cases, we uncover correlation, but no prediction or causality.
Correlation has great value to firms attempting to tease out beneficial information from big data. And even though correlation captures only a linear relationship between variables, it lays the groundwork for uncovering nonlinear relationships, which are becoming easier to detect with more data.
Exponentials, Logarithms, and Compounding
• It is fitting to begin with the fundamental mathematical constant, e = 2.718281828..., which underlies the exponential function exp(·). We often write this function as $e^x$, where x can be a real or complex variable. Given $y = e^x$, a fixed change in x results in the same continuous percentage change in y. This is because $\ln(y) = x$, where ln(·) is the natural logarithm, the inverse of the exponential function.
• The constant e is defined as the limit of a specific function:
$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$$
• Exponential compounding is the limit of discrete compounding over successively shorter intervals. Given a horizon of t years divided into n intervals per year, one dollar compounded from time zero to time t at per-annum rate r may be written as $\left(1 + \frac{r}{n}\right)^{nt}$.
• Continuous compounding is the limit of this expression as the number of periods n goes to infinity:
$$\lim_{n \to \infty} \left(1 + \frac{r}{n}\right)^{nt} = \lim_{n \to \infty} \left[\left(1 + \frac{1}{n/r}\right)^{n/r}\right]^{tr} = e^{rt}$$
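A quick numerical check of this limit, with an illustrative rate and horizon:

```python
# Discrete compounding (1 + r/n)**(n*t) approaches e**(r*t) as n grows.
import math

r, t = 0.05, 10.0  # illustrative per-annum rate and horizon in years
for n in (1, 12, 365, 100_000):
    print(n, (1 + r / n) ** (n * t))
print("e^(rt) =", math.exp(r * t))
```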
Normal Distribution
This distribution is the workhorse of many models in the social sciences, and is
assumed to generate much of the data that comprises the Big Data universe.
Interestingly, most phenomena (variables) in the real world are not normally
distributed. They tend to be “power law” distributed, i.e., many observations of low
value, and very few of high value. The probability distribution declines from left to right
and does not have the characteristic hump shape of the normal distribution.
Still, we do need to learn about the normal distribution because it is important in statistics, and the central limit theorem governs much of the data we look at. Examples of approximately normally distributed data are stock returns and human heights.
If $x \sim N(\mu, \sigma^2)$, that is, x is normally distributed with mean $\mu$ and variance $\sigma^2$, then the probability density function for x is:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right]$$
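A small check of this density formula against scipy's implementation, with illustrative parameter values:

```python
# The hand-coded normal density should match scipy.stats.norm.pdf.
import numpy as np
from scipy.stats import norm

mu, sigma, x = 0.0, 2.0, 1.5  # illustrative values
by_hand = np.exp(-0.5 * (x - mu) ** 2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
print(by_hand, norm.pdf(x, loc=mu, scale=sigma))  # should agree
```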
Poisson Distribution
The Poisson is also known as the rare-event distribution. Its density function is:
$$f(n; \lambda) = \frac{e^{-\lambda}\lambda^n}{n!}$$
where there is only one parameter, the mean λ. The density function is over discrete values of n, the number of occurrences given the mean number of outcomes λ. The mean and variance of the Poisson distribution are both λ. The Poisson is a discrete-support distribution, with a range of values n = {0, 1, 2, ...}.
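A similar check for the Poisson mass function and its equal mean and variance, with an illustrative λ:

```python
# Hand-coded Poisson probabilities should match scipy.stats.poisson.pmf.
import math
from scipy.stats import poisson

lam = 3.0  # illustrative mean rate
for n in range(5):
    by_hand = math.exp(-lam) * lam**n / math.factorial(n)
    print(n, by_hand, poisson.pmf(n, lam))             # should agree
print("mean, var:", poisson.stats(lam, moments="mv"))  # both equal lam
```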
Moments of a Continuous Random Variable
The following formulae are useful to review because any analysis of data begins with descriptive statistics, and the following statistical "moments" are computed in order to get a first handle on the data. Given a random variable x with probability density function f(x), the first moment is:
$$\text{Mean (first moment or average)} = E(x) = \int x f(x)\,dx$$
In like fashion, powers of the variable result in higher (nth-order) moments. These are "non-central" moments, i.e., moments of the raw random variable x, not of its deviation from its mean, [x − E(x)].
$$n\text{th moment} = E(x^n) = \int x^n f(x)\,dx$$
Central moments are moments of demeaned random variables. The second central moment is the variance:
$$\text{variance} = Var(x) = E[x - E(x)]^2 = E(x^2) - [E(x)]^2$$
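These integrals can be evaluated numerically; a sketch using the standard normal density as a check (its mean is 0 and variance 1):

```python
# Computing the first two moments of a density by numerical integration.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mean = quad(lambda x: x * norm.pdf(x), -np.inf, np.inf)[0]
second = quad(lambda x: x**2 * norm.pdf(x), -np.inf, np.inf)[0]
print("mean:", mean, "variance:", second - mean**2)  # ~0 and ~1
```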
Moments of a Continuous Random Variable (Contd.)
• The standard deviation is the square root of the variance, i.e., $\sigma = \sqrt{Var(x)}$. The third central moment, normalized by the standard deviation raised to a suitable power, is the skewness:
$$\text{skewness} = \frac{E[x - E(x)]^3}{Var(x)^{3/2}}$$
• The absolute value of skewness relates to the degree of asymmetry in the probability density. If more extreme values occur to the left than to the right, the distribution is left-skewed; vice versa, it is right-skewed.
• Correspondingly, the fourth central, normalized moment is the kurtosis:
$$\text{kurtosis} = \frac{E[x - E(x)]^4}{[Var(x)]^2}$$
• Kurtosis of the normal distribution has value 3. We define "excess kurtosis" as kurtosis minus 3. When a probability distribution has positive excess kurtosis we call it "leptokurtic"; such distributions have fatter tails (on either or both sides) than the normal distribution.
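A sketch of these sample moments with scipy; the Laplace distribution is chosen here because it is symmetric (skewness near 0) but leptokurtic (excess kurtosis near 3):

```python
# Sample skewness and excess kurtosis of a fat-tailed (Laplace) sample.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
sample = rng.laplace(size=100_000)
print("skewness:", skew(sample))             # near 0 (symmetric)
print("excess kurtosis:", kurtosis(sample))  # scipy subtracts 3 by default
```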
Combining Random Variables
• Since we often have to deal with composites of random variables, i.e., more than one random variable, we review here some simple rules for moments of combinations of random variables. There are several other expressions for the same equations, but we examine just the ones we will use most frequently.
• First, means are additive and scalable:
$$E(ax + by) = aE(x) + bE(y)$$
where x, y are random variables and a, b are scalar constants.
• The variance of scaled, summed random variables is:
$$Var(ax + by) = a^2 Var(x) + b^2 Var(y) + 2ab\,Cov(x, y)$$
• The covariance and correlation between two random variables are:
$$Cov(x, y) = E(xy) - E(x)E(y), \qquad Corr(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x)\,Var(y)}}$$
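A quick numerical verification of the variance rule on correlated synthetic data:

```python
# Check Var(ax + by) = a^2 Var(x) + b^2 Var(y) + 2ab Cov(x, y).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = 0.6 * x + rng.normal(size=1_000_000)  # correlated with x
a, b = 2.0, -1.5

lhs = np.var(a * x + b * y)
cov = np.cov(x, y, ddof=0)[0, 1]
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * cov
print(lhs, rhs)  # should agree closely
```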
Vector Algebra
• We will be using linear algebra in many of the models. Linear algebra requires the manipulation of
vectors and matrices. We will also use vector calculus. Vector algebra and calculus are very powerful
methods for tackling problems that involve solutions in spaces of several variables, i.e., in high
dimension.
• Rather than work with an abstract exposition, it is better to introduce ideas using an example. We’ll
examine the use of vectors in the context of stock portfolios. We define the returns for each stock in
a portfolio as:
$$R = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_N \end{bmatrix}, \qquad U = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
• The use of this unit vector will become apparent shortly, but it will be used in myriad ways and is a
useful analytical object.
Vector Algebra (Contd.)
• A portfolio vector is defined as a set of portfolio weights, i.e., the fraction of the portfolio that is
invested in each stock:
$$W = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}$$
• The total of portfolio weights must add up to 1:
$$\sum_{i=1}^{N} w_i = 1, \qquad W'\mathbf{1} = 1$$
• Pay special attention to the line above. It describes the sum of portfolio weights in two ways: the first uses summation notation, and the second uses a simple vector-algebraic statement, i.e., that the transpose of w, denoted w′, times the unit vector 1 equals 1.
• The two elements on the left-hand side of the equation are vectors, and the 1 on the right-hand side is a scalar. The dimension of w′ is (1 × N) and the dimension of 1 is (N × 1), and a (1 × N) vector multiplied by an (N × 1) vector results in a (1 × 1) matrix, i.e., a scalar.
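A tiny numpy sketch of the identity, with hypothetical weights:

```python
# The transpose-times-unit-vector form w'1 equals the summation form.
import numpy as np

w = np.array([0.4, 0.35, 0.25])  # hypothetical portfolio weights
ones = np.ones_like(w)           # the unit vector
print(w @ ones)                  # w'1 = 1.0
print(w.sum())                   # same sum, in summation notation
```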
Statistical Regression
• Consider a multivariate regression where a stock's returns $R_i$ are regressed on several market factors $R_k$:
$$R_{it} = \sum_{j=0}^{k} \beta_{ij} R_{jt} + e_{it}, \quad \forall i$$
where t = {1, 2, ..., T} (i.e., there are T items in the time series), there are k independent variables, and usually j = 0 denotes the intercept. We could also write this as:
$$R_{it} = \beta_{i0} + \sum_{j=1}^{k} \beta_{ij} R_{jt} + e_{it}, \quad \forall i$$
• Compactly, using vector notation, the same regression may be written as $R_i = R_k \beta_i + e_i$, where $R_i, e_i \in \mathbb{R}^T$, $R_k \in \mathbb{R}^{T \times (k+1)}$, and $\beta_i \in \mathbb{R}^{k+1}$. If there is an intercept in the regression, then the first column of $R_k$ is 1, the unit vector. Without providing a derivation, you should know that each regression coefficient is:
$$\beta_{ik} = \frac{Cov(R_i, R_k)}{Var(R_k)}$$
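A sketch of the regression in numpy on synthetic returns (the factor loading 1.2 and noise level are invented); for a single factor, the least-squares slope matches the Cov/Var formula:

```python
# Least-squares regression of returns on one factor, plus the Cov/Var check.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
rk = rng.normal(size=T)                               # one market factor
ri = 0.01 + 1.2 * rk + rng.normal(scale=0.5, size=T)  # stock returns

X = np.column_stack([np.ones(T), rk])   # first column is the unit vector
beta, *_ = np.linalg.lstsq(X, ri, rcond=None)
print("intercept, slope:", beta)

print("Cov/Var:", np.cov(ri, rk, ddof=0)[0, 1] / np.var(rk))  # ~ the slope
```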
Diversification
• It is useful to examine the power of using vector algebra with an application. Diversification occurs
when we increase the number of non-perfectly correlated stocks in a portfolio, thereby reducing
portfolio variance. In order to compute the variance of the portfolio we need to use the portfolio
weights w and the covariance matrix of stock returns R, denoted Σ. We first write down the formula
for a portfolio’s return variance:
$$Var(w'R) = w'\Sigma w = \sum_{i=1}^{n} w_i^2 \sigma_i^2 + \sum_{i=1}^{n}\sum_{j=1, j\neq i}^{n} w_i w_j \sigma_{ij}$$
• Readers are strongly encouraged to implement this by hand for n = 2 to convince themselves that the vector form of the expression for variance, w'Σw, is the same as the long form on the right-hand side of the equation above. If returns are independent, then the formula collapses to:
$$Var(w'R) = w'\Sigma w = \sum_{i=1}^{n} w_i^2 \sigma_i^2$$
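Implementing the suggested check in numpy, with a hypothetical 3-stock covariance matrix:

```python
# The quadratic form w' Sigma w equals the double-sum long form.
import numpy as np

w = np.array([0.5, 0.3, 0.2])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])

quad_form = w @ Sigma @ w
long_form = sum(w[i] * w[j] * Sigma[i, j] for i in range(3) for j in range(3))
print(quad_form, long_form)  # identical
```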
Matrix Equations
• Here we examine how matrices may be used to represent large systems of equations compactly, and also to solve them. Consider the following system Aw = B, written out in long form:
$$\begin{bmatrix} 3 & 2 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$
• Find the solution values $w_1$ and $w_2$ by hand. Then we may compute the solution for w by "dividing" B by A. This is not regular division, because A and B are matrices; instead, we multiply the inverse of A (its "reciprocal") by B.
• The inverse of A is:
$$A^{-1} = \begin{bmatrix} 0.500 & -0.250 \\ -0.250 & 0.375 \end{bmatrix}$$
• Now compute by hand:
$$w = A^{-1}B = \begin{bmatrix} 0.50 \\ 0.75 \end{bmatrix}$$
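The same computation in numpy; np.linalg.solve is generally preferred over forming the inverse explicitly, but both match the hand computation:

```python
# Solve Aw = B and compare with the explicit-inverse route.
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 4.0]])
B = np.array([3.0, 4.0])

print(np.linalg.inv(A))       # [[0.5, -0.25], [-0.25, 0.375]]
print(np.linalg.inv(A) @ B)   # [0.5, 0.75]
print(np.linalg.solve(A, B))  # same solution, numerically more stable
```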
Inter and Intra Cluster
Das, S. R., & DAS, S. (2016). Data science: theories, models, algorithms, and
analytics. Learning, 143, 145.
Sharaff, A., & Sinha, G. R. (Eds.). (2021). Data Science and Its Applications. CRC
Press.
Van Der Aalst, W. (2016). Data science in action. In Process mining (pp. 3-23).
Springer, Berlin, Heidelberg.
Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know
about data mining and data-analytic thinking. " O'Reilly Media, Inc.".