Data Science Module1
• Let's stay with our fishing-inspired example. You want to buy a certain model of fishing rod, but you only have a picture of it and don't know the brand name. An AI system is a software product that can examine your image and suggest the product name and the shops where you can buy it. To build such an AI product you need to use data mining, machine learning, and sometimes deep learning.
Questions:
1.Define Data Science.
2.Explain Data Disciplines in detail with example.
3.Define Machine Learning
4.Define Data Mining.
5.Define AI with example.
Handouts for Session 2: Big Data and Data Science Hype
1.1 Big Data and Data Science Hype
Data science is an interdisciplinary field that uses statistical and computational methods to extract insightful information and knowledge from data. The hype surrounding big data and data science has been quite significant in recent years, and not without reason. So, what is eyebrow-raising about big data and data science? Let's count the ways:
1. There is a lack of definitions around the most basic terminology. What is big data? What does data science mean? What is the relationship between big data and data science? Is data science the science of big data? Is data science only the stuff happening in big tech companies? Is big data a cross-disciplinary term (astronomy, finance, tech, etc.) while data science takes place only in tech? How big is Big in Big Data? In short, there is a lot of ambiguity!
2. There is a distinct lack of respect for researchers in academia and industry labs who have been working on this kind of data for years, and whose work builds on decades, or even centuries, of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. The media hype around this topic makes it seem as though machine learning algorithms were discovered last week and data was never big until Google came along. This is not true. Many of the techniques and methods we use now, and the challenges we face, are part of the evolution of everything that came before.
3. The hype is crazy: people throw around tired phrases straight out of the height of the pre-financial-crisis era, like "Masters of the Universe", to describe data scientists, and that doesn't bode well. There are new and exciting things happening as well, but one must respect the things that came before and led to what is happening today. The unreal hype has only increased the noise-to-signal ratio. The longer the hype goes on, the more of us will get turned off by it, and people will miss the real benefits of data science buried under all the hype. The terms have lost their basic meaning and have become so ambiguous that today they seem almost meaningless.
4. Statisticians already feel that they are working on the science of data and experience a sense of identity theft. However, data science is not just a rebranding of statistics or machine learning. It is a field in itself, unlike how the media makes it sound, as if it were merely statistics or machine learning in an industry context.
5. Data science may not be a science, as many people point out, but it definitely is more of a craft!
When you transition from academia to industry, you realize there is a gap between what you learned in college and what you do on the job – the industry-academia gap. Why does it have to be that way? Why are academic courses out of touch with reality?
The general experience of data scientists is that, on the job, they have access to a larger body of knowledge and methodology, as well as a process – data science.
There is a massive amount of data available from many aspects of our lives, and an abundance of inexpensive computing power. All our online activities – shopping, communication, news consumption, music preferences, search records, expression of opinions – are tracked.
There is datafication of our offline behavior as well, such as finance data, medical data, bioinformatics, and social welfare data. From the online and offline data put together, there is a lot to learn about our behavior as a species. There is a growing influence of data in most sectors and most industries; in some cases the amount collected is considered BIG, and in some cases it is not.
The massiveness of data makes it very interesting as well as challenging. Often the data itself becomes the building block of data products – Amazon's recommendation system, Facebook friend recommendations, movie recommendations on Netflix, music recommendations on Spotify.
In finance it is credit ratings, trading algorithms, and risk assessment. In e-learning it is dynamic, personalized learning and assessment – Khan Academy and Knewton. In government it is the creation of policies based on data. This is the beginning of a culturally saturated feedback loop.
Behavioral data about people changes the product, and the product changes the behavior of the people. Large-scale data processing, increased memory and bandwidth, and the cultural acceptance of technology in our lives make this possible in a way it was not a decade ago. Considering the impact of this feedback loop, we should seriously start
thinking about how it is being conducted, along with the ethical and technical responsibilities of the people responsible for the process.
Datafication is a process of taking all aspects of life and turning them into data – Kenneth Neil Cukier and Viktor Mayer-Schoenberger ("The Rise of Big Data", 2013). Everything we do, online or otherwise, ends up recorded for later examination in data storage units, or for sale – Facebook isn't free; we are the product. Google's AR glasses datafy gaze, Twitter datafies stray thoughts, LinkedIn datafies professional networks. Consider the importance of datafication with respect to people's intentions about sharing their own data. We are being datafied.
Our actions are being datafied. The spectrum ranges from us gleefully taking part in a social media experiment we are proud of, to all-out surveillance and stalking. When we "like" something online, we intend to be datafied, or at least we should expect to be. When we browse the web, we are unintentionally, or at least passively, being datafied via cookies we may or may not be aware of. When we walk around on the streets, we are being datafied by cameras. Once we datafy things, we can transform their purpose and turn the information into new forms of value.
Who is “We”? It is usually modelers and entrepreneurs profiting from getting
people to buy stuff. What kind of “Value”? Increased efficiency through
automation.
Data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it. Data science is the civil engineering of data: its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what's possible.
So, the statement is essentially saying that while hackers may be proficient at
writing code and solving technical problems, they may not necessarily have the
depth of knowledge or interest in the mathematical and statistical concepts that are
crucial to data science.
While statisticians may excel in theoretical aspects of data analysis, they may lack
the programming skills necessary to handle real-world data effectively.
In summary, while statistics is an important component of data science, data
science encompasses a broader set of skills and activities beyond statistical
analysis, including programming, data manipulation, and machine learning.
Questions:
1.Explain the process of datafication in detail.
2.Explain the current landscape of the Data Science process.
As we mentioned earlier, a data science team works best when different skills (profiles) are represented across different people, because nobody is good at everything. It makes us wonder whether it might be more worthwhile to define a "data science team" – as shown in Figure 1-3 – than to define a data scientist.
A data scientist should be able to communicate her findings with team members, engineers, and leadership in clear language and with data visualizations, so that even if her colleagues are not immersed in the data themselves, they will understand the implications.
Questions:
1.Explain the Skill Sets needed for the Data Scientist Profile.
The process of going from the world to the data, and then from the data back to the world, is the field of statistical inference. It is a discipline concerned with the development of procedures, methods, and theorems that allow the extraction of meaning and information from data generated by stochastic processes.
In the age of big data, we still need to take samples, because sampling solves some engineering problems. How much data is needed depends on the goal: for analysis or inference there is no need to store all the data all the time, whereas for serving purposes you may need all of it in order to render correct information.
Bias: if a sample of data is observed, it may carry an inherent bias, and the data may be representative only of that subset and not of the entire population; any conclusion or inference drawn from it should therefore not be extended to the entire population. Example: tweets pre- and post-Hurricane Sandy.
If the tweets immediately before Hurricane Sandy are analyzed, one would infer that most people went supermarket shopping. If the tweets immediately after Hurricane Sandy are analyzed, one would infer that most people went partying. Most tweets were from New Yorkers, who are heavy tweeters, and not from New Jerseyans. Coastal New Jerseyans were worried about their houses collapsing and did not have time to tweet.
If only this limited data were studied, the only conclusion one could draw is what Hurricane Sandy was like for a subset of Twitter users (who are not representative of the whole population), and one would infer that the hurricane was not that bad.
Types of Data
Traditional – numerical, categorical and binary
Text – Emails, Tweets, Reviews, News Articles
Volume – Data measured in terms of petabytes and exabytes (1 exabyte ≈ 1 million TB), made possible by the reduction in the cost of storage devices.
Velocity – The fast arrival speed of data and the increase in data volume, powered by IoT and high-speed internet.
Variety – Form: many forms of data – text, graph, audio, video, maps, composite (video with audio). Function: human conversations, transaction records, old archive data. Source: open/public data, social media data, multimodal data.
Validity – The accuracy of the data for taking decisions or for any other goals.
Value – The value of the information that is extracted from the data and its influence on the decisions that are taken based on it.
N = all is very often not a good assumption, and it misses the things we should consider the most.
For example: election day polls. Even if we poll everyone who leaves the polling stations, we still do not count the people who decided not to vote, and these may be exactly the people we need in order to understand the voting problems! Recommendations on Netflix may not be good, because the people who bother to rate shows may have different tastes, skewing the recommendation system towards the taste of the people who rated.
Data is not objective, and data does not speak for itself. Example: an algorithm for hiring. Consider an organization that did not treat female employees well. When comparing men and women with the same qualifications, the data showed that women tend to leave more often, get promoted less often, and give more negative feedback about the environment than men. An automated model based on this data will likely prefer a man over a woman if a man and a woman with the same qualifications turn up for the interview. Ignoring causation here can be a flaw rather than a feature, and can add to historical problems rather than address them. Data is just a quantitative representation of the events of our society.
The n = 1 assumption refers to a sample size of 1. For a single person, we can actually record a lot of information; we might even sample from all the actions they took in order to make inferences about them. This is used in user-level modeling.
Questions:
1.Explain Population with example
2.Explain Sample with example
3.Explain the different types of data in population and sampling of big data.
4.Explain in detail 5 elements of big data.
5.Explain the Big Data Revolution in detail.
For example: if there are two columns of data, x and y, and there is a linear relationship between them, then we can represent it as y = ax + b, where a and b are parameters whose values are not yet known. The figure below depicts the prediction of salary based on the experience of an employee.
To build a model we start with exploratory data analysis (EDA), which includes making plots and building intuition for a particular dataset. It involves plotting histograms and looking at scatter plots to get a feel for the data. Representative functions are then written down.
Start with a simple linear function and see if it makes sense. If it does not make sense, understand why, see what representative function would make more sense, and keep building up the complexity (e.g., go parabolic after linear). Write down complete sentences and try to express the words as equations and code. Simple plots may be easier to interpret and understand.
A trade-off is usually required during modeling: a simple model may get you 90% of the way and take a few hours to build, whereas a complex model may get you up to 92% and take months to build. A small sketch of this workflow follows below.
Questions:
It’s beyond the scope of the book to go into each of the distributions in detail, but
we provide them in the figure below as an illustration of the various common shapes,
and to remind you that they only have names because someone observed them
enough times to think they deserved names. There is actually an infinite number
of possible distributions. They are to be interpreted as assigning a probability to a
subset of possible outcomes, and have corresponding functions. For example, the
normal distribution is written as:
Normal Distribution
• N(x | μ, σ) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²))
• μ is the mean
• σ is the standard deviation
Example: Let x be the amount of time until the next bus arrives. x is a random
variable because there is variation and uncertainty in the amount of time until the
next bus. Suppose we know that the time until the next bus has a probability
density function of p(x) = 2e^(−2x). If we want to know the likelihood of the next bus arriving in between 12 and 13 minutes, then we find the area under the curve between 12 and 13 by evaluating ∫₁₂¹³ 2e^(−2x) dx. How do we know that the distribution is correct?
We can conduct an experiment where we show up at the bus stop at a random
time, measure how much time until the next bus, and repeat this experiment over
and over again. Then we look at the measurements, plot them, and approximate
the function. Because we are familiar with the fact that “waiting time” is a
common enough real-world phenomenon that a distribution called the exponential
distribution has been invented to describe it, we know that it takes the form p(x) =
λe^(−λx).
Joint probability is the probability of two events occurring simultaneously. Example: it rains and you wash the car on the same day.
Conditional probability is the probability of one event occurring given that a second event has occurred.
Example: in total there are 2 blue marbles and 3 red marbles, drawn without replacement. If a blue marble was selected first, there is now a 1/4 chance of getting a blue marble and a 3/4 chance of getting a red marble. If a red marble was selected first, there is now a 2/4 chance of getting a blue marble and a 2/4 chance of getting a red marble.
Fitting a model means estimating the parameters of the model using the observed data. The data is used as evidence to help approximate the real-world mathematical process that generated it. A good model fit refers to a model that accurately approximates the output when it is provided with unseen inputs. Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to get the parameters. When you estimate the parameters, they are actually estimators, meaning they themselves are functions of the data.
Fitting the model is when you start actually coding: your code will read in the
data, and you’ll specify the functional form that you wrote down on the piece of
paper.
Then R or Python will use built-in optimization methods to give you the most
likely values of the parameters given the data. Initially you should have an
understanding that optimization is taking place and how it works, but you don’t
have to code this part yourself—it underlies the R or Python functions.
The process involves running an algorithm on data for which the target variable ("labeled" data) is known, in order to produce a machine learning model. Then the model's outcomes are compared to the real, observed values of the target variable to determine its accuracy.
Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't good at capturing reality beyond your sampled data. Overfitting occurs when a model learns the training data too well, including its noise and outliers, to the extent that it performs poorly on unseen data. (A small sketch illustrating this follows the list of causes below.)
Causes of Overfitting:
Complex Models: Models with too many parameters relative to the size of the
training data can capture noise instead of underlying patterns.
Insufficient Data: When the amount of training data is limited, complex models
may find patterns where none exist due to randomness.
Feature Overfitting: Including irrelevant features or too many features in the
model can lead to overfitting.
Lack of Regularization: Without regularization techniques, such as L1 and L2 regularization, which penalize overly complex models, nothing discourages the model from becoming too complex.
Questions