DATA SCIENCE AND VISUALIZATION(21CS644)

Module 1- Introduction to Data Science


Module 1 Syllabus

Introduction: What is Data Science? Big Data and Data Science hype – and getting past the hype, Why now? – Datafication, Current landscape of perspectives, Skill sets needed.

Statistical Inference: Populations and samples, Statistical modelling, probability distributions, fitting a model.

Handouts for Session 1: What is Data Science?


1.0 Data Science
 Data science is an interconnected field that involves the use of statistical and
computational methods to extract insightful information and knowledge from
data.
All data disciplines in a nutshell
 Data science is the broad scientific study that focuses on making sense of data. Ex: using data to decide whether to release a movie at Christmas or at New Year.
 Data mining is commonly a part of the data science pipeline. But unlike data science as a whole, data mining is more about the techniques and tools used to uncover previously unknown patterns in data and make the data more usable for analysis.
 Ex: studying the movies released over the past two years and the profit each earned, to uncover such patterns.
 Machine learning aims at training machines on historical data so that they can process new inputs based on learned patterns without explicit programming, i.e., without manually written-out instructions for a system to perform an action.
 Deep learning is the most hyped branch of machine learning. It uses complex deep neural network algorithms inspired by the way the human brain works. DL models can draw accurate conclusions from large volumes of input data without being told which data characteristics to look at.
 Ex: imagine you need to determine which movies generate positive online reviews on your website and which generate negative ones. Deep neural nets can extract meaningful characteristics from the reviews and perform sentiment analysis (a minimal classifier sketch follows this list).
 Artificial intelligence is a complex topic. But for the sake of simplicity, let’s
say that any real-life data product can be called AI.


• Consider a fishing-inspired example: you want to buy a certain model of fishing rod, but you only have a picture of it and don't know the brand name. An AI system is a software product that can examine your image and suggest the product name and shops where you can buy it. To build such an AI product you typically need data mining, machine learning, and sometimes deep learning.
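To make the sentiment-analysis example above concrete, here is a minimal sketch. It is not a deep-learning pipeline: it uses scikit-learn's TfidfVectorizer and LogisticRegression as a lightweight stand-in, and the tiny review list and labels are invented for illustration only.

# Minimal sentiment-classification sketch (toy data; a shallow model standing in
# for the deep neural network described above).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["loved the movie, great acting", "brilliant plot and visuals",
           "terrible pacing, fell asleep", "worst film I have seen this year"]
labels = [1, 1, 0, 0]  # 1 = positive review, 0 = negative review (invented)

vectorizer = TfidfVectorizer()              # turn raw text into numeric features
X = vectorizer.fit_transform(reviews)
model = LogisticRegression().fit(X, labels)

new_review = ["great visuals but terrible acting"]
print(model.predict(vectorizer.transform(new_review)))  # predicted sentiment label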

Questions:
1.Define Data Science.
2.Explain Data Disciplines in detail with example.
3.Define Machine Learning
4.Define Data Mining.
5.Define AI with example.
Handouts for Session 2: Big Data and Data Science Hype
1.1 Big Data and Data Science Hype
Data science is an interconnected field that involves the use of statistical and computational methods to extract insightful information and knowledge from data. The hype surrounding big data and data science has been quite significant in recent years, and it is not without reason. So, what is eyebrow-raising about big data and data science? Let's count the ways:

1. There’s a lack of definitions around the most basic terminology. What is big data? What does data science mean? What is the relationship between big data and data science? Is data science the science of big data? Is data science only the stuff happening in big tech companies? Is big data cross-disciplinary (astronomy, finance, tech, etc.) while data science takes place only in tech? How big is Big in Big Data? In short, there is a lot of ambiguity!
2. There is a distinct lack of respect for researchers in academia and industry labs
who have been working on this kind of data for years and whose work is based
on decades or centuries of work by statisticians, computer scientists,
mathematicians, engineers and scientists of all types. There is a lot of Media
Hype about this topic which makes it seem like Machine Learning algorithms
were just discovered last week and data was never big until Google came
along. This is not true. Many techniques and methods we are using now and
the challenges we are facing are a part of the evolution of everything that’s
come before.
3. The hype is crazy—people throw around tired phrases straight out of the height of the pre-financial-crisis era, like “Masters of the Universe”, to describe data scientists, and that doesn’t bode well. There are new and exciting things happening as well, but one must respect the things that came before and led to what is happening today. The unreal hype has simply increased the noise-to-signal ratio. The longer the hype goes on, the more of us will get turned off by it, and people will miss the real benefits of data science buried under all the hype. The terms have lost their basic meaning and are now too ambiguous, to the point of seeming meaningless.
4. Statisticians already feel that they have been working on the science of data, and so have a sense of identity theft. However, data science is not just a rebranding of statistics or machine learning; it is a field by itself, unlike how the media makes it sound—as if it were just statistics or machine learning in an industry context.
5. Data Science may not be science as most people say, but it definitely is more
of a craft!

1.2 Getting Past the Hype

 When you transition from academia to industry, you realize there is a gap between
things learned in college and what you do on the job – Industry-Academia Gap.
Why does it have to be that way? Why are academic courses out of touch with
reality?
 The general experience of data scientists is that, at their jobs, they have access to a larger body of knowledge and methodology, as well as a process – the Data Science Process – which has foundations in both statistics and computer science. It is something new, fragile and nascent, and it is at real risk of being rejected prematurely because of all the unrealistic hype.

1.3 Why Is Data Required Now?

 There is availability of massive amounts of data from many aspects of our lives, and there is an abundance of inexpensive computing power. All our activities online – shopping, communication, news consumption, music preferences, search records, expression of opinions – are tracked.
 There is datafication of our offline behaviour as well, such as finance data, medical data, bioinformatics and social welfare. From the online and offline data put together, there is a lot to learn about behaviour and who we are as a species. There is a growing influence of data in most sectors and most industries; in some cases the amount collected is considered to be BIG, and in some cases it is not.
 The massiveness of data makes it very interesting as well as challenging. Often
data itself becomes the building blocks of data products – Amazon
Recommendation systems, Facebook Friend recommendation, Movie
recommendation on Netflix, Music Recommendation on Spotify.
 In finance it is credit ratings, trading algorithms, risk assessment, etc. In e-learning it is dynamic personalized learning and assessments – Khan Academy and Knewton. In government it is the creation of policies based on data. This is the beginning of a culturally saturated feedback loop.

 Behavioural data of people changes the product, and the product changes the behaviour of the people. Large-scale data processing, increased memory and bandwidth, and the cultural acceptance of technology in our lives make this possible in a way it was not a decade ago. Considering the impact of this feedback loop, we should seriously start thinking about how it is being conducted, along with the ethical and technical responsibilities of the people responsible for the process.

Questions

1.Explain the Reasons for Data Science Hype

2.Explain in detail the requirement of data

Handouts for Session 3: Datafication and Current Landscape


1.4 Datafication

 Datafication is a process of taking all aspects of life and turning them into data –
Kenneth Neil Cukier and Viktor Mayer Schoenberger (Rise of Big Data, 2013).
 Everything we do online or otherwise ends up recorded for later examination in
data storage units or for sale – Facebook isn’t free.
 We are the product. Google's AR glasses datafy gaze, Twitter datafies stray thoughts, LinkedIn datafies professional networks. Consider the importance of datafication with respect to people's intentions about sharing their own data. We are being datafied.
 Our actions are being datafied. The spectrum ranges from gleefully taking part in social media experiments we are proud of, to all-out surveillance and stalking. When we “like” something online, we intend to be datafied, or at least we should expect to be. When we browse the web, we are unintentionally, or at least passively, being datafied via cookies we may or may not be aware of.
 When we are walking around on the streets, we are being datafied by Cameras.
Once we datafy things, we can transform their purpose and turn the information
into new forms of value.
 Who is “We”? It is usually modelers and entrepreneurs profiting from getting
people to buy stuff. What kind of “Value”? Increased efficiency through
automation.

1.5 The Current Landscape – Skillsets Needed


 Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics. But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics. And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it. Data science is the civil engineering of data.

 Its acolytes possess a practical knowledge of tools and materials, coupled with a
theoretical understanding of what’s possible.
 So, the statement is essentially saying that while hackers may be proficient at
writing code and solving technical problems, they may not necessarily have the
depth of knowledge or interest in the mathematical and statistical concepts that are
crucial to data science.
 While statisticians may excel in theoretical aspects of data analysis, they may lack
the programming skills necessary to handle real-world data effectively.
 In summary, while statistics is an important component of data science, data
science encompasses a broader set of skills and activities beyond statistical
analysis, including programming, data manipulation, and machine learning.

Questions:
1.Explain the process of datafication in details.
2. Explain the current landscape of Data Science and the skill sets needed.

Handouts for Session 4: Skill Sets


1.6 Data Science Profile
Expertise in the following fields is a requirement
 Computer Science: Data science relies heavily on programming and computational tools to manipulate, analyze, and visualize data. Proficiency in programming languages like Python, R, or SQL is essential for data acquisition, cleaning, and analysis. Additionally, knowledge of algorithms and data structures enables efficient processing of large datasets.
 Math: Mathematics forms the foundation of data science. Concepts from
calculus, linear algebra, and discrete mathematics are used for understanding and
implementing machine learning algorithms, statistical modeling, and
optimization techniques.
 Statistics: Statistics provides the framework for making inferences and
predictions from data. Understanding probability theory, hypothesis testing,
regression analysis, and sampling methods is crucial for analyzing data, assessing
model performance, and drawing meaningful conclusions.
 Machine Learning: Machine learning algorithms are at the heart of data science,
enabling systems to learn from data and make predictions or decisions. Data
scientists need to understand various machine learning techniques, such as
supervised learning (e.g., linear regression, decision trees), unsupervised learning
(e.g., clustering, dimensionality reduction), and deep learning (e.g., neural
networks).
 Communication and Presentation Skills: Data scientists must effectively
communicate their findings and insights to stakeholders, which requires strong
verbal and written communication skills. They should be able to translate
complex technical concepts into layman's terms and craft compelling narratives.
Presentation skills involve creating visually appealing and informative
presentations or reports to convey the results of data analyses.
 Data Visualization: Visualizing data is essential for exploring patterns, trends, and relationships within datasets. Data scientists use tools like matplotlib, ggplot2, or Tableau to create meaningful visualizations that aid in understanding and interpreting data. Effective data visualization enhances communication and decision-making processes (a minimal plotting sketch follows this list).
 Extensive Domain Expertise: Domain knowledge is critical for contextualizing
data and understanding the specific challenges and opportunities within a
particular industry or field. Data scientists with extensive domain expertise can
identify relevant variables, interpret results in the appropriate context, and
develop actionable insights tailored to the needs of stakeholders.
 In summary, proficiency in computer science, math, statistics, machine learning,
communication, data visualization, and domain expertise are all essential
requirements for success in the field of data science.
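As a small illustration of the data visualization point above, the following matplotlib sketch produces a labelled bar chart; the monthly figures are invented purely for demonstration.

# Minimal visualization sketch: a labelled bar chart for communicating a result.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
signups = [120, 135, 160, 150, 190, 230]        # invented monthly figures

plt.bar(months, signups)
plt.title("New user sign-ups per month")        # a clear title and axis labels
plt.xlabel("Month")                             # make the chart self-explanatory
plt.ylabel("Sign-ups")
plt.tight_layout()
plt.show()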


 As we mentioned earlier, a data science team works best when different skills (profiles) are represented across different people, because nobody is good at everything. It makes us wonder whether it might be more worthwhile to define a “data science team”—as shown in Figure 1-3—than to define a data scientist.


1.7 What Is a Data Scientist?


 Here we discuss how a data scientist is defined in academia and in industry.
In Academia:
 For the term “data science” to catch on in academia at the level of the faculty,
and as a primary title, the research area needs to be more formally
defined. An academic data scientist is a scientist, trained in any of the academic
fields, who works with large amounts of data, and must grapple with
computational problems posed by the structure, size, messiness, and the
complexity and nature of the data, while simultaneously solving a real world
problem.
In Industry:
 A chief data scientist sets the data strategy of the company, which involves a
variety of things: setting everything up from the engineering and infrastructure
for collecting data and logging, to privacy concerns, to deciding what data will be
user-facing, how data is going to be used to make decisions, and how it’s going
to be built back into the product. They will also manage a team of engineers,
scientists and analysts and communicate with leadership across the company
including the CEO, CTO, and product leadership. They should also be concerned with patenting innovative solutions and setting research goals.
 More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. They spend a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills – skills that are also necessary for understanding biases in the data, and for debugging logging output from code.
 Once they get the data into shape, a crucial part is exploratory data analysis, which combines visualization and data sense. They will find patterns, build models and algorithms – some with the intention of understanding product usage and the overall health of the product, and others to serve as prototypes that ultimately get baked back into the product. They may also design experiments, which is a critical part of data-driven decision making. They will communicate with team members, engineers, and leadership in clear language and with data visualizations, so that even if their colleagues are not immersed in the data themselves, they will understand the implications.

Questions:

1.Explain the Skill Sets needed for the Data Scientist Profile.

2. Explain the Data Scientist in terms of Academics and Industry in detail.

Handouts for Session 5: Statistical Inference


1.8 Statistical Inference

 As we commute to work on subways and in cars, as our blood moves through


our bodies, as we’re shopping, emailing, procrastinating at work by browsing
the Internet and watching the stock market, as we’re building things, eating
things, talking to our friends and family about things, while factories are
producing products, this all at least potentially produces data.
 Imagine spending 24 hours looking out the window, and for every minute,
counting and recording the number of people who pass by. Or gathering up
everyone who lives within a mile of your house and making them tell you how
many email messages they receive every day for the next year. Imagine heading
over to your local hospital and rummaging around in the blood samples looking
for patterns in the DNA. That all sounded creepy, but it wasn’t supposed to. The
point here is that the processes in our lives are actually data-generating
processes.
 Data represents the traces of the real-world processes, and exactly which traces
we gather are decided by our data collection or sampling method. You, the data
scientist, the observer, are turning the world into data.
 Once you have all this data, you have somehow captured the world, or certain
traces of the world. But you can’t go walking around with a huge Excel
spreadsheet or database of millions of transactions and look at it and, with a
snap of a finger, understand the world and process that generated it. So you need
a new idea, and that’s to simplify those captured traces into something more
comprehensible, to something that somehow captures it all in a much more
concise way, and that something could be mathematical models or functions of
the data, known as statistical estimators. This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.
 It is a discipline concerned with the development of procedures, methods and theorems that allow the extraction of meaning and information from data generated by stochastic processes.

World Data World – Statistical Inferencing


Questions:
1.Explain Statistical Inference with Example.

Handouts for Session 6: Population and Samples


1.9 Population and Samples

 Population is any set of objects or units (tweets, photographs, stars). If the


characteristics of all the objects can be measured or extracted, then it is a
complete set of observations (N).
 A single observation can include a list of characteristics of the object. A subset of
the units of size (n) considered to make observations and draw conclusions and
make inferences about the population is called a sample. Taking a subset may
introduce biases into the data and distort it.
 Example: Suppose you are conducting research on smartphone usage habits
among teenagers in a specific city. Your population comprises all teenagers aged
13-18 living in that city, which could number in the tens of thousands. Due to
logistical constraints and the difficulty of reaching every teenager in the city, you
opt to use a sample of 500 teenagers randomly selected from different schools
within the city. This sample will participate in interviews or surveys to provide
insights into their smartphone usage patterns, preferences, and behaviors.
 Population refers to the total set of observations. Say you are looking for the average height of men; here the population is the set of all men in the world (a small sampling simulation follows below).
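The population/sample idea above can be illustrated with a small simulation. This is only a sketch with invented heights: it draws a simple random sample of n = 500 from a synthetic population of N = 50,000 and compares the sample mean with the population mean.

# Minimal population-vs-sample sketch: draw a random sample and compare means.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=175, scale=7, size=50_000)    # invented heights (cm), N = 50,000

sample = rng.choice(population, size=500, replace=False)  # simple random sample, n = 500

print("population mean:", round(population.mean(), 2))
print("sample mean:    ", round(sample.mean(), 2))
# A well-drawn random sample gives an estimate close to the population value;
# a biased sampling method (e.g., measuring only basketball players) would not.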


1.10 Population and Samples of Big Data

 In the age of big data, we still need to take samples, because sampling solves some engineering problems. How much data is needed depends on the goal. For analysis or inference there is no need to store all the data all the time; for serving purposes you may need it all in order to render correct information.
 Bias: If a sample of data is observed, it may have an inherent bias in it and the data may be representative of only that subset and not of the entire population; therefore any conclusion or inference drawn from it should not be extended to the entire population – e.g., tweets pre- and post-Hurricane Sandy.
 If the tweets immediately before Hurricane Sandy are analyzed, one would infer that most people went supermarket shopping. If the tweets immediately after Hurricane Sandy are analyzed, one would infer that most people went partying. Most tweets were from New Yorkers, who are heavy tweeters, and not from New Jerseyans. Coastal New Jerseyans were worried about their houses collapsing and did not have time to tweet.
 If only this limited data were studied, the only conclusion one could draw is what Hurricane Sandy was like for a subset of Twitter users (who are not representative of the whole population), and one would infer that the hurricane was not that bad.

 Types of Data
 Traditional – numerical, categorical and binary
 Text – Emails, Tweets, Reviews, News Articles
 Records – User-level data, timestamped event data, JSON-formatted log files
 Geo-Based Location Data – Housing Data
 Network
 Sensor Data
 Images
 New data requires new strategies for sampling. If Facebook user-level data aggregated from timestamped event logs is analyzed for a week, can any conclusions be drawn that are relevant next week or next year? How do we sample from a network and preserve the complex network structure? Many of these are open research questions.

1.11 BIG Data


 BIG is a moving target – When the size of the data becomes a challenge we refer
to it as big. BIG is when you cannot fit all the data on one machine. BIG data is a
cultural phenomenon. It is characterized by The Vs – Volume, Variety, Velocity,
Value, Validity and Veracity.

Elements of Big Data

 Volume – Data Measured in terms of petabytes and exabytes (1mn TB), made
possible by reduction in cost of storage devices

 Velocity – The fast arrival speed of data and increase in data volume. Powered by
IoT and High Speed Internet

 Variety – Form – Many forms of data – Text, graph, audio, video maps,
composite (Video with audio). Function – Human Conversations, Transaction
records, old archive data. Source of Data – Open/public data, social media data,
multimodal data

 Veracity – Aspects like conformity to the facts, truthfulness, believability, and confidence in the data – error sources include technical, typographical and human errors

 Validity – Accuracy of the data for taking decisions or for any other goals

 Value - the value of the information that is extracted from the data and its
influence on the decisions that are taken based on it.

1.12 Big Data Can Mean Big Assumptions


 Big Data revolution consists of three things:

 Collecting and using a lot of data rather than small samples

 Accepting messiness in your data

 Giving up on knowing the causes

 N=all very often is not a good assumption and misses the things we should
consider the most.
 For example: election day polls – even if we poll everyone who leaves the polling stations, we still don't count the people who decided not to vote, and these may be the very people we need to understand the voting problems! Recommendations on Netflix may not be good because the people who bother to rate shows may have different tastes, skewing the recommendation system towards the tastes of the people who rated.
 Data is not objective, and data does not speak for itself. Example: an algorithm for hiring – consider an organization that did not treat female employees well. When comparing men and women with the same qualifications, the data showed that women tended to leave more often, get promoted less often, and give more negative feedback on the work environment than men. An automated model based on this data will likely hire a man over a woman if a man and a woman with the same qualifications turn up for the interview. Ignoring causation can be a flaw rather than a feature, and can add to historical problems rather than address them. Data is just a quantitative representation of the events of our society.
 The n = 1 assumption refers to a sample size of 1. For a single person, we can actually record a lot of information; we might even sample from all the actions they took in order to make inferences about them. This is used in user-level modeling.

Questions:
1.Explain Population with example
2.Explain Sample with example
3.Explain the different types of data in population and sampling of big data.
4.Explain in detail 5 elements of big data.
5.Explain the Big Data Revolution in detail.


Handouts for Session 7: Modeling


1.13 Modeling
 Humans try to understand the world around them by representing it in different
ways called Models. Statisticians and data scientists capture the uncertainty and
randomness of data-generating processes with mathematical functions that express
the shape and structure of the data itself. A model is an attempt to understand and
represent the nature of reality through a particular lens, be it architectural,
biological, or mathematical. It is an artificial construction where all extraneous
detail has been removed or abstracted. Attention must always be paid to these
abstracted details after a model has been analyzed to see what might have been
overlooked.
 In the case of a statistical model, we may have mistakenly excluded key variables,
included irrelevant ones, or assumed a mathematical structure which is far from
reality.
 A Model is an attempt to understand the population of interest and represent that
in a compact form which can be used to experiment, analyze, study and determine
cause-and-effect and similar relationships amongst the variables under study in the
population.

1.14 Statistical Modeling


 Statistical modeling is an expression of relationships – What comes first? What influences what? What causes what? What's a test of that? – in terms of mathematical expressions that are general enough that they have to include parameters, but whose parameter values are not yet known.
 Other people prefer pictures and will first draw a diagram of data flow, possibly
with arrows, showing how things affect other things or what happens over time.
This gives them an abstract picture of the relationships before choosing equations
to express them.

 For example: if there are two columns of data, x and y, and there is a linear relationship between them, then we can represent it as y = ax + b, where a and b are parameters whose values are not yet known. A typical case is predicting salary from an employee's years of experience (a minimal fitting sketch follows).
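A minimal sketch of fitting the linear form y = ax + b, using invented experience/salary numbers and numpy's least-squares polynomial fit:

# Minimal sketch of the linear model y = a*x + b: estimating a and b from data.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])        # years of experience (invented)
y = np.array([30, 36, 41, 47, 52, 58])  # salary in thousands (invented)

a, b = np.polyfit(x, y, 1)              # least-squares fit of a degree-1 polynomial
print(f"estimated parameters: a = {a:.2f}, b = {b:.2f}")
print("predicted salary for 7 years of experience:", round(a * 7 + b, 1))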


How to build a model?

 To build a model we start with Exploratory Data Analysis (EDA) which includes
making plots, building intuition for a particular dataset. It involves plotting
histograms and looking at scatter plots to get a feel for the data. Representative
functions are written down.
 Start with a simple linear function and see if it makes sense. If it does not, understand why, see what representative function would make more sense, and keep building up the complexity (e.g., go parabolic after linear). Write down complete sentences and try to express the words as equations and code. Simple plots may be easier to interpret and understand.
 A trade-off is usually required during modeling: a simple model may get you 90% of the way and take a few hours to build, whereas a complex model may get you up to 92% and take months to build.
 Example: a short exploratory-data-analysis sketch on the same kind of salary data is given below.
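A minimal exploratory-data-analysis sketch, again on invented salary data: summary statistics, a histogram, and a scatter plot to build intuition before choosing a representative function.

# Minimal EDA sketch: summary statistics, a histogram, and a scatter plot
# to build intuition before choosing a representative function (toy data).
import numpy as np
import matplotlib.pyplot as plt

experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])          # invented data
salary = np.array([28, 34, 40, 45, 52, 57, 61, 68, 72, 80])

print("salary mean:", salary.mean(), " salary std:", round(salary.std(), 1))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(salary, bins=5)                      # get a feel for the distribution
ax1.set_xlabel("Salary (k)")
ax2.scatter(experience, salary)               # look for a relationship
ax2.set_xlabel("Experience (years)"); ax2.set_ylabel("Salary (k)")
plt.tight_layout()
plt.show()
# If the scatter looks roughly linear, start with y = a*x + b (as fitted above);
# only move to a more complex form (e.g., quadratic) if the simple one fails.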


Questions:

1.Define Modeling. Explain how to build a model with example.

2.Explain statistical Modeling with Example

Handouts for Session 8: Probability Distributions and Fitting a Model


1.15 Probability Distributions

 Probability Distributions are foundations of statistical models.


 Probability distributions are fundamental concepts in statistics and probability
theory. In the context of data science, understanding probability distributions is
crucial for modeling and analyzing data.
 Probability Distribution: A probability distribution describes the likelihood of
each possible outcome of a random variable. It assigns probabilities to different
values that the variable can take.
 Example – Normal (Gaussian) Distribution, Poisson Distribution, Weibull
Distribution, Gamma Distribution, Exponential Distribution. Natural Processes
tend to generate measurements whose empirical shape could be approximated by
mathematical functions with a few parameters that could be estimated from the
data. Not all processes generate data that looks like a named distribution, but
many do. These functions can serve as building blocks of our models.

 It is beyond our scope to go into each of the distributions in detail, but each of the common distributions listed above has a characteristic shape, and they only have names because someone observed them enough times to think they deserved names. There is actually an infinite number of possible distributions. They are to be interpreted as assigning a probability to a subset of possible outcomes, and they have corresponding functions. For example, the normal distribution is written as:

Normal Distribution

• N(x | μ, σ) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

• μ is the mean

• σ is the standard deviation

 Data tends to cluster around a central value with no bias to the left or right. It is a symmetric distribution, appearing as a bell-shaped curve. The mean μ controls where the distribution is centred (for the normal distribution the mean and median coincide), and σ controls how spread out it is. This is the general functional form; a specific real-world phenomenon has actual parameter values, which can be estimated from data.
 A random variable, denoted by x or y, can be assumed to have a corresponding probability distribution p(x), which maps each value to a non-negative real number. In order for p(x) to be a probability density function, we are restricted to the set of functions such that integrating p(x) – taking the area under the curve – gives 1, so that it can be interpreted as probability (a numerical sketch follows below).
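A small numerical check of the points above, using scipy with invented parameter values: the normal density integrates to 1, and areas under it are probabilities (for example, the probability of landing within one standard deviation of the mean).

# Minimal sketch: the normal density integrates to 1, and areas under it are probabilities.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 0.0, 1.0                               # invented parameter values

def pdf(x):
    return norm.pdf(x, loc=mu, scale=sigma)        # the N(x | mu, sigma) density

total_area, _ = quad(pdf, -np.inf, np.inf)         # should be (numerically) 1
within_one_sigma, _ = quad(pdf, mu - sigma, mu + sigma)

print("total area under p(x):", round(total_area, 4))                    # ~1.0
print("P(mu - sigma <= X <= mu + sigma):", round(within_one_sigma, 4))   # ~0.6827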


 Example: Let x be the amount of time until the next bus arrives. x is a random variable, because there is variation and uncertainty in the amount of time until the next bus. Suppose we know that the time until the next bus has the probability density function p(x) = 2e^(−2x). If we want to know the likelihood of the next bus arriving in between 12 and 13 minutes, we find the area under the curve between 12 and 13, i.e., the integral of 2e^(−2x) from x = 12 to x = 13 (a numerical check follows after this list). How do we know that this distribution is correct? We can conduct an experiment where we show up at the bus stop at a random time, measure how much time passes until the next bus, and repeat the experiment over and over again. Then we look at the measurements, plot them, and approximate the function. Because “waiting time” is a common enough real-world phenomenon, a distribution called the exponential distribution has been invented to describe it, and it takes the form p(x) = λe^(−λx).
 Joint probability is the probability of two events occurring simultaneously. Example: washing the car and it raining on the same day.
 Conditional probability is the probability of one event occurring given that a second event has occurred.
 Example: a bag contains 2 blue marbles and 3 red marbles, and marbles are drawn without replacement. If a blue marble was selected first, there is now a 1/4 chance of getting a blue marble and a 3/4 chance of getting a red marble on the second draw. If a red marble was selected first, there is now a 2/4 chance of getting a blue marble and a 2/4 chance of getting a red marble.
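The bus-waiting calculation above can be checked numerically. This is only a sketch, assuming the stated density p(x) = 2e^(−2x), i.e., an exponential distribution with rate λ = 2:

# Minimal sketch: probability that the wait is between 12 and 13, under p(x) = 2*e^(-2x).
import numpy as np
from scipy.stats import expon
from scipy.integrate import quad

lam = 2.0                                   # rate parameter from the example
dist = expon(scale=1 / lam)                 # scipy parameterizes by scale = 1/lambda

p_cdf = dist.cdf(13) - dist.cdf(12)         # area under the curve via the CDF
p_quad, _ = quad(lambda x: lam * np.exp(-lam * x), 12, 13)  # direct numerical integral

print(p_cdf, p_quad)   # both ~3e-11: with lambda = 2 the mean wait is only 0.5,
                       # so waits of 12-13 time units are extremely unlikely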

1.16 Fitting a Model

 Fitting a model means estimating the parameters of the model using the observed data. The data is used as evidence to help approximate the real-world mathematical process that generated the data. A good model fit refers to a model that accurately approximates the output when it is provided with unseen inputs.
 Fitting the model often involves optimization methods and algorithms such as
maximum likelihood estimation, to help get the parameters. When you estimate
the parameters, they are actually estimators, meaning they themselves are
functions of the data.
 Fitting the model is when you start actually coding: your code will read in the
data, and you’ll specify the functional form that you wrote down on the piece of
paper.


 Then R or Python will use built-in optimization methods to give you the most
likely values of the parameters given the data. Initially you should have an
understanding that optimization is taking place and how it works, but you don’t
have to code this part yourself—it underlies the R or Python functions.
 The process involves running an algorithm on data for which the target variable is known (“labeled” data) to produce a machine learning model. Then the model's outcomes are compared to the real, observed values of the target variable to determine its accuracy (a small maximum-likelihood sketch follows below).
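A minimal maximum-likelihood sketch, tying back to the exponential waiting-time example: for the exponential distribution the maximum-likelihood estimate of λ is 1 divided by the sample mean, so it can be computed directly from (here, simulated) data. The "true" λ used to generate the data is invented.

# Minimal maximum-likelihood sketch: estimate lambda of an exponential
# distribution from simulated waiting times (MLE: lambda_hat = 1 / sample mean).
import numpy as np

rng = np.random.default_rng(1)
true_lam = 2.0                                              # invented "real-world" parameter
waits = rng.exponential(scale=1 / true_lam, size=10_000)    # simulated observations

lam_hat = 1.0 / waits.mean()                                # maximum-likelihood estimator
print("estimated lambda:", round(lam_hat, 3), "(true value was", true_lam, ")")
# The estimator is itself a function of the data: a different sample gives a
# slightly different estimate.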
 Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data. Overfitting occurs when a model learns the training data too well, including its noise and outliers, to the extent that it performs poorly on unseen data (a comparison sketch follows after the list of causes below).
Causes of Overfitting:
 Complex Models: Models with too many parameters relative to the size of the
training data can capture noise instead of underlying patterns.
 Insufficient Data: When the amount of training data is limited, complex models
may find patterns where none exist due to randomness.
 Feature Overfitting: Including irrelevant features or too many features in the
model can lead to overfitting.
 Lack of Regularization: Regularization techniques, such as L1 and L2
regularization, help prevent overfitting by penalizing overly complex models.
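A minimal sketch of overfitting on invented, noisy linear data: a high-degree polynomial fits the small training set almost perfectly but generalizes worse to unseen test data than a simple straight line.

# Minimal overfitting sketch: compare a simple and an overly complex model
# on training data versus unseen test data (toy, noisy linear data).
import numpy as np

rng = np.random.default_rng(2)
x_train = np.linspace(0, 10, 15)
y_train = 3 * x_train + 5 + rng.normal(0, 4, size=x_train.size)   # noisy line
x_test = np.linspace(0.5, 9.5, 15)
y_test = 3 * x_test + 5 + rng.normal(0, 4, size=x_test.size)

def mse(y, y_pred):
    return float(np.mean((y - y_pred) ** 2))

for degree in (1, 10):                     # simple model vs. very complex model
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree}: train MSE = {train_err:.1f}, test MSE = {test_err:.1f}")
# The high-degree fit has a much lower training error but typically a much higher
# test error: it has learned the noise, not the underlying pattern.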

Questions

1.Explain Probability Distribution with example

2.Explain Overfitting and causes of overfitting with example
