Data Science and Machine Learning
Module 1
Data science
• Data Science can be defined as the study of data: where it comes from, what it represents, and the ways by which it can be transformed into valuable inputs and resources for creating business and IT strategies.
• Data Science is about finding patterns in data through analysis and making future predictions.
• Data Science is about data gathering, analysis, and decision-making.
Data science
• Data science uses the most powerful hardware, programming systems, and most efficient algorithms to solve data-related problems. It is the future of artificial intelligence.
• By using data science, companies are able to make:
• Better decisions (should we choose A or B?)
• Predictive analyses (what will happen next?)
• Pattern discoveries (finding patterns or hidden information in the data)
Types of data
• Data can be categorised into 4 basic types from an ML perspective:
1. Numerical data
2. Categorical data
3. Time series data
4. Text data
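
A minimal Python sketch (hypothetical column names and values, pandas assumed available) showing how the four types map onto DataFrame column dtypes:

```python
import pandas as pd

# Hypothetical example: one column for each basic data type
df = pd.DataFrame({
    "temperature": [21.5, 23.0, 19.8],                      # numerical
    "category": pd.Categorical(["shirt", "shoe", "shirt"]), # categorical
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03"]),        # time series
    "review": ["good fit", "too small", "loved it"],        # text
})
print(df.dtypes)
```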
• Artificial intelligence is about giving machines the capability of mimicking human behaviour. Examples include facial recognition, automated driving, etc.
• Machine learning, which can be considered either a sub-field or one of the tools of artificial intelligence, provides machines with the capability of learning from experience. Experience for machines comes in the form of data; data used to teach machines is called training data.
• Machine learning algorithms, also called "learners", take both the known input and output (training data) to figure out a model for the program which converts input to output.
[Slide illustration: the learner is shown labeled training examples ("this is a shirt") and then recognizes new items as "this is also a shirt".]
Key features and motivations:
1. Extracting Meaningful Patterns
2. Building Representative Models
3. Combination of Statistics, Machine Learning, and Computing
4. Learning Algorithms
DATA SCIENCE CLASSIFICATION

• Data science problems can be broadly categorized into supervised and unsupervised learning models.
Supervised Models
• Supervised or directed data science tries to infer a function or relationship
based on labeled training data and uses this function to map new
unlabeled data.
• Supervised techniques predict the value of the output variables based on a
set of input variables.
• To do this, a model is developed from a training dataset where the values of
input and output are previously known.
• The model generalizes the relationship between the input and output
variables and uses it to predict for a dataset where only input variables
are known.
• The output variable that is being predicted is also called a class label or target variable.
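
A minimal supervised-learning sketch using scikit-learn (the toy age/income values and labels are hypothetical):

```python
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: inputs X with known class labels y
X_train = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]  # e.g. age, income
y_train = [0, 1, 1, 0]                                          # target variable

model = DecisionTreeClassifier().fit(X_train, y_train)

# Use the generalized relationship to predict the label of new, unlabeled data
print(model.predict([[30, 50000]]))
```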
Unsupervised Models
• Unsupervised or undirected data science uncovers hidden patterns in
unlabeled data.
• In unsupervised data science, there are no output variables to
predict.
• The objective of this class of data science techniques is to find patterns in data based on the relationship between the data points themselves.
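
A minimal unsupervised sketch, assuming scikit-learn's k-means as the pattern-finding technique (the data points are hypothetical):

```python
from sklearn.cluster import KMeans

# Unlabeled data: no output variable, only inputs
X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]

# Group the points into 2 clusters based on their mutual similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # hidden grouping discovered from the data alone
```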
DATA SCIENCE PROCESS
The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.

1. PRIOR KNOWLEDGE

Prior knowledge refers to information that is already known about a subject.
• This helps to define:
– what problem is being solved
– how it fits in the business context
– what data is needed in order to solve the problem
PRIOR KNOWLEDGE
Objective: Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
Subject Area: The process of data science uncovers hidden patterns in a dataset by exposing relationships between attributes. The problem is that it uncovers a lot of patterns, and false signals are a major problem in the data science process. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
Data: Similar to prior knowledge in the subject area, prior knowledge in the data can also be gathered. Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process.
Terminologies
• A dataset (example set) is a collection of data with a defined structure. This structure is also sometimes referred to as a "data frame".
• A data point (record, object, or example) is a single instance in the dataset. Each row in the table is a data point. Each instance has the same structure as the dataset.
• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each column in the table is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean data types.
• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes.
• Identifiers are special attributes that are used for locating or providing context to individual records.
Causation Versus Correlation
Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.
Causation indicates that one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events. This is also referred to as cause and effect.
Causation Versus Correlation
• Correlation means there is a relationship or pattern between the values of two variables.
• There are three ways to describe the correlation between variables:
• Positive correlation: as x increases, y increases.
• Negative correlation: as x increases, y decreases.
• No correlation: as x increases, y stays about the same or has no clear pattern.
• Causation means that one event causes another event to occur.
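
A short sketch of measuring correlation with NumPy (the sample values are hypothetical); note that the computed coefficient says nothing about causation:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # rises with x

# Pearson correlation coefficient: sign gives direction, magnitude gives strength
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1 -> strong positive correlation
# A high r does not by itself establish that x causes y
```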


2. DATA PREPARATION
• Preparing the dataset to suit a data science task is the most time-consuming part of the process.
• It is extremely rare that datasets are available in the form required by the data science algorithms.
• Most data science algorithms require data to be structured in a tabular format, with records in the rows and attributes in the columns.
• If the data is in any other format, it needs to be transformed by applying type conversion, join, or transpose functions, etc., to condition the data into the required structure.
2.1 Data Exploration
• Data preparation starts with an in-depth exploration of the data and gaining a better understanding of the dataset.
• Data exploration, also known as exploratory data analysis, provides a set of simple tools to achieve a basic understanding of the data.
• Data exploration approaches involve computing descriptive statistics and visualization of data.
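
A minimal descriptive-statistics sketch with pandas (hypothetical attribute values):

```python
import pandas as pd

# Hypothetical dataset for exploration
df = pd.DataFrame({"age": [25, 32, 47, 51, 38],
                   "income": [30000, 42000, 71000, 65000, 50000]})

# Descriptive statistics: count, mean, std, min, quartiles, max per attribute
print(df.describe())
```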


2.2 Data Quality
• Data quality is an ongoing concern wherever data is collected, processed, and stored.
• Organizations use data alerts, cleansing, and transformation techniques to improve and manage the quality of the data, and store it in companywide repositories called data warehouses.
• Data sourced from well-maintained data warehouses has higher quality, as there are proper controls in place to ensure a level of data accuracy for new and existing data.
• Data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, substitution of missing values, etc.
2.3 Missing Values
• One of the most common data quality issues is that some records have missing attribute values. For example, a credit score may be missing in one of the records.
• The first step in managing missing values is to understand why the values are missing.
• The missing value can be substituted with a range of artificial data so that the issue can be managed.
• Missing credit score values can be replaced with a credit score derived from the dataset (mean, minimum, or maximum value, depending on the characteristics of the attribute).
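
A sketch of mean substitution with pandas (hypothetical credit scores):

```python
import pandas as pd

df = pd.DataFrame({"credit_score": [680, None, 710, 590, None]})

# Substitute missing credit scores with the mean of the observed values
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].mean())
print(df)
```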


2.4 Data Types and Conversion
• The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical. For example, the credit score can be expressed as categorical values (poor, good, excellent) or as a numeric score.
• Different data science algorithms impose different restrictions on the attribute data types.
• If the available data are categorical, they must be converted to continuous numeric attributes. A specific numeric score can be encoded for each category value, such as poor = 400, good = 600, excellent = 700, etc.
• Numeric values can be converted to categorical data types by a technique called binning.
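
A sketch of both conversions with pandas, using the example scores above (bin boundaries are assumed for illustration):

```python
import pandas as pd

# Categorical -> numeric: encode each category with a representative score
ratings = pd.Series(["poor", "good", "excellent", "good"])
scores = ratings.map({"poor": 400, "good": 600, "excellent": 700})

# Numeric -> categorical: binning splits the numeric range into labeled intervals
binned = pd.cut(scores, bins=[0, 500, 650, 850],
                labels=["poor", "good", "excellent"])
print(scores.tolist(), binned.tolist())
```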
2.6 Outliers
• Outliers are anomalies in a given dataset.
• Detecting outliers may be the primary purpose of some data science applications, like fraud or intrusion detection.
2.7 Feature Selection
• Many data science problems involve a dataset with hundreds to thousands of attributes.
• Not all the attributes are equally important or useful in predicting the target.
• Some of the attributes may be highly correlated with each other, like annual income and taxes paid.
• A large number of attributes in the dataset significantly increases the complexity of a model and may degrade its performance.
• Reducing the number of attributes, without significant loss in the performance of the model, is called feature selection.
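
An illustrative sketch of spotting and dropping a redundant, highly correlated attribute (the income/taxes values are hypothetical):

```python
import pandas as pd

# Hypothetical attributes: income and taxes paid move together
df = pd.DataFrame({"income": [40, 60, 80, 100],
                   "taxes":  [8, 12, 16, 20],
                   "age":    [25, 40, 33, 51]})

# Inspect pairwise correlations; values near +/-1 suggest redundancy
print(df.corr())

# income and taxes carry the same signal, so one can be dropped
reduced = df.drop(columns=["taxes"])
```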
2.8 Data Sampling
• Sampling is a process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling.
• The sampled data serve as a representative of the original dataset.
• Sampling reduces the amount of data that needs to be processed and speeds up the model-building process.
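
A sampling sketch with pandas (a hypothetical 1000-record dataset and a 10% sample):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})

# Draw a 10% random sample to represent the full dataset
sample = df.sample(frac=0.1, random_state=42)
print(len(sample))  # 100 records instead of 1000
```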


3. MODELING
• A model is an abstract representation of the data and the relationships in a given dataset.
3.1 Training and Testing Datasets
• The dataset used to create the model, with known attributes and target, is called the training dataset.
• The validity of the created model also needs to be checked with another known dataset, called the test dataset or validation dataset.
• To facilitate this process, the overall known dataset can be split into a training dataset and a test dataset.
• A standard rule of thumb is that two-thirds of the data are used for training and one-third as a test dataset.
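
A sketch of the two-thirds/one-third split using scikit-learn (toy data):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(9)]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0]

# Two-thirds for training, one-third held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
print(len(X_train), len(X_test))  # 6 and 3
```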


3.2 Learning Algorithms
• The business question and the availability of data will dictate which data science task (association, classification, regression, etc.) is to be used.
• The practitioner determines the appropriate data science algorithm within the chosen category.
• For example, within a classification task many algorithms can be chosen from: decision trees, neural networks, Bayesian models, k-NN, etc.
3.3 Evaluation of the Model
• Model evaluation is used to test the performance of the model.
• The model is tested with known records that were not used to build the model.
• The actual value of the output can be compared against the value predicted by the model, and thus the prediction error can be calculated.
• As long as the error is acceptable, the model is ready for deployment.
• The error rate can be used to compare this model with other models developed using different algorithms, like neural networks or Bayesian models, etc.
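
A sketch of computing the error rate from held-out test records (the labels are hypothetical):

```python
from sklearn.metrics import accuracy_score

# Known labels of held-out test records vs. the model's predictions
y_actual    = [1, 0, 1, 1, 0, 1]
y_predicted = [1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_actual, y_predicted)
print("error rate:", round(1 - accuracy, 3))  # fraction of wrong predictions
```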

3.4 Ensemble Modeling
• Ensemble modeling is a process where multiple diverse base models are used to predict an outcome.
• The motivation for using ensemble models is to reduce the generalization error of the prediction.
• As long as the base models are diverse and independent, the prediction error decreases when the ensemble approach is used.
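
An illustrative ensemble sketch with scikit-learn's voting classifier over three diverse base models (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Three diverse base models vote on each prediction
ensemble = VotingClassifier([
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("lr",   LogisticRegression(max_iter=1000)),
    ("nb",   GaussianNB()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```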
APPLICATION
• Deployment is the stage at which the model becomes production ready, or live.
• The model deployment stage has to deal with:
• assessing model readiness,
• technical integration,
• response time,
• model maintenance,
• assimilation.
1. Production Readiness
• The production readiness part of the deployment determines the critical qualities required for the deployment objective.
• Consider a business use case:
i) determining whether a consumer qualifies for a loan: the critical quality of this model deployment is real-time prediction.
2. Technical Integration
• Currently, it is quite common to use data science automation tools, or coding in R or Python, to develop models.
• It is possible to develop the model with one tool and deploy it in another tool or application.
3. Response Time
• Data science algorithms like k-NN are easy to build but quite slow at predicting unlabeled records.
• Algorithms such as the decision tree take time to build but are fast at prediction.
• The quality of prediction, accessibility of input data, and the response time of the prediction remain the critical quality factors in business applications.
4. Model Refresh
• It is quite normal that the conditions in which the model was built change after the model is deployed.
• The validity of the model can be routinely tested by using new known test data and calculating the prediction error rate.
• If the error rate exceeds a particular threshold, the model has to be refreshed and redeployed.
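
A minimal sketch of such a refresh check (the threshold value is an assumed policy choice, not from the source):

```python
# Routine model-refresh check on newly collected, known test data
ERROR_THRESHOLD = 0.10  # assumed policy value

def needs_refresh(y_actual, y_predicted):
    errors = sum(a != p for a, p in zip(y_actual, y_predicted))
    error_rate = errors / len(y_actual)
    return error_rate > ERROR_THRESHOLD

print(needs_refresh([1, 0, 1, 1], [1, 1, 0, 1]))  # True: error rate 0.5 > 0.1
```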
5. Assimilation
• In descriptive data science applications, deploying a model to live systems may not be the end objective.
• The objective may be to assimilate the knowledge gained from the data science analysis into the organization.
• For example, association analysis provides a solution for the market basket problem, where the task is to find which two products are purchased together most often.
KNOWLEDGE
• The data science process provides a framework to extract nontrivial information from data.
• With the advent of massive storage, increased data collection, and advanced computing paradigms, the datasets available to be utilized are only increasing.
• To extract knowledge from these massive data assets, advanced approaches like data science algorithms need to be employed, in addition to standard business intelligence reporting or statistical analysis. Though many of these algorithms can provide valuable knowledge, it is up to the practitioner to skilfully transform a business problem into a data problem and apply the right algorithm.
• Data science, like any other technology, provides various options in terms of algorithms; the practitioner's task is to use these options to extract the right information from the data.
DATA EXPLORATION
• Data exploration, also known as exploratory data analysis, provides a set of tools to obtain a fundamental understanding of a dataset.
• Data exploration also provides guidance on applying the right kind of further statistical and data science treatment.
• Data exploration can be broadly classified into two types:
1. descriptive statistics
2. data visualization
• Descriptive statistics is the process of condensing key characteristics of the dataset into simple numeric metrics. Some of the common quantitative metrics used are mean, standard deviation, and correlation.
• Visualization is the process of projecting the data, or parts of it, into multi-dimensional space.
DATA SET
 A dataset (example set) is a collection of data with a defined structure.
Types of data
• Data come in different formats and types.
DESCRIPTIVE STATISTICS
• Descriptive statistics refers to the study of the aggregate quantities of a dataset.
• These measures are some of the commonly used notions in everyday life, such as mean, median, mode, range, and symmetry.
• Descriptive statistics can be broadly classified into two types, depending on the number of attributes under analysis:
• univariate
• multivariate
Univariate Exploration
• Univariate data exploration denotes analysis of one attribute at a time, e.g., mean and median.
Multivariate Exploration
• Multivariate exploration is the study of more than one attribute in the dataset simultaneously.
• This technique is critical to understanding the relationship between the attributes, which is central to data science methods.
DATA VISUALIZATION
• Visualizing data is one of the most important techniques of data discovery and exploration.
• The visual representation of data provides easy comprehension of complex data with multiple attributes and their underlying relationships.
• It is a graphical representation of data that helps people understand patterns, trends, and insights.
• It can take various forms, such as charts, graphs, and maps.
Univariate Visualization
• Visual exploration starts with investigating one attribute at a time using univariate charts.
1. Histogram
• A histogram is one of the most basic visualization techniques for understanding the frequency of occurrence of values.
• It shows the distribution of the data by plotting the frequency of occurrence in a range.
• In a histogram, the attribute under inquiry is shown on the horizontal axis and the frequency of occurrence on the vertical axis.
• Histograms are used to find the central location, range, and shape of a distribution.
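
A histogram sketch with matplotlib (synthetic values drawn for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(0).normal(loc=5.0, scale=1.5, size=300)

# Attribute on the horizontal axis, frequency of occurrence on the vertical axis
plt.hist(values, bins=20)
plt.xlabel("attribute value")
plt.ylabel("frequency")
plt.show()
```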
2. Quartile
• A box whisker plot is a simple visual way of showing the distribution of a continuous variable, with information such as quartiles, median, and outliers, overlaid by mean and standard deviation.
• The quartiles are denoted by the Q1, Q2, and Q3 points, which indicate the data points with a 25% bin size. In a distribution, 25% of the data points will be below Q1, 50% will be below Q2, and 75% will be below Q3.
• The Q1 and Q3 points in a box whisker plot are denoted by the edges of the box. The Q2 point, the median of the distribution, is indicated by a cross line within the box. The outliers are denoted by circles at the end of the whisker line.
• When the quartile charts for all four attributes of the Iris dataset are plotted side by side, petal length can be observed to have the broadest range and sepal width a narrow range, out of all four attributes.
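
A box whisker sketch over the four Iris attributes, using matplotlib and scikit-learn's bundled Iris data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()

# One box whisker plot per attribute, side by side
plt.boxplot(iris.data)
plt.xticks([1, 2, 3, 4], iris.feature_names, rotation=20)
plt.ylabel("cm")
plt.show()
```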
3. Distribution Chart
• For continuous numeric attributes like petal length, instead of visualizing the actual data in the sample, its normal distribution function can be visualized instead.
• The normal distribution function of a continuous random variable is given by the formula:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

where μ is the mean of the distribution and σ is the standard deviation of the distribution.
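
A sketch that plots this function with NumPy and matplotlib (the μ and σ values are assumptions, chosen to roughly match Iris petal length):

```python
import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 3.75, 1.76  # assumed mean and standard deviation

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

plt.plot(x, pdf)
plt.xlabel("petal length")
plt.ylabel("density")
plt.show()
```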
Multivariate Visualization
• Multivariate visual exploration considers more than one attribute in the same visual.
• It focuses on the relationship of one attribute with another attribute. These visualizations examine two to four attributes simultaneously.
1. Scatterplot
• A scatterplot shows one attribute on the x-axis and another on the y-axis, with each data point drawn at its corresponding coordinates.
2. Bubble Chart
• A bubble chart is a variation of a simple scatterplot with the addition of one more attribute, which is used to determine the size of the data point. In the Iris dataset, petal length and petal width are used for the x- and y-axes, respectively, and sepal width is used for the size of the data point. The color of the data point represents the species class label.
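
A bubble chart sketch with matplotlib, following the attribute assignment described above (the size scaling factor is an arbitrary choice):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]
sepal_width = iris.data[:, 1]

# x, y from two attributes; marker size from a third; color from the class label
plt.scatter(petal_length, petal_width, s=sepal_width * 20, c=iris.target)
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.show()
```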
3. Density Chart
• Density charts are similar to scatterplots, with one more dimension included as a background color.
• The data points can also be colored to visualize one dimension, and hence a total of four dimensions can be visualized in a density chart.
• In the example, petal length is used for the x-axis, sepal length for the y-axis, sepal width for the background color, and the class label for the data point color.
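
An approximate density chart sketch with matplotlib (tricontourf stands in for the background coloring; a tiny jitter is added only to keep the triangulation robust against duplicate points):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2]  # petal length
y = iris.data[:, 0]  # sepal length
z = iris.data[:, 1]  # sepal width -> background color

# Tiny jitter avoids duplicate points when triangulating the background
rng = np.random.default_rng(0)
xj = x + rng.normal(0, 0.01, x.size)
yj = y + rng.normal(0, 0.01, y.size)

plt.tricontourf(xj, yj, z, cmap="Blues")         # third dimension as background
plt.scatter(x, y, c=iris.target, edgecolor="k")  # fourth dimension as point color
plt.xlabel("petal length")
plt.ylabel("sepal length")
plt.show()
```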
