Data Science S3mca
Module 1
Data Science
Data Science can be defined as the study of
data, where it comes from, what it represents,
and the ways by which it can be transformed into
valuable inputs and resources to create business
and IT strategies.
Key features and motivations:
1. Extracting Meaningful Patterns
2. Building Representative Models
3. Combination of Statistics, Machine Learning, and Computing
4. Learning Algorithms
DATA SCIENCE CLASSIFICATION
PRIOR KNOWLEDGE
● Prior knowledge refers to information that is already known about a subject.
● This helps to define
– what problem is being solved
– how it fits in the business context
– what data is needed in order to solve the problem
Objective: Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
Subject Area: The process of data science uncovers hidden patterns in a dataset by exposing relationships between attributes. The problem is that it uncovers many patterns, and false signals are a major issue in the data science process. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
Data: Similar to prior knowledge in the subject area, prior knowledge in the data can also be gathered. Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process.
Terminologies
● A data set (example set) is a collection of data with a defined structure. This structure is also sometimes referred to as a “data frame”.
● A data point (record, object, or example) is a single instance in the dataset. Each row in the table is a data point. Each instance has the same structure as the dataset.
● An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each column in the table is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean data types.
● A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes.
● Identifiers are special attributes used for locating or providing context to individual records.
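The terminology above can be illustrated with plain Python structures; the attribute names and values below are hypothetical examples, not data from the text.

```python
# A data set (example set): a collection of data points with a defined structure.
dataset = [
    {"id": 1, "credit_score": 620, "income": 45000, "default": "no"},   # a data point (record)
    {"id": 2, "credit_score": 710, "income": 82000, "default": "no"},
    {"id": 3, "credit_score": 540, "income": 30000, "default": "yes"},
]

attributes = ["credit_score", "income"]  # input attributes (features)
label = "default"                        # the label (target) to be predicted
identifier = "id"                        # identifier: locates individual records

first_point = dataset[0]                 # one data point; same structure as the dataset
print(first_point["credit_score"])       # one attribute value of that data point
```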
Causation Versus Correlation
Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.
Causation indicates that one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events. This is also referred to as cause and effect.
Correlation means there is a relationship or pattern between the values of two variables. There are three ways to describe the correlation between variables:
Positive correlation: as x increases, y increases.
Negative correlation: as x increases, y decreases.
No correlation: as x increases, y stays about the same or has no clear pattern.
2.2 Data Quality
● Data quality is an ongoing concern wherever data is collected, processed, and stored.
● Organizations use data alerts, cleansing, and transformation techniques to improve and manage the quality of the data, and store them in companywide repositories called data warehouses.
● Data sourced from well-maintained data warehouses have higher quality, as there are proper controls in place to ensure a level of data accuracy for new and existing data.
● Data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, substitution of missing values, etc.
2.3 Missing Values
● One of the most common data quality issues is that some records have missing attribute values. For example, a credit score may be missing in one of the records.
● The first step in managing missing values is to understand the reason why the values are missing.
● The missing value can be substituted with artificial data so that the issue can be managed.
● Missing credit score values can be replaced with a credit score derived from the dataset (mean, minimum, or maximum value, depending on the characteristics of the attribute).
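Mean imputation of a missing credit score, as described above, can be sketched in plain Python; the score values are hypothetical.

```python
from statistics import mean

# Hypothetical credit scores; None marks a missing value.
credit_scores = [620, 710, None, 540, None, 680]

# Derive a substitute from the known values (mean imputation;
# min(known) or max(known) could be used instead, depending on the attribute).
known = [s for s in credit_scores if s is not None]
substitute = mean(known)

imputed = [s if s is not None else substitute for s in credit_scores]
print(imputed)
```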
2.4 Data Types and Conversion
● The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical. For example, the credit score can be expressed as categorical values (poor, good, excellent) or as a numeric score.
● Different data science algorithms impose different restrictions on the attribute data types.
● If the available data are categorical, they must be converted to a continuous numeric attribute. A specific numeric score can be encoded for each category value, such as poor = 400, good = 600, excellent = 700, etc.
● Numeric values can be converted to categorical data types by a technique called binning.
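Both conversions can be sketched in a few lines of Python. The score mapping mirrors the example in the text; the bin thresholds below are assumptions chosen for illustration.

```python
# Categorical -> numeric: encode each category with a numeric score.
score_map = {"poor": 400, "good": 600, "excellent": 700}
categories = ["good", "poor", "excellent", "good"]
encoded = [score_map[c] for c in categories]

# Numeric -> categorical: binning a numeric score into labeled ranges
# (the thresholds 500 and 650 are hypothetical).
def bin_score(score):
    if score < 500:
        return "poor"
    elif score < 650:
        return "good"
    return "excellent"

scores = [430, 605, 700, 580]
binned = [bin_score(s) for s in scores]
print(encoded, binned)
```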
2.6 Outliers
● An outlier is an anomaly in the dataset; it may occur because of a correctly captured rare event or because of erroneous data capture.
2.7 Feature Selection
● Feature selection is the process of reducing the number of attributes in the dataset to those most relevant to the data science task.
3. MODELING
3.1 Training and Testing Datasets
● The dataset used to create the model, with known attributes and target, is called the training dataset.
● The validity of the created model also needs to be checked with another known dataset, called the test dataset or validation dataset.
● To facilitate this process, the overall known dataset can be split into a training dataset and a test dataset.
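The split described above can be sketched in plain Python, assuming a small hypothetical dataset and an 80/20 split ratio.

```python
import random

# Hypothetical dataset of (attributes, label) pairs.
data = [({"x": i}, i % 2) for i in range(10)]

random.seed(42)            # fixed seed for a reproducible shuffle
shuffled = data[:]
random.shuffle(shuffled)   # shuffle so the split is not order-dependent

split = int(0.8 * len(shuffled))            # 80% train / 20% test
train, test = shuffled[:split], shuffled[split:]
print(len(train), len(test))
```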
3.2 Learning Algorithms
●
The business question and the availability of data
will dictate what data science task (association,
classification, regression, etc.,) can to be used.
●
The practitioner determines the appropriate data
science algorithm within the chosen category.
●
For example, within a classification task many
algorithms can be chosen from: decision trees,
neural networks, Bayesian models, k-NN, etc.
3.3 Evaluation of the Model
● Model evaluation is used to test the performance of the model. The model is tested with known records that were not used to build the model.
● The actual value of the output can be compared against the value predicted by the model, and thus the prediction error can be calculated.
● As long as the error is acceptable, the model is ready for deployment.
● The error rate can be used to compare this model with other models developed using different algorithms, such as neural networks or Bayesian models.
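The error-rate calculation above is a simple comparison of predicted and actual labels; the labels below are hypothetical test-set results.

```python
# Hypothetical actual vs. predicted class labels from a held-out test set.
actual    = ["yes", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no", "no",  "no", "yes", "yes"]

errors = sum(a != p for a, p in zip(actual, predicted))  # count mismatches
error_rate = errors / len(actual)
accuracy = 1 - error_rate
print(error_rate, accuracy)
```

The same error rate, computed on the same test data, gives a fair basis for comparing models built with different algorithms.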
3.4 Ensemble Modeling
● Ensemble modeling is a process where multiple diverse base models are combined to predict an outcome; the combined model usually performs better than any individual model.
Univariate Exploration
Univariate data exploration denotes the analysis of one attribute at a time, e.g., mean, median.
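Univariate statistics such as the mean and median are available in Python's standard library; the attribute values below are hypothetical.

```python
from statistics import mean, median

# Hypothetical values of a single attribute (e.g., petal length in cm).
petal_length = [1.4, 1.5, 4.7, 5.1, 1.3, 4.9]

m = mean(petal_length)      # arithmetic mean of the attribute
med = median(petal_length)  # middle value of the sorted attribute
print(m, med)
```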
Multivariate Exploration
Multivariate exploration is the study of more than one attribute in the dataset simultaneously.
This technique is critical to understanding the relationships between the attributes, which is central to data science methods.
DATA VISUALIZATION
Visualizing data is one of the most important techniques of data discovery and exploration.
The visual representation of data provides easy comprehension of complex data with multiple attributes and their underlying relationships.
It is a graphical representation of data that helps people understand patterns, trends, and insights.
It can take various forms, such as charts, graphs, and maps.
Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts.
1. Histogram
A histogram is one of the most basic visualization techniques used to understand the frequency of occurrence of values.
It shows the distribution of the data by plotting the frequency of occurrence within a range.
In a histogram, the attribute under inquiry is shown on the horizontal axis and the frequency of occurrence on the vertical axis.
Histograms are used to find the central location, range, and shape of a distribution.
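The binning behind a histogram can be sketched without a plotting library: group each value into a fixed-width range and count occurrences. The values and bin width below are hypothetical.

```python
from collections import Counter

# Hypothetical attribute values; each value falls into a bin of width 2.
values = [1.2, 1.8, 2.5, 3.1, 3.3, 3.9, 4.4, 5.0, 5.6, 7.2]
bin_width = 2
bins = Counter((v // bin_width) * bin_width for v in values)

# Frequency of occurrence per range: a text-mode histogram.
for start in sorted(bins):
    print(f"[{start}, {start + bin_width}): {'#' * bins[start]}")
```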
2. Quartile
A box whisker plot is a simple visual way of showing the distribution of a
continuous variable with information such as quartiles, median, and outliers,
overlaid by mean and standard deviation.
The quartiles are denoted by Q1, Q2, and Q3 points, which indicate the data points
with a 25% bin size. In a distribution, 25% of the data points will be below Q1, 50%
will be below Q2, and 75% will be below Q3.
The Q1 and Q3 points in a box whisker plot are denoted by the edges of the box.
The Q2 point, the median of the distribution, is indicated by a cross line within the
box. The outliers are denoted by circles at the end of the whisker line.
The diagram shows the quartile charts for all four attributes of the Iris dataset plotted side by side. Petal length can be observed to have the broadest range, and sepal width the narrowest range, of the four attributes.
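The quartile points behind a box whisker plot can be computed with the standard library; the data and the 1.5 × IQR whisker rule below are a common convention, used here as an illustrative assumption.

```python
from statistics import quantiles

# Hypothetical continuous attribute values; 9.9 is an intentional outlier.
data = [2.1, 2.5, 2.8, 3.0, 3.2, 3.4, 3.7, 4.0, 4.3, 9.9]

q1, q2, q3 = quantiles(data, n=4)  # Q1, Q2 (the median), Q3
iqr = q3 - q1                      # interquartile range: the "box" of the plot
# A common whisker rule: points beyond 1.5 * IQR from the box are outliers.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(q1, q2, q3, outliers)
```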
3. Distribution Chart
For continuous numeric attributes like petal length, instead of visualizing the actual data in the sample, its normal distribution function can be visualized.
The probability density function of a normal distribution with mean μ and standard deviation σ is given by the formula:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
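The density formula translates directly into Python; for the standard normal (μ = 0, σ = 1) the density peaks at the mean.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The standard normal peaks at x = 0 with density 1 / sqrt(2 pi).
print(normal_pdf(0.0))
```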