UNIT-I
(Data Analytics)
Data Management: Design Data Architecture and manage the data for analysis,
understand various sources of Data like Sensors/Signals/GPS etc. Data
Management, Data Quality (noise, outliers, missing values, duplicate data) and
Data Pre-processing & Processing.
Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to represent
the relationships between entities and their attributes.
Logical model:
It is a model where problems are represented in the form of logic, such as rows
and columns of data, classes, XML tags and other DBMS techniques.
Physical model:
Physical models hold the database design, such as which type of database
technology will be suitable for the architecture.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business policies,
business requirements, the technology in use, business economics, and data
processing needs.
Business requirements
Business policies
Technology in use
Business economics
Data processing needs
Business requirements:
These include factors such as the expansion of the business, the performance
of system access, data management, transaction management, making
use of raw data by converting it into image files and records, and then
storing it in data warehouses. Data warehouses are the main means of
storing transactions in a business.
Business policies:
The policies are rules that are useful for describing the way of processing
data. These policies are made by internal organizational bodies and other
government agencies.
Technology in use:
This includes drawing on examples of previously completed data architecture
designs, as well as making use of existing licensed software purchases and
database technology.
Business economics:
Economic factors such as business growth and loss, interest rates, loans, the
condition of the market, and the overall cost will also have an effect on the
design of the architecture.
Data processing needs:
These include factors such as mining of the data, large continuous
transactions, database management, and other data preprocessing needs.
Data management:
Data management is an administrative process that includes acquiring,
validating, storing, protecting, and processing required data to ensure the
accessibility, reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and
consuming data at unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization. Managing data effectively requires having
a data strategy and reliable methods to access, integrate, cleanse, govern, store
and prepare data for analytics. In our digital world, data pours into organizations
from many sources – operational and transactional systems, scanners, sensors,
smart devices, social media, video and text. But the value of data is not based on
its source, quality or format. Its value depends on what you do with it.
Motivation/Importance of Data management:
Data management plays a significant role in an organization's ability to
generate revenue and control costs.
Data management helps organizations to mitigate risks.
It enables decision making in organizations.
What are the benefits of good data management?
Optimum data quality
Improved user confidence
Efficient and timely access to data
Improved decision making in an organization
Managing Data Resources:
An information system provides users with timely, accurate, and relevant
information.
The information is stored in computer files. When files are properly
arranged and maintained, users can easily access and retrieve the
information when they need it.
If the files are not properly managed, they can lead to chaos in information
processing.
Even if the hardware and software are excellent, the information system
can be very inefficient because of poor file management.
Example:
RBD - The term Randomized Block Design originated in agricultural
research. In this design several treatments of variables are applied to different
blocks of land to ascertain their effect on the yield of the crop. Blocks are formed
in such a manner that each block contains as many plots as there are treatments,
so that one plot from each block is selected at random for each treatment.
The yield of each plot is measured after the treatment is given. These data
are then interpreted and inferences are drawn by using the Analysis of Variance
technique so as to know the effect of various treatments, like different doses of
fertilizers, different types of irrigation, etc.
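As an aside, the sketch below shows how the Analysis of Variance for such a design could be run in Python. All block labels, treatment labels and yield figures are hypothetical, chosen only to illustrate the workflow with statsmodels:

# Illustrative two-way ANOVA for a Randomized Block Design (RBD).
# Every value below is hypothetical; only the workflow matters here.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "block":      ["B1", "B1", "B1", "B2", "B2", "B2", "B3", "B3", "B3", "B4", "B4", "B4"],
    "treatment":  ["T1", "T2", "T3", "T1", "T2", "T3", "T1", "T2", "T3", "T1", "T2", "T3"],
    "crop_yield": [45, 52, 49, 47, 55, 50, 44, 51, 48, 46, 54, 47],
})

# Model the yield with treatment and block effects, then run the ANOVA.
model = ols("crop_yield ~ C(treatment) + C(block)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # F-tests for treatment and block effects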
LSD - Latin Square Design - A Latin square is one of the experimental designs
which has a balanced two-way classification scheme, for example a 4 x 4
arrangement. In this scheme each letter from A to D occurs only once in each row
and also only once in each column. It may be noted that the balanced arrangement
will not get disturbed if any row is interchanged with another.
The balanced arrangement achieved in a Latin square is its main strength. In this
design, the comparisons among treatments will be free from both differences
between rows and columns. Thus the magnitude of error will be smaller than in
any other design.
FD - Factorial Designs - This design allows the experimenter to test two or more
variables simultaneously. It also measures the interaction effects of the variables
and analyzes the impact of each of the variables. In a true experiment,
randomization is essential so that the experimenter can infer cause and effect
without any bias.
An experiment which involves multiple independent variables is known as a
factorial design.
A factor is a major independent variable; for example, a study might have two
factors, time in instruction and setting. A level is a subdivision of a factor; in that
example, time in instruction has two levels and setting has two levels.
Secondary data are data collected by someone else, not for the current
research study but for some other purpose and at a different time in the past.
If the researcher uses these data then these become secondary data for the
current users. Sources of secondary data are government publications, websites,
books, journal articles and internal records.
1. Internal Sources
2. External Sources
Internal sources are those within the organization. External sources are those
outside the organization.
Internal Sources:
If available, internal secondary data may be obtained with less time, effort and
money than external secondary data. In addition, they may also be more
pertinent to the situation at hand since they come from within the organization.
The internal sources include:
Accounting resources - These give a great deal of information which can be used
by the marketing researcher. They give information about internal factors.
Sales force reports - These give information about the sale of a product. The
information provided comes from outside the organization.
Internal experts - These are the people who head the various departments.
They can give an idea of how a particular thing is working.
Miscellaneous reports - This is the information obtained from operational
reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data
sources.
Sensors:
Here are a few examples of sensors, just to give an idea of the number and
diversity of their applications:
A photosensor detects the presence of visible light, infrared
transmission (IR) and/or ultraviolet (UV) energy.
Smart grid sensors can provide real-time data about grid
conditions, detecting outages, faults and load and triggering
alarms.
Wireless sensor networks combine specialized transducers
with a communications infrastructure for monitoring and
recording conditions at diverse locations. Commonly monitored
parameters include temperature, humidity, pressure, wind
direction and speed, illumination intensity, vibration intensity,
sound intensity, powerline voltage, chemical concentrations,
pollutant levels and vital body functions.
Signal:
The simplest form of signal is a direct current (DC) that is switched on and off;
this is the principle by which the early telegraph worked. More complex signals
consist of an alternating-current (AC) or electromagnetic carrier that contains one
or more data streams.
Data must be transformed into electromagnetic signals prior to transmission
across a network. Data and signals can be either analog or digital. A signal is
periodic if it consists of a continuously repeating pattern.
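As a small illustration of a periodic signal, the sketch below generates a continuously repeating sine waveform with NumPy; the 5 Hz frequency and 1000-sample-per-second rate are assumed values chosen only for the example:

# Generate one second of a periodic (sine) signal sampled at 1000 Hz.
# The carrier frequency and sampling rate are illustrative choices.
import numpy as np

sample_rate = 1000                      # samples per second
t = np.arange(0, 1, 1 / sample_rate)    # time axis: 0 to 1 second
carrier = np.sin(2 * np.pi * 5 * t)     # 5 Hz sine wave: the repeating pattern

# The waveform repeats every 1/5 of a second, so the signal is periodic.
print(carrier[:5])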
Global Positioning System (GPS):
The Global Positioning System (GPS) is a space-based navigation system that
provides location and time information in all weather conditions, anywhere on or
near the Earth where there is an unobstructed line of sight to four or more GPS
satellites. The system provides critical capabilities to military, civil, and
commercial users around the world. The United States government created the
system, maintains it, and makes it freely accessible to anyone with a GPS
receiver.
Quality of Data:
Data quality is the ability of your data to serve its intended purpose, based on
factors such as accuracy, completeness, consistency and reliability; these factors
play a huge role in determining data quality.
Accuracy:
Inaccurate data contains erroneous values that deviate from what is expected.
The causes of inaccurate data can be various, and include:
Human/computer errors during data entry and transmission
Users deliberately submitting incorrect values (called disguised
missing data)
Incorrect formats for input fields
Duplication of training examples
Completeness:
Lacking attribute/feature values or values of interest. The dataset might be
incomplete due to:
Unavailability of data
Deletion of inconsistent data
Deletion of data deemed irrelevant initially
Consistency: Inconsistent data means a data source contains discrepancies
between different data items. Some attributes representing a given concept may
have different names in different databases, causing inconsistencies and
redundancies. Naming inconsistencies may also occur for attribute values.
Reliability: Reliability means that data are reasonably complete and accurate,
meet the intended purposes, and are not subject to inappropriate alteration.
Some other features that also affect the data quality include timeliness (the data
is incomplete until all relevant information is submitted after certain time
periods), believability (how much the data is trusted by the user) and
interpretability (how easily the data is understood by all stakeholders).
To ensure high quality data, it’s crucial to preprocess it. To make the process
easier, data preprocessing is divided into four stages: data cleaning, data
integration, data reduction, and data transformation.
Data quality is also affected by:
Outliers
Missing values
Noisy data
Duplicate values
Outliers:
Outliers are extreme values that deviate from other observations in the data; they
may indicate variability in a measurement, experimental errors or a novelty.
Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a
five-number summary:
Minimum
First quartile (Q1)
Median
Third quartile (Q3)
Maximum
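A short sketch of how this five-number summary (and the common 1.5 x IQR outlier rule) could be computed with NumPy; the observations are illustrative:

# Five-number summary behind a boxplot; the observations are illustrative.
import numpy as np

values = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120])

minimum = values.min()
q1      = np.percentile(values, 25)   # first quartile
median  = np.median(values)
q3      = np.percentile(values, 75)   # third quartile
maximum = values.max()

# Values beyond 1.5 * IQR from the quartiles are commonly flagged as outliers.
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(minimum, q1, median, q3, maximum, outliers)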
Missing Values:
1. Deletion: Deletion methods are used when the nature of missing data is
“Missing completely at random”; otherwise, non-random missing values can bias
the model output.
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the
missing values with estimated ones. The objective is to employ known
relationships that can be identified in the valid values of the data set to assist in
estimating the missing values. Mean / Mode / Median imputation is one of the
most frequently used methods. It consists of replacing the missing data for a
given attribute by the mean or median (quantitative attribute) or mode
(qualitative attribute) of all known values of that variable.
It can be of two types:
Generalized Imputation: In this case, we calculate the mean or median
for all non-missing values of that variable and then replace the missing value
with that mean or median. In the example table (not reproduced here), the
variable “Manpower” has missing values, so we take the average of all
non-missing values of “Manpower” (28.33) and then replace the missing values
with it.
Similar case Imputation: In this case, we calculate the average for gender
“Male” (29.75) and “Female” (25) individually over the non-missing values and
then replace the missing value based on gender. For “Male”, we will replace
missing values of Manpower with 29.75 and for “Female” with 25.
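A minimal pandas sketch of both approaches; the Gender and Manpower values below are illustrative and do not reproduce the original table:

# Mean imputation of a quantitative attribute (all values are illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Manpower": [30, 28, 25, np.nan, np.nan, 31],
})

# Generalized imputation: replace missing values with the overall mean.
df["Manpower_generalized"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: replace missing values with the mean of the same gender.
df["Manpower_similar"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)
print(df)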
Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled
in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data set is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all
data in a segment by its mean, or boundary values can be used to complete the
task (a small sketch of bin-means smoothing follows this list of methods).
Smoothing by bin means: In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
Smoothing by bin median: In this method each bin value is replaced by
its bin median value.
Smoothing by bin boundary: In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering: This approach groups similar data into clusters. Outliers may
go undetected, or they will fall outside the clusters.
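Here is the promised sketch of smoothing by bin means, assuming an illustrative sorted list of values and equal-size bins of three:

# Smoothing noisy data by bin means (equal-size bins of 3); data is illustrative.
sorted_data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bin_size = 3

smoothed = []
for i in range(0, len(sorted_data), bin_size):
    bin_values = sorted_data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)              # mean of this bin
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))   # replace each value by the mean

print(smoothed)   # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0, 30.33, 30.33, 30.33]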
Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Duplicate values: A dataset may include data objects which are duplicates of
one another. It may happen when say the same person submits a form more than
once. The term deduplication is often used to refer to the process of dealing with
duplicates. In most cases, the duplicates are removed so as to not give that
particular data object an advantage or bias, when running machine learning
algorithms.
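A brief pandas sketch of deduplication; the submitted rows below are illustrative:

# Removing duplicate records (deduplication); the rows are illustrative.
import pandas as pd

submissions = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha", "Meena"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com", "meena@x.com"],
})

# Keep only the first occurrence of each duplicated row.
deduplicated = submissions.drop_duplicates(keep="first")
print(deduplicated)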
2. Data Integration:
Redundant data occurs when we merge data from multiple databases. If the
redundant data is not removed, incorrect results will be obtained during data
analysis.
There are mainly two major approaches for data integration – one is the “tight
coupling approach” and the other is the “loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand and then sends the query directly to
the source databases to obtain the result.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration,
redundancy detection, and resolution of data value conflicts. These are
explained briefly below.
Schema Integration: Integrate metadata from different sources. Matching the
real-world entities from multiple sources is referred to as the entity
identification problem.
Redundancy: An attribute may be redundant if it can be derived or
obtained from another attribute or set of attributes. Inconsistencies in
attributes can also cause redundancies in the resulting data set. Some
redundancies can be detected by correlation analysis (see the sketch after
this list).
Detection and resolution of data value conflicts: This is the third critical
issue in data integration. Attribute values from different sources may differ
for the same real-world entity. An attribute in one system may be recorded
at a lower level of abstraction than the “same” attribute in another.
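As mentioned under Redundancy above, correlation analysis can flag attributes that largely duplicate one another. A small sketch with illustrative columns (height recorded in two units, one derivable from the other):

# Detecting a redundant attribute via correlation analysis (illustrative data).
import pandas as pd

merged = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],   # derivable from height_cm
    "weight_kg": [55, 62, 70, 78, 85],
})

# A correlation close to 1 between two attributes suggests one of them is redundant.
print(merged.corr())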
3. Data Transformation: This step is taken in order to transform the data into
forms appropriate for the mining process. This involves the following ways.
Normalization: It is done in order to scale the data values into a
specified range (such as -1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_F is
the minimum and max_F is the maximum of an attribute F, and that
[new_min_F, new_max_F] is the new range. We have the formula:
v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
where v is the value you want to map into the new range and v' is the new
value you get after normalizing the old value.
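A short sketch of min-max normalization in Python, assuming illustrative values to be rescaled to the new range 0.0 to 1.0:

# Min-max normalization of an attribute to the new range [0.0, 1.0].
# The input values are illustrative.
values = [20, 35, 50, 65, 95]
new_min, new_max = 0.0, 1.0

min_f, max_f = min(values), max(values)
normalized = [
    (v - min_f) / (max_f - min_f) * (new_max - new_min) + new_min
    for v in values
]
print(normalized)   # 20 maps to 0.0 and 95 maps to 1.0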
4. Data Reduction:
Data compression:
This reduces the size of files using different encoding mechanisms
(Huffman encoding and run-length encoding). We can divide it into two types
based on the compression technique.
Lossless Compression:
Encoding techniques such as run-length encoding allow a simple and minimal
reduction in data size. Lossless data compression uses algorithms to restore
the precise original data from the compressed data.
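A minimal sketch of run-length encoding, the lossless scheme mentioned above: repeated characters are stored as (character, count) pairs, and the exact original can always be restored:

# Run-length encoding: a simple lossless compression scheme (illustrative input).
def rle_encode(text):
    encoded = []
    i = 0
    while i < len(text):
        run = 1
        while i + run < len(text) and text[i + run] == text[i]:
            run += 1
        encoded.append((text[i], run))   # store the character and its run length
        i += run
    return encoded

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

data = "AAAABBBCCD"
packed = rle_encode(data)
print(packed)                       # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(packed) == data   # lossless: the original is restored exactly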
Lossy Compression:
Methods such as the Discrete Wavelet Transform and PCA (principal
component analysis) are examples of this compression. In lossy data
compression, the decompressed data may differ from the original data but are
useful enough to retrieve information from them.
Numerosity Reduction: This enables us to store a model of the data instead of
the whole data, for example regression models.
Dimensionality Reduction: This reduces the size of data by encoding
mechanisms. It can be lossy or lossless. If the original data can be retrieved
after reconstruction from the compressed data, such a reduction is called
lossless reduction; otherwise it is called lossy reduction. The two effective
methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
5. Data Discretization: This involves reducing the number of values of a
continuous attribute by dividing the range of the attribute into intervals. Data
discretization refers to a method of converting a huge number of data values into
smaller ones so that the evaluation and management of data become easy. In
other words, data discretization is a method of converting the values of a
continuous attribute into a finite set of intervals with minimum data loss.
Suppose we have an attribute Age with a set of continuous values (a small
discretization sketch follows).
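Since the original list of Age values is not reproduced here, the sketch below assumes a few illustrative ages and discretizes them into labelled intervals with pandas:

# Discretizing a continuous Age attribute into intervals (ages are illustrative).
import pandas as pd

ages = pd.Series([5, 12, 19, 23, 31, 38, 46, 54, 63, 72])

# Map the continuous ages into a finite set of labelled intervals.
age_groups = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                    labels=["child", "young", "mature", "senior"])
print(age_groups.value_counts())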
Important Questions
1. Discuss about Data Management? How to manage the data for analysis?
2. How to design Data Architecture? What are the factors that influence the
data architecture?
3. Define Primary Data sources and also explain about the types of primary
data sources?
4. What is Data Pre-processing? Discuss about various steps to pre-process
the data?
5. Explain about Missing values and how to eliminate missing values?
6. Define Quality of data and discuss various factors that affect the Quality of
data?
7. Discuss about various sources of data?
8. What is Noisy data? Explain about methods to handle noisy data?
9. Discuss about different methods in experimental data sources (CRD, RBD,
LSD and FA)?
10. Differentiate internal and external secondary data sources with examples?
11. What is data transformation in data preprocessing and discuss about
different normalization techniques?
12. Discuss the process of handling duplicate values in organizational data.
Briefly describe various sources of data like sensors, signals, GPS in data
management?