
D.Krishna, Assoc. Prof, CSE Dept, ACE Engineering College

UNIT-I
(Data Analytics)
Data Management: Design Data Architecture and manage the data for analysis,
understand various sources of Data like Sensors/Signals/GPS etc., Data
Management, Data Quality (noise, outliers, missing values, duplicate data) and
Data Pre-processing & Processing.

Design Data Architecture:


Data architecture is the process of standardizing how organizations collect,
store, transform, distribute, and use data. The goal is to deliver relevant data to
the people who need it, when they need it, and help them make sense of it. Data
architecture design is a set of standards composed of certain policies, rules and
models.
Data is usually one of several architecture domains that form the pillars of an
enterprise architecture or solution architecture. The data architecture is formed
of three essential models:
 Conceptual model
 Logical model
 Physical model

Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to represent
the relationships between entities and their attributes.
Logical model:
It is a model where problems are represented in the form of logic, such as rows
and columns of data, classes, XML tags and other DBMS techniques.

Physical model:
The physical model holds the database design, such as which type of database
technology will be suitable for the architecture.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business
requirements, business policies, the technology in use, business economics, and
data processing needs.
 Business requirements
 Business policies
 Technology in use
 Business economics
 Data processing needs
Business requirements:
These include factors such as the expansion of the business, the performance
of system access, data management, transaction management, making use of
raw data by converting it into image files and records, and then storing it in
data warehouses. Data warehouses are the main means of storing business
transactions.
Business policies:
The policies are rules that describe the way data is processed. These policies
are made by internal organizational bodies and other government agencies.
Technology in use:
This includes following the example of previously completed data architecture
designs, as well as making use of existing licensed software purchases and
database technology.
Business economics:
Economic factors such as business growth and loss, interest rates, loans, the
condition of the market, and the overall cost will also have an effect on the
architecture design.
Data processing needs:
These include factors such as mining of the data, large continuous
transactions, database management, and other data preprocessing needs.

Data management:
Data management is an administrative process that includes acquiring,
validating, storing, protecting, and processing required data to ensure the
accessibility, reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and
consuming data at unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization. Managing data effectively requires having
a data strategy and reliable methods to access, integrate, cleanse, govern, store
and prepare data for analytics. In our digital world, data pours into organizations
from many sources – operational and transactional systems, scanners, sensors,
smart devices, social media, video and text. But the value of data is not based on
its source, quality or format. Its value depends on what you do with it.
Motivation/Importance of Data management:
 Data management plays a significant role in an organization's ability to
generate revenue and control costs.
 Data management helps organizations to mitigate risks.
 It enables decision making in organizations.
What are the benefits of good data management?
 Optimum data quality
 Improved user confidence
 Efficient and timely access to data
 Improves decision making in an organization
Managing Data Resources:
 An information system provides users with timely, accurate, and relevant
information.
 The information is stored in computer files. When files are properly
arranged and maintained, users can easily access and retrieve the
information when they need it.
 If the files are not properly managed, they can lead to chaos in information
processing.
 Even if the hardware and software are excellent, the information system
can be very inefficient because of poor file management.

Areas of Data Management:


Data Modeling: is first creating a structure for the data that you collect and use,
and then organizing this data in a way that makes it easily accessible and
efficient to store and pull for reports and analysis.
Data warehousing: is storing data effectively so that it can be accessed and used
efficiently in the future.
Data Movement: is the ability to move data from one place to another. For
instance, data needs to be moved from where it is collected to a database and
then to an end user.
Understand various sources of the Data:
Data are a special type of information generally obtained through observations,
surveys or inquiries, or generated as a result of human activity. Methods of
data collection are essential for anyone who wishes to collect data.
Data collection is a fundamental aspect and, as a result, there are different
methods of collecting data which, when used on one particular set, will result in
different kinds of data.
Collection of data refers to a purposeful gathering of information relevant to the
subject matter of the study from the units under investigation. The method of
collection of data mainly depends upon the nature, purpose and scope of the
inquiry on one hand, and the availability of resources and time on the other.
Data can be generated from two types of sources namely
1. Primary sources of data
2. Secondary sources of data

1. Primary sources of data:


Primary data refers to first-hand data gathered by the researcher himself.
Sources of primary data are surveys, observations and experimental methods.
Survey: The survey method is one of the primary sources of data, used
to collect quantitative information about items in a population. Surveys are
used in different areas for collecting data, in both the public and private sectors.
A survey may be conducted in the field by the researcher. The respondents are
contacted by the researcher personally, telephonically or through mail. This
method takes a lot of time, effort and money, but the data collected are of high
accuracy, current and relevant to the topic.
When the questions are administered by a researcher, the survey is called a
structured interview or a researcher-administered survey.
Observations: Observation is one of the primary sources of data. It is a technique
for obtaining information that involves measuring variables or gathering the data
necessary for measuring the variable under investigation.
Observation is defined as the accurate watching and noting of phenomena as they
occur in nature, with regard to cause-and-effect relations.
Interview: Interviewing is a technique that is primarily used to gain an
understanding of the underlying reasons and motivations for people’s attitudes,
preferences or behavior. Interviews can be undertaken on a personal one-to-one
basis or in a group.
Experimental Method: There are a number of experimental designs that are used
in carrying out an experiment. However, market researchers have used four
experimental designs most frequently. These are:
CRD - Completely Randomized Design
RBD - Randomized Block Design
LSD - Latin Square Design
FD - Factorial Designs
CRD: A completely randomized design (CRD) is one where the treatments are
assigned completely at random so that each experimental unit has the same
chance of receiving any one treatment.
CRD is appropriate only for experiments with homogeneous experimental
units.

Example:
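As a minimal hypothetical sketch (not a reproduction of the original example), a completely randomized assignment of three treatments A, B and C to 12 homogeneous experimental units could be generated in Python as follows:

```python
import random

treatments = ["A", "B", "C"]            # hypothetical treatments
units = list(range(1, 13))              # 12 homogeneous experimental units

# Replicate each treatment equally (4 units per treatment) and shuffle,
# so every unit has the same chance of receiving any one treatment.
assignment = treatments * (len(units) // len(treatments))
random.shuffle(assignment)

for unit, treatment in zip(units, assignment):
    print(f"Unit {unit:2d} -> Treatment {treatment}")
```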

RBD - The term Randomized Block Design has originated from agricultural
research. In this design, several treatments of variables are applied to different
blocks of land to ascertain their effect on the yield of the crop. Blocks are formed
in such a manner that each block contains as many plots as the number of
treatments, so that one plot from each block is selected at random for each treatment.
The production of each plot is measured after the treatment is given. These data
are then interpreted and inferences are drawn by using the Analysis of Variance
technique, so as to know the effect of various treatments like different doses of
fertilizers, different types of irrigation, etc.

LSD - Latin Square Design - A Latin square is one of the experimental designs
which has a balanced two-way classification scheme, say for example a 4 × 4
arrangement. In this scheme each letter from A to D occurs only once in each row
and also only once in each column. It may be noted that this balanced
arrangement will not get disturbed if any row is interchanged with another.

The balanced arrangement achieved in a Latin square is its main strength. In this
design, the comparisons among treatments will be free from both differences
between rows and columns. Thus the magnitude of error will be smaller than in
any other design.

FD - Factorial Designs - This design allows the experimenter to test two or more
variables simultaneously. It also measures interaction effects of the variables and
analyzes the impacts of each of the variables. In a true experiment,
randomization is essential so that the experimenter can infer cause and effect
without any bias.
An experiment which involves multiple independent variables is known as a
factorial design.
A factor is a major independent variable; a level is a subdivision of a factor. For
example, the two factors might be time in instruction and setting, with time in
instruction having two levels and setting having two levels.

For example, suppose a botanist wants to understand the effects of
sunlight (low vs. high) and watering frequency (daily vs. weekly) on the
growth of a certain species of plant.
This is an example of a 2×2 factorial design because there are two
independent variables, each with two levels:
Independent variable #1: Sunlight (Levels: Low, High)
Independent variable #2: Watering Frequency (Levels: Daily, Weekly)
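As a small illustrative sketch (hypothetical, Python), the four treatment combinations of this 2×2 factorial design can be listed directly:

```python
from itertools import product

sunlight = ["Low", "High"]              # factor 1 with two levels
watering = ["Daily", "Weekly"]          # factor 2 with two levels

# A 2x2 factorial design tests every combination of the two factors.
for i, (sun, water) in enumerate(product(sunlight, watering), start=1):
    print(f"Treatment {i}: Sunlight={sun}, Watering={water}")
```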
2. Secondary sources of Data:
Secondary sources are data collected by someone else earlier. Secondary data
are data collected by a party not related to the research study, but collected
for some other purpose and at a different time in the past.
If the researcher uses these data then they become secondary data for the
current users. Sources of secondary data are government publications, websites,
books, journal articles and internal records.
1. Internal Sources
2. External Sources
Internal Sources - These are within the organization. External Sources - These are
outside the organization.

 Internal Sources:
If available, internal secondary data may be obtained with less time, effort and
money than the external secondary data. In addition, they may also be more
pertinent to the situation at hand since they are from within the organization.
The internal sources include
Accounting resources- These give a lot of information which can be used by
the marketing researcher. They give information about internal factors.
Sales Force Reports- These give information about the sale of a product. The
information provided is from outside the organization.
Internal Experts- These are people who head the various departments.
They can give an idea of how a particular thing is working.
Miscellaneous Reports- This is the information obtained from operational
reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data
sources.

 External Sources of Data:


External Sources are sources which are outside the company in a larger
environment. Collection of external data is more difficult because the data have
much greater variety and the sources are much more numerous.
Government Publications- Government sources provide an extremely rich pool
of data for the researchers. In addition, many of these data are available free of
cost on internet websites.
There are a number of government agencies generating data.
These are:

Registrar General of India- It is an office which generates demographic data. It
includes details of gender, age, occupation, etc.
Central Statistical Organization- This organization publishes the national
accounts statistics. It contains estimates of national income for several years,
growth rate, and rate of major economic activities. Annual survey of Industries is
also published by the CSO. It gives information about the total number of
workers employed, production units, material used and value added by the
manufacturer.
Director General of Commercial Intelligence- This office operates from
Kolkata. It gives information about foreign trade i.e. import and export. These
figures are provided region-wise and country-wise.
Ministry of Commerce and Industries- This ministry through the office of
economic advisor provides information on wholesale price index. These indices
may be related to a number of sectors like food, fuel, power, food grains, etc. It
also generates the All India Consumer Price Index numbers for industrial workers,
urban non-manual employees and agricultural labourers.
Planning Commission- It provides the basic statistics of Indian Economy.
Reserve Bank of India- This provides information on banking, savings and
investment. RBI also prepares currency and finance reports.
Labour Bureau- It provides information on skilled, unskilled and white-collar jobs.
National Sample Survey- This is done by the Ministry of Planning and it provides
social, economic, demographic, industrial and agricultural statistics.
Department of Economic Affairs- It conducts economic survey and it also
generates information on income, consumption, expenditure, investment, savings
and foreign trade.
State Statistical Abstract- This gives information on various types of activities
related to the state like - commercial activities, education, occupation etc.
Non-Government Publications- These include publications of various
industrial and trade associations, such as the Indian Cotton Mill Association and
various chambers of commerce.
Understand various sources of Data like Sensors/signal/GPS etc:
Sensor data:
 Sensor data is the output of a device that detects and responds to some
type of input from the physical environment. The output may be used to
provide information or input to another system or to guide a process.

 Here are a few examples of sensors, just to give an idea of the number and
diversity of their applications:
 A photosensor detects the presence of visible light, infrared
transmission (IR) and/or ultraviolet (UV) energy.
 Smart grid sensors can provide real-time data about grid
conditions, detecting outages, faults and load and triggering
alarms.
 Wireless sensor networks combine specialized transducers
with a communications infrastructure for monitoring and
recording conditions at diverse locations. Commonly monitored
parameters include temperature, humidity, pressure, wind
direction and speed, illumination intensity, vibration intensity,
sound intensity, powerline voltage, chemical concentrations,
pollutant levels and vital body functions.
Signal:
The simplest form of signal is a direct current (DC) that is switched on and off;
this is the principle by which the early telegraph worked. More complex signals
consist of an alternating-current (AC) or electromagnetic carrier that contains one
or more data streams.
Data must be transformed into electromagnetic signals prior to transmission
across a network. Data and signals can be either analog or digital. A signal is
periodic if it consists of a continuously repeating pattern.
Global Positioning System (GPS):
The Global Positioning System (GPS) is a space based navigation system that
provides location and time information in all weather conditions, anywhere on or
near the Earth where there is an unobstructed line of sight to four or more GPS
satellites. The system provides critical capabilities to military, civil, and
commercial users around the world. The United States government created the
system, maintains it, and makes it freely accessible to anyone with a GPS
receiver.
Quality of Data:
Data quality is the ability of your data to serve its intended purpose, based on
factors such as accuracy, completeness, consistency and reliability; these factors
play a huge role in determining data quality.

Accuracy:
Erroneous values that deviate from the expected. The causes for inaccurate data
can be various, which include:
 Human/computer errors during data entry and transmission
 Users deliberately submitting incorrect values (called disguised
missing data)
 Incorrect formats for input fields
 Duplication of training examples
Completeness:
Lacking attribute/feature values or values of interest. The dataset might be
incomplete due to:
 Unavailability of data
 Deletion of inconsistent data
 Deletion of data deemed irrelevant initially
Consistency: An inconsistent data source contains discrepancies between
different data items. Some attributes representing a given concept may have
different names in different databases, causing inconsistencies and
redundancies. Naming inconsistencies may also occur for attribute values.
Reliability: Reliability means that data are reasonably complete and accurate,
meet the intended purposes, and are not subject to inappropriate alteration.
Some other features that also affect the data quality include timeliness (the data
is incomplete until all relevant information is submitted after certain time
periods), believability (how much the data is trusted by the user) and
interpretability (how easily the data is understood by all stakeholders).
To ensure high quality data, it’s crucial to preprocess it. To make the process
easier, data preprocessing is divided into four stages: data cleaning, data
integration, data reduction, and data transformation.
Data quality is also affected by
 Outliers
 Missing Values
 Noisy data
 Duplicate Values
Outliers:
Outliers are extreme values that deviate from other observations in the data;
they may indicate variability in a measurement, experimental errors or a novelty.
An outlier is a point or an observation that deviates significantly from the other
observations.
Outlier detection from graphical representation:
 Scatter plot and
 Box plot
Scatter plot:
Scatter plots are used to plot data points on a horizontal and a vertical axis in the
attempt to show how much one variable is affected by another. A scatter
plot uses dots to represent values for two different numeric variables.

Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a
five-number summary:
 Minimum
 First quartile (Q1)
 Median
 Third quartile (Q3)
 Maximum
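As a minimal sketch (assuming NumPy is available; the sample values are made up), the five-number summary behind a box plot, and the usual 1.5 × IQR rule for flagging outliers, can be computed like this:

```python
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()
print("Five-number summary:", minimum, q1, median, q3, maximum)

# Box-plot convention: points beyond 1.5 * IQR from the quartiles are outliers
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("Outliers:", data[(data < lower) | (data > upper)])   # -> [95]
```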

Most common causes of outliers in a data set:
 Data entry errors (human errors)
 Measurement errors (instrument errors)
 Experimental errors (data extraction or experiment
planning/executing errors)
 Intentional (dummy outliers made to test detection methods)
 Data processing errors (data manipulation or data set unintended
mutations)
 Sampling errors (extracting or mixing data from wrong or various
sources)
 Natural (not an error, novelties in data)
How to remove Outliers?
Most of the ways to deal with outliers are similar to the methods for missing
values, like deleting observations, transforming them, binning them, treating them
as a separate group, imputing values and other statistical methods. Here, we will
discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry
errors or data processing errors, or if the outlier observations are very small in
number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate
outliers. Taking the natural log of a value reduces the variation caused by extreme
values. Binning is also a form of variable transformation. The Decision Tree
algorithm deals with outliers well due to the binning of variables. We can also use
the process of assigning weights to different observations.
Imputing: Like the imputation of missing values, we can also impute outliers. We
can use mean, median or mode imputation methods. Before imputing values, we
should analyse whether the outlier is natural or artificial. If it is artificial, we can
go ahead with imputing values. We can also use a statistical model to predict the
values of outlier observations and then impute them with the predicted values.
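The pandas sketch below (hypothetical column and values) illustrates three of these treatments: trimming observations outside the 1.5 × IQR fences, capping them at the fences, and applying a log transformation to shrink the effect of extreme values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [22, 25, 27, 30, 31, 33, 35, 38, 40, 400]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 1. Deleting observations: trim rows lying outside the fences
trimmed = df[(df["income"] >= lower) & (df["income"] <= upper)]

# 2. Capping: keep the rows but pull extreme values back to the fences
df["income_capped"] = df["income"].clip(lower, upper)

# 3. Transforming: a natural log reduces variation caused by extreme values
df["income_log"] = np.log(df["income"])

print(trimmed.shape)
print(df)
```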
Missing data:
Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behavior and
relationship with other variables correctly. It can lead to wrong prediction or
classification.
Why my data has missing values?

We looked at the importance of treating missing values in a dataset. Now,
let’s identify the reasons for the occurrence of these missing values. They may
occur at two stages:
1. Data Extraction: It is possible that there are problems with the extraction
process. In such cases, we should double-check for correct data with the data
guardians. Some hashing procedures can also be used to make sure data
extraction is correct. Errors at data extraction stage are typically easy to find and
can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder
to correct. They can be categorized into four types:
 Missing completely at random: This is a case when the probability
of a value being missing is the same for all observations. For example:
respondents of a data collection process decide whether they will declare
their earnings after tossing a fair coin. If a head occurs, the respondent
declares his/her earnings, and vice versa. Here each observation has an
equal chance of having a missing value.
 Missing at random: This is a case when a variable is missing at
random and the missing ratio varies for different values/levels of other
input variables. For example: we are collecting data for age, and females
have a higher rate of missing values compared to males.
 Missing that depends on unobserved predictors: This is a case
when the missing values are not random and are related to an
unobserved input variable. For example: in a medical study, if a
particular diagnostic causes discomfort, then there is a higher chance
of dropping out of the study. This missing value is not random
unless we have included “discomfort” as an input variable for all
patients.
 Missing that depends on the missing value itself: This is a case
when the probability of a value being missing is directly correlated with
the value itself. For example: people with higher or lower incomes
are likely not to respond about their earnings.
What are the methods to treat missing values?
1. Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion.

 In list wise deletion, we delete observations where any of the variables is
missing. Simplicity is one of the major advantages of this method, but it
reduces the power of the model because it reduces the sample size.
 In pair wise deletion, we perform the analysis with all cases in which the
variables of interest are present. The advantage of this method is that it
keeps as many cases as possible available for analysis. One of the
disadvantages is that it uses different sample sizes for different variables.

 Deletion methods are used when the nature of the missing data is “Missing
completely at random”; otherwise, non-random missing values can bias the
model output.
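A minimal pandas sketch of the two deletion strategies on a small hypothetical data frame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":      [25, 30, np.nan, 45, 50],
    "income":   [40, np.nan, 55, 60, np.nan],
    "expenses": [20, 25, 30, np.nan, 35],
})

# List wise deletion: drop every row with any missing value,
# so every analysis uses the same (smaller) sample.
listwise = df.dropna()

# Pair wise deletion: each pairwise statistic uses all rows where both
# variables are present, so different statistics use different sample sizes.
pairwise_corr = df.corr()

print(listwise)
print(pairwise_corr)
```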
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the
missing values with estimated ones. The objective is to employ known
relationships that can be identified in the valid values of the data set to assist in
estimating the missing values. Mean / Mode / Median imputation is one of the
most frequently used methods. It consists of replacing the missing data for a
given attribute by the mean or median (quantitative attribute) or mode
(qualitative attribute) of all known values of that variable.
It can be of two types:
 Generalized Imputation: In this case, we calculate the mean or median
of all non-missing values of that variable and then replace the missing value
with that mean or median. For example, if the variable “Manpower” has
missing values, we take the average of all non-missing values of
“Manpower” (28.33) and then replace the missing values with it.
 Similar case Imputation: In this case, we calculate the average for gender
“Male” (29.75) and “Female” (25) individually over the non-missing values,
and then replace the missing value based on gender. For “Male”, we replace
missing values of Manpower with 29.75 and for “Female” with 25.
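A short pandas sketch of both imputation styles on a small hypothetical data frame (the column names and numbers below are made up and are not the Manpower figures quoted above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Male", "Female", "Male", "Female"],
    "Manpower": [30, 29, 25, np.nan, np.nan, 31, 26],
})

# Generalized imputation: fill missing values with the overall mean
df["Manpower_general"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: fill missing values with the mean of the same
# gender group (Male mean for Male rows, Female mean for Female rows)
group_mean = df.groupby("Gender")["Manpower"].transform("mean")
df["Manpower_similar"] = df["Manpower"].fillna(group_mean)

print(df)
```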

Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled
in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all the
data in a segment by its mean, or boundary values can be used to complete the
task.
 Smoothing by bin means: In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
 Smoothing by bin median: In this method each bin value is replaced by
its bin median value.
 Smoothing by bin boundary: In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
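A plain Python sketch of the three smoothing variants, using data split into equal-size bins of three values (hypothetical numbers):

```python
# Sorted data divided into equal-size bins of 3 values each (hypothetical)
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 28])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

for b in bins:
    by_mean = [round(sum(b) / len(b), 2)] * len(b)     # smoothing by bin means
    by_median = [b[len(b) // 2]] * len(b)              # smoothing by bin medians
    lo, hi = b[0], b[-1]                               # bin boundaries
    by_boundary = [lo if v - lo <= hi - v else hi for v in b]
    print(b, "->", by_mean, by_median, by_boundary)
```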
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering: This approach groups similar data into clusters. The outliers may go
undetected, or they will fall outside the clusters.
Incorrect attribute values may be due to:
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
Duplicate values: A dataset may include data objects which are duplicates of
one another. It may happen when say the same person submits a form more than
once. The term deduplication is often used to refer to the process of dealing with
duplicates. In most cases, the duplicates are removed so as to not give that
particular data object an advantage or bias, when running machine learning
algorithms.
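A minimal pandas sketch of deduplication on a hypothetical form-submission table:

```python
import pandas as pd

submissions = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha", "Meena"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com", "meena@x.com"],
})

# Drop exact duplicate rows, keeping the first occurrence, so a repeated
# submission does not bias later analysis or machine learning algorithms.
deduplicated = submissions.drop_duplicates(keep="first")
print(deduplicated)
```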

Redundant data occurs when we merge data from multiple databases. If the
redundant data is not removed, incorrect results will be obtained during data
analysis. Redundant data occurs due to the following reasons:

 Object identification: The same attribute or object may have different
names in different databases.

 Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue.

Redundant attributes may be detected by correlation analysis. Careful
integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality.
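A small pandas sketch of detecting a derivable (redundant) attribute through correlation analysis, with made-up revenue columns where one is simply 12 times the other:

```python
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 12, 15, 20, 22],
    "annual_revenue":  [120, 144, 180, 240, 264],   # derived: 12 * monthly
    "employees":       [5, 6, 9, 7, 11],
})

# A correlation close to 1.0 between two attributes suggests redundancy
corr = df.corr()
print(corr.loc["monthly_revenue", "annual_revenue"])   # -> ~1.0
```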
Data Pre-processing:
Data preprocessing is a data mining technique that involves transforming
raw data into an understandable format. Real-world data is often incomplete,
inconsistent, and/or lacking in certain behaviors or trends, and is likely to
contain many errors.
Major Tasks in Data Preprocessing are
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction
1. Data Cleaning: Data is cleansed through processes such as filling in missing
values, smoothing the noisy data, or resolving the inconsistencies in the data.
Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., occupation=“ ”
Noisy: containing errors or outlier values that deviate from the expected.
e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”

e.g., was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
2. Data Integration: Data with different representations are put together and
conflicts within the data are resolved. Integration of multiple databases, data
cubes, or files.

There are mainly 2 major approaches for data integration – one is “Tight coupling
approach” and another is “Loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand and then sends the query directly to
the source databases to obtain the result.
Issues in Data Integration:
There are three issues to consider during data integration: Schema Integration,
Redundancy Detection, and resolution of data value conflicts. These are
explained in brief below.
 Schema Integration: Integrate metadata from different sources. Matching
real-world entities from multiple data sources is referred to as the entity
identification problem.
 Redundancy: An attribute may be redundant if it can be derived or
obtained from another attribute or set of attributes. Inconsistencies in
attributes can also cause redundancies in the resulting data set. Some
redundancies can be detected by correlation analysis.
 Detection and resolution of data value conflicts: This is the third critical
issue in data integration. Attribute values from different sources may differ
for the same real-world entity. An attribute in one system may be recorded
at a lower level of abstraction than the “same” attribute in another.
3. Data Transformation: This step is taken in order to transform the data into
forms appropriate for the mining process. This involves the following ways:
 Normalization: It is done in order to scale the data values into a
specified range (-1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_F is the minimum
and max_F is the maximum of an attribute F, and that [new_min_F, new_max_F]
is the new range. We have the formula:

v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F

where v is the value you want to map into the new range and v' is the new
value you get after normalizing the old value. (A short code sketch of this
normalization appears after this list.)

 Attribute Selection: In this strategy, new attributes are constructed
from the given set of attributes to help the mining process.
 Discretization: This is done to replace the raw values of numeric
attribute by interval levels or conceptual levels.
 Concept Hierarchy Generation: Here attributes are converted from a
lower level to a higher level in the hierarchy. For example, the attribute
“city” can be converted to “country”.
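The sketch referenced above: a minimal min-max normalization of a hypothetical marks column into the range 0.0 to 1.0 (names and values are illustrative):

```python
import pandas as pd

marks = pd.Series([20, 35, 50, 65, 95])

new_min, new_max = 0.0, 1.0
min_F, max_F = marks.min(), marks.max()

# v' = ((v - min_F) / (max_F - min_F)) * (new_max - new_min) + new_min
normalized = (marks - min_F) / (max_F - min_F) * (new_max - new_min) + new_min
print(normalized.tolist())      # all values now lie between 0.0 and 1.0
```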
4. Data Reduction: Data mining is a technique that is used to handle huge
amounts of data. While working with a huge volume of data, analysis becomes
harder. In order to get rid of this problem, we use data reduction techniques.
Data reduction aims to increase storage efficiency and reduce data storage and
analysis costs.
The various steps to data reduction are:
 Data Cube Aggregation: Aggregation operation is applied to data for the
construction of the data cube.
 Attribute Subset Selection: The highly relevant attributes should be used;
the rest can be discarded. For performing attribute selection, one can use
the level of significance and the p-value of the attribute: an attribute having
a p-value greater than the significance level can be discarded.

 Data compression
It reduces the size of the files using different encoding mechanisms
(Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques.
Lossless Compression:
Encoding techniques (such as Run Length Encoding) allow a simple and minimal
data size reduction. Lossless data compression uses algorithms to restore the
precise original data from the compressed data.
Lossy Compression:
Methods such as the Discrete Wavelet Transform and PCA (Principal Component
Analysis) are examples of this compression. In lossy data compression, the
decompressed data may differ from the original data but are useful enough to
retrieve information from them.
 Numerosity Reduction: This enables us to store a model of the data instead
of the whole data, for example regression models.
 Dimensionality Reduction: This reduces the size of the data by encoding
mechanisms. It can be lossy or lossless. If the original data can be retrieved
after reconstruction from the compressed data, such reduction is called
lossless reduction; otherwise it is called lossy reduction. The two effective
methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
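A short sketch of dimensionality reduction with PCA (assuming scikit-learn is installed; the data is random and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 samples, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one feature nearly redundant

# Keep only 2 principal components: a lossy reduction from 5 features to 2
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # share of variance kept per component
```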
5. Data Discretization: This involves reducing the number of values of a
continuous attribute by dividing the range of the attribute into intervals. Data
discretization refers to a method of converting a huge number of data values into
smaller ones so that the evaluation and management of the data become easy. In
other words, data discretization is a method of converting the values of a
continuous attribute into a finite set of intervals with minimum data loss.
Suppose we have an attribute of Age with the given values

Attribute: Age
Values: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization:
Child: 1, 5, 4, 9, 7
Young: 11, 14, 17, 13, 18, 19
Mature: 31, 33, 36, 42, 44, 46
Old: 70, 74, 77, 78
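A short pandas sketch of the same discretization using pd.cut (the bin edges below are chosen to reproduce the Child/Young/Mature/Old grouping and are otherwise arbitrary):

```python
import pandas as pd

age = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                 31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Convert the continuous Age values into a finite set of labelled intervals
labels = ["Child", "Young", "Mature", "Old"]
groups = pd.cut(age, bins=[0, 10, 30, 60, 100], labels=labels)

print(pd.concat([age, groups], axis=1, keys=["Age", "Group"]))
```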



Important Questions
1. Discuss about Data Management? How to manage the data for analysis?
2. How to design Data Architecture? What are the factors that influence the
data architecture?
3. Define Primary Data sources and also explain about the types of primary
data sources?
4. What is Data Pre-processing? Discuss about various steps to pre-process
the data?
5. Explain about Missing values and how to eliminate missing values?
6. Define Quality of data and discuss various factors that affect the Quality of
data?
7. Discuss about various sources of data?
8. What is Noisy data? Explain about methods to handle noisy data?
9. Discuss about different methods in experimental data sources (CRD, RBD,
LSD and FD)?
10. Differentiate internal and external secondary data sources with examples?
11. What is data transformation in data preprocessing and discuss about
different normalization techniques?
12. Discuss the process of handling duplicate values in organizational data.
Briefly describe various sources of data like sensors, signals, GPS in data
management?
