Unit-1 Introduction To Data Mining
The Knowledge Discovery in Databases (KDD) process comprises several steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined in a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
• Data transformation: also known as data consolidation, it is a phase in which the
selected data is transformed into forms appropriate for the mining procedure.
• Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: in this step, the strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
It is common to combine some of these steps together. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.
KDD is an iterative process. Once the discovered knowledge is presented to the user,
the evaluation measures can be enhanced, the mining can be further refined, new data can
be selected or further transformed, or new data sources can be integrated, in order to get
different, more appropriate results. Data mining derives its name from the similarities
between searching for valuable information in a large database and mining rocks for a vein
of valuable ore. Both imply either sifting through a large amount of material or ingeniously
probing the material to exactly pinpoint where the values reside. It is, however, a misnomer,
since mining for gold in rocks is usually called “gold mining” and not “rock mining”, thus
by analogy, data mining should have been called “knowledge mining” instead.
Nevertheless, data mining became the accepted customary term, and very rapidly grew into a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD), which describe a more complete process. Other similar terms referring to data mining are: data dredging, knowledge extraction and pattern discovery.
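For illustration only (not part of the original notes), here is a minimal Python/pandas sketch of the pre-mining KDD phases (cleaning, selection and transformation) on a made-up table; all column names and cleaning rules are assumptions:

import pandas as pd

# Hypothetical raw data collection (in practice this would come from
# multiple, possibly heterogeneous, sources).
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [25.0, 31.0, 31.0, None, 199.0],   # None = missing, 199 = noise
    "monthly_spend": [120.0, 80.5, 80.5, 64.0, 210.0],
})

# Data cleaning: drop duplicate records, remove missing and noisy values.
clean = raw.drop_duplicates().dropna()
clean = clean[clean["age"].between(0, 120)]

# Data selection: keep only the attributes relevant to the analysis.
selected = clean[["age", "monthly_spend"]]

# Data transformation: scale values into a form suitable for mining.
transformed = (selected - selected.mean()) / selected.std()
print(transformed)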
Predictive and Descriptive Techniques
Predictive Data Mining:
The main goal of this type of mining is to say something about future results, not about current behaviour. It uses supervised learning functions to predict the target value. The methods that come under this category are classification, time-series analysis and regression. Predictive analysis requires modelling of the data, and it works by using a few variables of the present to predict the unknown future values of other variables.
Descriptive Data Mining:
This term is basically used to produce correlations, cross-tabulations, frequencies, etc. These techniques are used to determine the similarities in the data and to find existing patterns. One more application of descriptive analysis is to discover captivating subgroups in the major part of the data available.
This analysis emphasizes the summarization and transformation of the data into meaningful information for reporting and monitoring.
Descriptive vs. predictive data mining:
4. Requirement: descriptive mining requires data aggregation and data mining, while predictive mining requires statistics and forecasting methods.
5. Type of approach: descriptive mining is a reactive approach, while predictive mining is a proactive approach.
Outlier Analysis:
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. The analysis of outlier data is referred to as outlier
mining.
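As a hedged illustration (not from the original notes), one simple way to flag such outliers is a z-score rule; the 2-standard-deviation threshold below is an arbitrary assumption:

import numpy as np

data = np.array([40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50, 150])

# Flag values lying more than 2 standard deviations from the mean.
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 2]
print(outliers)  # 150 does not comply with the general behaviour of the data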
Supervised learning:
Supervised learning deals with, or learns from, "labeled" data. This implies that some data is already tagged with the correct answer.
• Regression: a regression problem is when the output variable is a real value, such as "dollars" or "weight".
Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
Advantages:-
• Supervised learning allows collecting data and produces data output from previous
experiences.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages:-
• Classifying big data can be challenging.
• Training a supervised model needs a lot of computation time, so it can be very time-consuming.
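To make this concrete, here is a minimal, illustrative sketch of supervised classification using scikit-learn's k-nearest-neighbours classifier on its bundled Iris dataset (chosen for demonstration, not part of the original notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: each flower is already tagged with the correct species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Learn from the labeled training examples, then predict unseen ones.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))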
Unsupervised learning:
Unsupervised learning is the training of a machine using information that is neither
classified nor labelled and allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
It allows the model to work on its own to discover patterns and information that was
previously undetected. It mainly deals with unlabelled data.
• Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
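As a brief illustration of clustering (the synthetic two-group data and the choice of k = 2 are assumptions, not from the original notes), a minimal k-means example with scikit-learn:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points, with no tags provided.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# The algorithm groups the points by similarity on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # roughly (0, 0) and (5, 5)
print(kmeans.labels_[:10])      # cluster assignment per point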
Supervised vs. Unsupervised Machine Learning
Computational complexity: supervised learning is the simpler method, while unsupervised learning is computationally complex.
Major Issues in Data Mining
1. Mining Methodology
2. User Interaction
1) Mining Methodology
Data mining covers a wide spectrum of data analysis and knowledge discovery
tasks, so these tasks may use the same database in different ways and require the
development of numerous data mining techniques.
When searching for knowledge in large data sets, we can explore the data in
multidimensional space.
That is, we can search for interesting patterns among combinations of dimensions
(attributes) at varying levels of abstraction. Such mining is known as (exploratory)
multidimensional data mining.
For example, to mine data containing natural language text, it makes sense to fuse data mining methods with techniques from information retrieval and natural language processing.
Errors and noise may confuse the data mining process, leading to the derivation of
erroneous patterns.
2) User Interaction
Interactive mining
The data mining process should be highly interactive. Thus, it is important to build
flexible user interfaces and an exploratory mining environment, facilitating the
user’s interaction with the system.
How can a system present data mining results vividly (forming a clear image in the mind) and flexibly, so that the discovered knowledge can be easily understood and directly used by humans?
Data mining algorithms must be efficient and scalable in order to effectively extract information from the huge amounts of data lying in many data repositories or in dynamic data streams. In other words, the running time of a data mining algorithm must be predictable, short, and acceptable to applications. Efficiency, scalability, performance, optimization, and the ability to execute in real time are key criteria for new mining algorithms.
The giant size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that motivate
the development of parallel and distributed data-intensive mining algorithms.
Data mining must also handle the various types of attributes and the different types of data found in databases and datasets.
Mining dynamic, networked, and global data repositories
Data from multiple sources are connected by the Internet and by various kinds of networks, forming distributed and heterogeneous global information systems. The discovery of knowledge from different sources of structured, semi-structured, or unstructured data is challenging. Web mining, multisource data mining and information network mining have become challenging and fast-evolving data mining fields.
With data mining penetrating our everyday lives, it is important to study the impact of data mining on society. How can we use data mining technology to benefit our society? How can we guard against its misuse?
Invisible data mining: We cannot expect everyone in society to learn and master data mining techniques. For example, when purchasing items online, users may be unaware that the store is likely collecting data on the buying patterns of its customers, which may be used to recommend other items for purchase in the future.
Types of Attributes
1) Nominal
2) Ordinal
3) Interval
4) Ratio
Nominal attribute
Nominal attributes are named attributes whose values can be separated into discrete (individual) categories which do not overlap.
Example: eye colour, blood group, zip code.
Ordinal attribute
For an ordinal attribute the order of the values is important and significant, but the differences between the values are not really known.
Example: ratings, e.g. poor → average → good → excellent.
Interval attribute
An interval attribute comes in the form of a numerical value where the difference between points is meaningful, but there is no true zero point.
Example: temperature in °C or °F, calendar dates.
Ratio attribute
A ratio attribute looks like an interval attribute, but it must have a true (absolute) zero value.
It tells us about the order and the exact value between units of data.
Example: height, weight, age, monetary quantities.
Learning Objectives
• Describe data using measures of central tendency and dispersion: for a set of individual data values, and for a set of grouped data.
• When we gather data, we want to uncover the “information” in it. One easy way to
do that is to think of: “Shape –Center- Spread”
Key Terms
Mean:
Mean is the average of a dataset.
To find the mean, calculate the sum of all the data and then divide by the total
number of data.
Example
✔ Find the mean of 12, 15, 11, 11, 7, 13.
Mean = (12 + 15 + 11 + 11 + 7 + 13) / 6 = 69 / 6 = 11.5
Median
Median is the middle number in a dataset when the data is arranged in numerical order
(Sorted Order).
Median (odd number of values)
▪ Example
✓ Find the median of 12, 15, 11, 11, 7, 13, 15.
Sorted order: 7, 11, 11, 12, 13, 15, 15 → the middle (4th) value is 12, so the median is 12.
Median (even number of values)
Example:
Find the median of 12, 15, 11, 11, 7, 13.
Sorted order: 7, 11, 11, 12, 13, 15 → the two middle values are 11 and 12, so the median is (11 + 12) / 2 = 11.5.
Mode
Mode is the value that occurs most frequently in a dataset.
Example: for 12, 15, 11, 11, 7, 13 the mode is 11, since it occurs twice.
Range
The range of a set of data is the difference between the largest and the smallest number
in the set.
Example
Find the range of 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50.
Range = largest − smallest = 55 − 26 = 29.
Standard deviation can be thought of as measuring how far the data values lie from the mean: we take the mean and move one standard deviation in either direction.
The mean for this example is 49.2 and the standard deviation is 17.
Now, 49.2 − 17 = 32.2 and 49.2 + 17 = 66.2.
This means that most of the data probably lie between 32.2 and 66.2.
If all the data values are the same, then the variance and standard deviation are 0 (zero).
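These measures can be checked quickly with Python's built-in statistics module, applied to the example datasets above (a sketch for verification only):

import statistics

data = [12, 15, 11, 11, 7, 13]
print(statistics.mean(data))    # 11.5
print(statistics.median(data))  # 11.5 (even count: average of 11 and 12)
print(statistics.mode(data))    # 11

range_data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]
print(max(range_data) - min(range_data))  # range = 29
print(statistics.pstdev(range_data))      # population standard deviation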
Data Visualization
Data types to be visualized
1) 1-D data, where the dimension is usually very dense.
E.g. temporal data, like time series of stock prices.
2) 2-D data.
E.g. geographical maps.
3) Multi-dimensional data.
E.g. tables from relational databases.
There is no simple mapping of the attributes to the two dimensions of the screen.
4) Text and hypertext, e.g. news articles.
Most of the standard visualization techniques cannot be applied. In most cases, a
transformation of the data into description vectors is necessary first.
E.g. word counting, then principal component analysis.
5) Hierarchies and graphs
E.g. telephone calls
6) Algorithms and software
E.g. for debugging operations
Visualization techniques
1) Standard 2D/3D displays
e.g. bar charts and x-y plots.
2) Geometrically transformed displays
e.g. parallel coordinates.
3) Icon-based displays (glyphs)
4) Dense pixel displays
5) Stacked displays
Tailored to present data partitioned in a hierarchical fashion.
Embed one coordinate system inside another coordinate system.
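As an illustrative sketch of the standard 2D displays listed above (a bar chart and an x-y plot), using matplotlib with made-up values:

import matplotlib.pyplot as plt

# Hypothetical data, for demonstration only.
categories = ["A", "B", "C", "D"]
counts = [23, 17, 35, 29]
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, counts)  # standard 2D bar chart
ax1.set_title("Bar chart")
ax2.scatter(x, y)            # standard x-y plot
ax2.set_title("x-y plot")
plt.show()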
Interaction and distortion techniques
Dynamic: changes to visualizations are made automatically
Interactive: changes are made manually
1) Dynamic projections
e.g. To show all interesting two-dimensional projections of a multi-dimensional dataset as a
series of scatter plots.
2) Interactive filtering
browsing: direct selection of the desired subset
querying: specify properties of the desired subsets
3) Interactive zooming
On higher zoom levels, more details are shown.
4) Interactive distortion
Show portions of the data with a high level of detail, while others are shown with a lower level of detail.
E.g. spherical distortion and fisheye views.
5) Interactive Linking and Brushing
– Combine different visualization methods to overcome the shortcomings of
single techniques.
– Changes to one visualization are automatically reflected in the other
visualization.
Measuring Data Similarity and Dissimilarity
– Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, which is typically metric: d(i, j)
– There is a separate “quality” function that measures the “goodness” of a cluster.
– The definitions of distance functions are usually very different for interval-scaled,
boolean, categorical, ordinal and ratio variables.
– Weights should be associated with different variables based on applications and data
semantics.
– It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types
Interval-valued variables
Standardize the data:
Calculate the mean absolute deviation (MAD):
s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)
where m_f is the mean of variable f, i.e. m_f = (1/n)(x_1f + x_2f + … + x_nf).
Then calculate the standardized measurement (z-score): z_if = (x_if − m_f) / s_f.
Using the mean absolute deviation is more robust than using the standard deviation.
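A minimal NumPy sketch of this standardization, assuming a small made-up vector of measurements for one variable f:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])  # hypothetical values of variable f

m_f = x.mean()                   # m_f: mean of the variable
s_f = np.abs(x - m_f).mean()     # s_f: mean absolute deviation
z = (x - m_f) / s_f              # standardized measurements (z-scores)
print(m_f, s_f)                  # 5.0 2.0
print(z)                         # [-1.5 -0.5 -0.5  0.5  2. ]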
Similarity and Dissimilarity between Objects
◼ f q = 2, d is Euclidean distance:
◼ Properties
◼ d(i,j) 0
◼ d(i,i) = 0
◼ d(i,j) = d(j,i)
◼ d(i,j) d(i,k) + d(k,j)
◼ Also one can use weighted distance, parametric Pearson product moment correlation,
or other dissimilarity measures.
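A short NumPy sketch of these distances on two made-up 3-dimensional points:

import numpy as np

def minkowski(a, b, q):
    # Minkowski distance of order q between two p-dimensional points.
    return np.sum(np.abs(a - b) ** q) ** (1.0 / q)

i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 6.0, 3.0])

print(minkowski(i, j, 1))  # Manhattan distance: 7.0
print(minkowski(i, j, 2))  # Euclidean distance: 5.0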
A contingency table for binary data:
                 Object j
                  1      0     sum
Object i   1      q      r     q + r
           0      s      t     s + t
         sum    q + s  r + t     p
Simple matching coefficient (invariant, if the binary variable is symmetric):
d(i, j) = (r + s) / (q + r + s + t)
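A tiny sketch computing the simple matching coefficient for two assumed binary vectors:

import numpy as np

i = np.array([1, 0, 1, 1, 0, 0])
j = np.array([1, 1, 0, 1, 0, 0])

q = np.sum((i == 1) & (j == 1))  # both 1
r = np.sum((i == 1) & (j == 0))  # i is 1, j is 0
s = np.sum((i == 0) & (j == 1))  # i is 0, j is 1
t = np.sum((i == 0) & (j == 0))  # both 0

d = (r + s) / (q + r + s + t)    # simple matching dissimilarity
print(d)                         # 2/6 ≈ 0.333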
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red,
yellow, blue, green
Method 1: Simple matching
d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
Ordinal Variables
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled variables:
replace each value x_if by its rank r_if ∈ {1, …, M_f},
map the rank onto [0, 1] by z_if = (r_if − 1) / (M_f − 1), then
compute the distance as for interval-scaled variables.
Variables of mixed types
A database may contain variables of all the above types. A weighted formula can combine their effects; for each variable f:
f is binary or nominal:
d_ij(f) = 0 if x_if = x_jf (or the values are missing), and d_ij(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled: compute the ranks r_if, and treat z_if = (r_if − 1) / (M_f − 1) as interval-scaled
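To tie the mixed-type rules together, a hedged Python sketch computing d(i, j) for one pair of objects with one nominal, one interval-based and one ordinal variable (all values, the assumed range, and the weights δ = 1 are illustrative assumptions):

# Hypothetical objects: (colour, weight in kg, rating rank out of M = 3)
obj_i = ("red", 60.0, 1)   # rating rank 1 of 3
obj_j = ("blue", 75.0, 3)  # rating rank 3 of 3

# f1 nominal: 0 if the values match, 1 otherwise
d1 = 0.0 if obj_i[0] == obj_j[0] else 1.0

# f2 interval-based: normalized distance over the variable's range
weight_range = 100.0 - 40.0          # assumed min/max over all objects
d2 = abs(obj_i[1] - obj_j[1]) / weight_range

# f3 ordinal: map rank r to z = (r - 1) / (M - 1), then treat as interval
M = 3
z_i = (obj_i[2] - 1) / (M - 1)
z_j = (obj_j[2] - 1) / (M - 1)
d3 = abs(z_i - z_j)

# Weighted combination with all indicator weights delta = 1
d = (d1 + d2 + d3) / 3
print(d)  # (1 + 0.25 + 1) / 3 = 0.75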