Data Mining-Unit IV
Introduction to cluster analysis, Mining complex types of data: Multidimensional analysis and
descriptive mining of complex data objects, Spatial databases, Multimedia databases,
Mining time series and sequence data, Mining text databases, Mining the World Wide Web,
Data Chunking Techniques.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of
similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on data similarity and
then assign labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes and helps single
out useful features that distinguish different groups.
Clustering can also help marketers discover distinct groups in their customer base and
characterize those customer groups based on their purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation database. It
also helps in the identification of groups of houses in a city according to house type, value, and
geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to
observe characteristics of each cluster.
Requirements of Clustering in Data Mining
Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any
kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting
clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only
spherical clusters of small size.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional
data but also high-dimensional data.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. This
means that the method classifies the data into k groups, which satisfy the following
requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one
group to another.
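As a rough illustration of the partitioning idea, here is a minimal k-means sketch in Python. It is only a sketch: scikit-learn, the synthetic two-dimensional data, and the choice k = 3 are assumptions made for the example, not part of the method description above.

# Minimal k-means sketch (partitioning method), assuming scikit-learn is installed.
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: 300 two-dimensional points drawn around three centers.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# Partition the n objects into k = 3 groups; each object ends up in exactly one
# cluster and each cluster contains at least one object.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("Cluster labels:", kmeans.labels_[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)

The iterative relocation mentioned above corresponds to k-means repeatedly reassigning objects to the nearest center and recomputing the centers until the partitioning stops improving.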
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
We can classify hierarchical methods on the basis of how the hierarchical
decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are
close to one another. It keeps on doing so until all of the groups are merged into one
or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In each successive iteration, a cluster is split into
smaller clusters. This continues until each object is in its own cluster or the termination
condition holds. This method is rigid, i.e., once a merging or splitting is done, it can
never be undone.
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration with iterative relocation, by first using a hierarchical
agglomerative algorithm and then refining the result using iterative relocation.
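The bottom-up approach can be sketched very briefly in Python. This is a minimal sketch assuming SciPy is available; the six sample points, the average-linkage choice, and the distance threshold are illustrative assumptions.

# Agglomerative (bottom-up) hierarchical clustering sketch, assuming SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: six 2-D points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Start with each point as its own group and repeatedly merge the two closest
# groups (average linkage) until everything is merged into one cluster.
merge_tree = linkage(points, method="average")

# Cut the resulting tree at a distance threshold to obtain flat cluster labels.
labels = fcluster(merge_tree, t=1.0, criterion="distance")
print(labels)  # e.g. [1 1 1 2 2 2]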
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some
threshold, i.e., for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
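A common density-based algorithm is DBSCAN; the sketch below assumes scikit-learn and synthetic data, with eps playing the role of the neighborhood radius and min_samples the minimum number of points it must contain.

# Density-based clustering sketch (DBSCAN), assuming scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense_blob = rng.normal(loc=(0, 0), scale=0.3, size=(200, 2))   # a dense region
sparse_noise = rng.uniform(low=-5, high=5, size=(20, 2))        # scattered points
data = np.vstack([dense_blob, sparse_noise])

# A point joins a cluster if its eps-neighborhood contains at least min_samples
# points; isolated points are labeled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(data)
print("clusters found:", set(labels) - {-1}, "noise points:", list(labels).count(-1))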
Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite
number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized space.
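The quantization step can be illustrated with a toy sketch (NumPy only; the 20-cells-per-dimension grid and the density threshold of 50 points are arbitrary assumptions, and real grid-based algorithms such as STING do considerably more than this).

# Grid-based sketch: quantize the 2-D object space into cells and summarize per cell.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=(0, 0), scale=1.0, size=(10_000, 2))

# Quantize each dimension into 20 cells; the grid has 20 x 20 cells in total,
# independent of how many data objects there are.
counts, x_edges, y_edges = np.histogram2d(data[:, 0], data[:, 1], bins=20)

# "Dense" cells (here, more than 50 points) could then be merged into clusters.
dense_cells = np.argwhere(counts > 50)
print("number of dense cells:", len(dense_cells))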
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data
for a given model. This method locates the clusters by clustering the density
function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
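A minimal model-based sketch, assuming scikit-learn: each cluster is hypothesized to be a Gaussian component fitted by EM, and a standard statistic (here BIC, one possible choice) is used to pick the number of clusters automatically.

# Model-based clustering sketch: Gaussian mixtures fitted by EM, assuming scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(150, 2)),
                  rng.normal(loc=(4, 4), scale=0.5, size=(150, 2))])

# Fit several candidate models and keep the one with the lowest BIC, i.e. let a
# standard statistic choose the number of clusters.
best_model = min(
    (GaussianMixture(n_components=k, random_state=0).fit(data) for k in range(1, 6)),
    key=lambda model: model.bic(data),
)
print("chosen number of clusters:", best_model.n_components)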
Constraint-based Method
In this method, the clustering is performed by the incorporation of user- or application-oriented
constraints. A constraint refers to the user expectation or the properties of the desired clustering results.
Spatial Databases
There are several challenging issues regarding the construction and utilization of spatial
data warehouses.
1. The first challenge is the integration of spatial data from heterogeneous sources and systems.
2. The second challenge is the realization of fast and flexible on-line analytical processing in
spatial data warehouses.
Multimedia Databases
Image feature specification queries specify or sketch image features like color, texture, or
shape, which are translated into a feature vector to be matched with the feature vectors of the
images in the database.
Mining Associations in Multimedia Data
1. Associations between image content and non-image content features:
A rule like “If at least 50% of the upper part of the picture is blue, then it is likely to
represent sky” belongs to this category since it links the image content to the keyword sky.
2. Associations among image contents that are not related to spatial relationships: A rule like
“If a picture contains two blue squares, then it is likely to contain one red circle as well”
belongs to this category since the associations are all regarding image contents.
3. Associations among image contents related to spatial relationships: A rule like “If a red
triangle is between two yellow squares, then it is likely a big oval-shaped object is
underneath” belongs to this category since it associates objects in the image with spatial
relationships.
Several approaches have been proposed and studied for similarity-based retrieval in image
databases, based on image signatures:
1. Color histogram–based signature: In this approach, the signature of an image includes color
histograms based on the color composition of an image regardless of its scale or orientation.
This method does not contain any information about shape, image topology, or texture (a rough
sketch of such a signature is given after this list).
2. Multifeature composed signature: In this approach, the signature of an image includes a
composition of multiple features: color histogram, shape, image topology, and texture.
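As a rough sketch of the color histogram–based signature in item 1 (NumPy only; the 8-bins-per-channel choice and the histogram-intersection similarity are illustrative assumptions, not a prescribed method):

# Color histogram signature sketch: color composition independent of scale/orientation.
import numpy as np

def color_histogram_signature(image, bins_per_channel=8):
    """image: H x W x 3 array of RGB values in [0, 255]."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=bins_per_channel, range=[(0, 256)] * 3)
    return hist.ravel() / pixels.shape[0]  # normalize so image size does not matter

def histogram_intersection(sig_a, sig_b):
    """Similarity in [0, 1]; 1 means identical color composition."""
    return np.minimum(sig_a, sig_b).sum()

# Two illustrative "images": a mostly blue one and a mostly red one.
blue_image = np.zeros((64, 64, 3)); blue_image[..., 2] = 200
red_image = np.zeros((64, 64, 3));  red_image[..., 0] = 200
print(histogram_intersection(color_histogram_signature(blue_image),
                             color_histogram_signature(red_image)))  # ~0.0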
“Can we construct a data cube for multimedia data analysis?” To facilitate the
multidimensional analysis of large multimedia databases, multimedia data cubes can be
designed and constructed in a manner similar to that for traditional data cubes from relational
data. A multimedia data cube can contain additional dimensions and measures for multimedia
information, such as color, texture, and shape.
Recall: This is the percentage of documents that are relevant to the query and were,
in fact, retrieved. It is formally defined as
recall = |{Relevant documents} ∩ {Retrieved documents}| / |{Relevant documents}|
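A minimal sketch of computing recall in Python (the document IDs below are made up for illustration):

# Recall sketch: fraction of the relevant documents that were actually retrieved.
relevant = {"d1", "d2", "d3", "d4", "d5"}   # documents relevant to the query
retrieved = {"d2", "d3", "d7"}              # documents returned by the system

recall = len(relevant & retrieved) / len(relevant)
print(recall)  # 2 of the 5 relevant documents were retrieved -> 0.4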
How is Mining the World Wide Web done?
The World Wide Web serves as a huge, widely distributed, global information service center
for news, advertisements, consumer information, financial management, education,
government, e-commerce, and many other information services. The Web also contains
a rich and dynamic collection of hyperlink information and Web page access and usage
information, providing rich sources for data mining.
Challenges for effective resource and knowledge discovery on the Web
1. The Web seems to be too huge for effective data warehousing and data mining. The size of
the Web is in the order of hundreds of terabytes and is still growing rapidly. Many
organizations and societies place most of their publicly accessible information on the Web. It is
barely possible to set up a data warehouse to replicate, store, or integrate all of the data on the
Web.
2. The complexity of Web pages is far greater than that of any traditional text document
collection. Web pages lack a unifying structure.
3. The Web is a highly dynamic information source. Not only does the Web grow rapidly,
but its information is also constantly updated.
4. The Web serves a broad diversity of user communities. The Internet currently connects
more than 100 million workstations, and its user community is still rapidly expanding.
These challenges have promoted research into efficient and effective discovery and use of
resources on the Internet.
Mining the WWW
1. Mining the Web Page Layout Structure
The basic structure of a Web page is its DOM (Document Object Model) structure. The DOM
structure of a Web page is a tree structure, where every HTML tag in the page corresponds to
a node in the DOM tree. The Web page can be segmented by some predefined structural tags.
Thus the DOM structure can be used to facilitate information extraction.
Here, we introduce an algorithm called VIsion-based Page Segmentation (VIPS).
VIPS aims to extract the semantic structure of a Web page based on its visual presentation.
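As a minimal sketch of the DOM idea, the snippet below walks a made-up page with Python's standard html.parser and prints each tag at its nesting depth; it is not the VIPS algorithm, only an illustration of how tags map to tree nodes.

# DOM-structure sketch: print each HTML tag at its nesting depth (standard library only).
from html.parser import HTMLParser

class TagTreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)  # every tag corresponds to a node in the DOM tree
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

sample_page = "<html><body><div><h1>Title</h1><p>Some text</p></div></body></html>"
TagTreePrinter().feed(sample_page)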
2. Mining the Web’s Link Structures to Identify Authoritative Web Pages
The Web consists not only of pages, but also of hyperlinks pointing from one page to
another.
These hyperlinks contain an enormous amount of latent human annotation that can help
automatically infer the notion of authority. These properties of Web link structures have led
researchers to consider another important category of Web pages called a hub. A hub is one
or a set of Web pages that provides collections of links to authorities.
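The hub/authority intuition can be sketched with the classic HITS-style iteration over a tiny made-up link graph (the graph and the 20 iterations are assumptions for illustration only):

# Hub/authority sketch (HITS-style iteration) over a toy link graph.
# links[p] is the set of pages that page p points to.
links = {
    "hub1": {"authA", "authB"},
    "hub2": {"authA", "authB", "authC"},
    "authA": set(), "authB": set(), "authC": {"authA"},
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # A page's authority score sums the hub scores of the pages linking to it.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # A page's hub score sums the authority scores of the pages it links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores stay bounded.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1])[:2])  # strongest authorities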
Data chunking
Data deduplication is an emerging technology that introduces reduction of storage utilization and an efficient way
of handling data replication in the backup environment. In cloud data storage, the deduplication technology plays
a major role in the virtual machine framework, data sharing network, and structured and unstructured data
handling by social media and, also, disaster recovery. In the deduplication technology, data are broken down into
multiple pieces called “chunks” and every chunk is identified with a unique hash identifier. These identifiers are
used to compare the chunks with previously stored chunks and verified for duplication. Since the chunking
algorithm is the first step involved in getting efficient data deduplication ratio and throughput, it is very important
in the deduplication scenario.
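The chunk-and-hash idea can be sketched in a few lines of Python (standard library only; fixed-size chunks and SHA-256 identifiers are illustrative choices, since real systems often use content-defined chunking):

# Deduplication sketch: split data into fixed-size chunks, identify each chunk by
# its SHA-256 hash, and store each unique chunk only once.
import hashlib

CHUNK_SIZE = 4096   # illustrative chunk size in bytes
chunk_store = {}    # hash identifier -> chunk bytes (the deduplicated store)

def backup(data: bytes):
    """Return the list of chunk identifiers that describes `data`."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in chunk_store:   # duplicate chunks are not stored again
            chunk_store[chunk_id] = chunk
        recipe.append(chunk_id)
    return recipe

# Two "files" that share most of their content: only the differing chunks add to
# storage, which is where the deduplication ratio comes from.
recipe1 = backup(b"A" * 8192 + b"B" * 4096)
recipe2 = backup(b"A" * 8192 + b"C" * 4096)
print("chunks referenced:", len(recipe1) + len(recipe2),
      "chunks stored:", len(chunk_store))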
What is data chunking? How can chunking help to organize large multidimensional datasets for both fast
and flexible data access? How should chunk shapes and sizes be chosen? Can software such as netCDF-4 or
HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues
they raise, read on ...
Is anyone still there? OK, let's start with a real-world example of the improvements possible with chunking
in netCDF-4. You may be surprised, as I was, by the results. Maybe looking at examples in this post will help
provide some guidance for effective use of chunking in other similar cases.
First let's consider a single large 3D variable from the NCEP North American Regional Reanalysis
representing air temperature (if you must know, it's at the 200 millibars level, at every 3 hours, on a 32.463
km resolution grid, over 33 years from 1979-01-01 through 2012-07-31):
dimensions:
y = 277 ;
x = 349 ;
time = 98128 ;
variables:
float T(time, y, x);
Of course the file has lots of other metadata specifying units, coordinate system, and data provenance, but in
terms of size it's mostly just that one big variable: 9.5 billion values comprising 38 GB of data.
This file is an example of PrettyBigData (PBD). Even though you can store it on a relatively cheap flash
drive, it's too big to deal with quickly. Just copying it to a 7200 rpm spinning disk takes close to 20 minutes.
Even copying to fast solid state disk (SSD) takes over 4 minutes. For a human-scale comparison, it's close to
the storage used for a blu-ray version of a typical movie, about 50 GB. (As an example of ReallyBigData
(RBD), a data volume beyond the comprehension of ordinary humans, consider the 3D, 48 frame per second
version of "The Hobbit, Director's Cut".)
1. Get a 1D time-series of all the data from a specific spatial grid point.
2. Get a 2D spatial slice of all the data at a specific time.
The first kind of access asks for the 1D array of values along the time axis at one grid point; the
second asks for the 2D array of values over the whole spatial grid at one time.
With a conventional contiguous (index-order) storage layout, the time dimension varies most slowly, y
varies faster, and x varies fastest. In this case, the spatial access is fast (0.013 sec) and the time series access
is slow (180 sec, which is 14,000 times slower). If we instead want the time series to be quick, we can
reorganize the data so x or y is the most slowly varying dimension and time varies fastest, resulting in fast
time-series access (0.012 sec) and slow spatial access (200 sec, 17,000 times slower). In either case, the slow
access is so slow that it makes the data essentially inaccessible for all practical purposes, e.g. in analysis or
visualization.
But what if we want both kinds of access to be relatively fast? Well, we could punt and make two versions of
the file, each organized appropriately for the kind of access desired. But that solution doesn't scale well. For
N-dimensional data, you would need N copies of the data to support optimal access along any axis, and N!
copies to support optimal access to any cross-section defined by some subset of the N dimensions.
A better solution, known for at least 30 years, is the use of chunking, storing multidimensional data in
multi-dimensional rectangular chunks to speed up slow accesses at the cost of slowing down fast accesses.
Programs that access chunked data can be oblivious to whether or how chunking is used. Chunking is
supported in the HDF5 layer of netCDF-4 files, and is one of the features, along with per-chunk
compression, that led to a proposal to use HDF5 as a storage layer for netCDF-4 in 2002.
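For concreteness, here is a minimal sketch of declaring chunk shapes when writing such a variable, assuming the netCDF4 Python bindings; the dimension sizes match the example above, but the chunk shape shown is purely illustrative, not a recommendation.

# Chunked netCDF-4 variable sketch, assuming the netCDF4 Python package.
from netCDF4 import Dataset

ds = Dataset("air_temperature_chunked.nc", "w", format="NETCDF4")
ds.createDimension("time", None)   # unlimited, as in the reanalysis example
ds.createDimension("y", 277)
ds.createDimension("x", 349)

# Store T in chunks (illustrative shape) instead of contiguously; a 1-D time
# series and a 2-D spatial slice then each touch a bounded number of chunks
# rather than requiring a scan of the whole variable. zlib=True enables
# per-chunk compression.
T = ds.createVariable("T", "f4", ("time", "y", "x"),
                      chunksizes=(1000, 40, 40), zlib=True)
ds.close()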
Benefits of Chunking
I think the benefits of chunking are under-appreciated.
Large performance gains are possible with good choices of chunk shapes and sizes. Chunking also supports
efficiently extending multidimensional data along multiple axes (in netCDF-4, this is called "multiple
unlimited dimensions") as well as efficient per-chunk compression, so reading a subset of a compressed
variable doesn't require uncompressing the whole variable.
So why isn't chunking more widely used? I think reasons include at least the following:
1. Advice for how to choose chunk shapes and sizes for specific patterns of access is lacking.
2. Default chunk shapes and sizes for libraries such as netCDF-4 and HDF5 work poorly in some
common cases.
3. It's costly to rewrite big datasets that use conventional contiguous layouts to use chunking instead.
For example, even if you can fit the whole variable, uncompressed, in memory, chunking a 38GB
variable can take 20 or 30 minutes.
This series of posts and better guidance in software documentation will begin to address the first problem.
HDF5 already has a start with a white paper on chunking.
The second reason for under-use of chunking is not so easily addressed. Unfortunately, there are no general-
purpose chunking defaults that are optimal for all uses. Different patterns of access lead to different chunk
shapes and sizes for optimum access. Optimizing for a single specific pattern of access can degrade
performance for other access patterns.
Finally, the cost of chunking data means that you either need to get it right when the data is created, or the
data must be important enough that the cost of rechunking for many read accesses is justified. In the latter
case, you may want to consider acquiring a computing platform with lots of memory and SSD, just for the
purpose of rechunking important datasets.
Here's a table of timings for various shapes and sizes of chunks, using conventional local 7200 rpm spinning
disk with 4096-byte physical disk blocks, the kind of storage that's still prevalent on desk-top and
departmental scale platforms:
We've already seen the timings in the first two rows of this table, showing huge performance bias when
using contiguous layouts. The third row shows the current netCDF-4 default for chunking this data,
choosing chunk sizes close to 4 MB and trying to equalize the number of chunks along any axis. This turns
out not to be particularly good for trying to balance 1D and 2D accesses. The fourth row shows results of
smaller chunk sizes, using shapes that provide a better balance between time series and spatial slice
accesses for this dataset.
I think the last row of this table supports the main point to be made in this first posting on chunking data.
By creating or rewriting important large multidimensional datasets using appropriate chunking, you can
tailor their access characteristics to make them more useful. Proper use of chunking can support more than
one common query pattern.
That's enough for now. In part 2, we'll discuss how to determine good chunk shapes, present a general way
to balance access times for 1D and 2D accesses in 3D variables, say something about generalizations to
higher dimension variables, and provide examples of rechunking times using the nccopy and h5repack
utilities.
On Linux, the disk caches are flushed and cleared before each timing run with a script like the following (root privilege is needed for drop_caches):
#!/bin/sh
# Flush and clear the disk caches.
sync
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
On OSX, the only privilege required is knowledge of the "purge" command, which we wrap in a shell script
to make our benchmarks work the same on OSX as on Linux:
#!/bin/sh
# Flush and clear the disk caches.
purge