Partitioning Method:
This clustering method classifies the data into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters to be generated. In the partitioning method, given a database D that contains N objects, the method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region of the data space. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), CLARA (Clustering Large Applications), etc. In this article, we will be looking at the working of the K-Means algorithm in detail.
K-Means (A Centroid-Based Technique):
The K-Means algorithm takes the input parameter K from the user and partitions a dataset containing N objects into K clusters, so that the similarity among the data objects inside a cluster (intra-cluster similarity) is high, while the similarity between data objects from different clusters (inter-cluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of the objects in the cluster. K-Means is a type of squared-error algorithm. At the start, K objects are chosen at random from the dataset, and each of them represents a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then recalculated with the added data objects.
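As an illustration of the assignment step just described, here is a minimal Python sketch (the function name and sample values are my own, chosen only for illustration) that assigns each object to the cluster whose current mean is nearest:

```python
# Sketch of the K-Means assignment step: each object joins the cluster
# whose current mean (centroid) is closest in absolute distance.
def assign_to_nearest(objects, means):
    clusters = {i: [] for i in range(len(means))}
    for x in objects:
        nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
        clusters[nearest].append(x)
    return clusters

print(assign_to_nearest([16, 16, 17, 20, 22, 36], [16, 22]))
# {0: [16, 16, 17], 1: [20, 22, 36]}
```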
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces
the time to load and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only. Once a partition is read-only it can no longer be modified, so it only needs to be backed up once; after that, only the current partition has to be backed up regularly.
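A rough Python sketch of this backup policy (the partition names and metadata layout are invented for illustration, not a real warehouse API):

```python
# Hypothetical monthly partitions of a fact table.
partitions = {
    "sales_2013_07": {"read_only": True},   # already backed up once
    "sales_2013_08": {"read_only": True},   # already backed up once
    "sales_2013_09": {"read_only": False},  # current partition, still loading
}

def partitions_to_back_up(parts):
    # Only writable (current) partitions need to be in the regular backup run.
    return [name for name, meta in parts.items() if not meta["read_only"]]

print(partitions_to_back_up(partitions))  # ['sales_2013_09']
```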
To Enhance Performance
By partitioning the fact table into sets of data, query performance can be enhanced, because a query now scans only those partitions that are relevant. It does not have to scan the whole data.
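The following Python sketch shows the idea of partition pruning (the partition layout, row values and month key are assumptions for this example, echoing the sample sales figures used later in this text):

```python
# Fact rows stored in horizontal partitions keyed by month (illustrative data).
fact_partitions = {
    "2013-08": [{"product_id": 30, "value": 3.67}],
    "2013-09": [{"product_id": 35, "value": 5.33},
                {"product_id": 40, "value": 2.50}],
}

def total_value_for_month(month):
    # Partition pruning: only the relevant partition is scanned,
    # never the whole fact table.
    rows = fact_partitions.get(month, [])
    return sum(row["value"] for row in rows)

print(total_value_for_month("2013-09"))  # 7.83
```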
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.
Points to Note
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces the
operating cost.
This technique is suitable where a mix of dipping into recent history and mining through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.
Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Let us look at an example.
Suppose a market function has been structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query only the information captured within its region, it proves more effective to partition the fact table into regional partitions. This speeds up the queries because they do not have to scan information that is not relevant.
Points to Note
The query does not have to scan irrelevant data, which speeds up the query process.
This technique is not appropriate where the dimension is likely to change in the future. So, it is worth determining that the dimension will not change in the future.
If the dimension changes, then the entire fact table would have to be
repartitioned.
Note − We recommend performing the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
Points to Note
This partitioning is complex to manage.
It requires metadata to identify what data is stored in each partition.
Partitioning Dimensions
If a dimension contains a large number of entries, then it may be necessary to partition the dimension. Here we have to check the size of the dimension.
Consider a large design dimension that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may become very large. This would definitely affect the response time.
Vertical Partition
Vertical partitioning splits the data vertically, i.e., by columns rather than by rows.
Vertical partitioning can be performed in the following two ways −
Normalization
Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, rows that repeat the same dimension values are collapsed into a single row in a separate table, which reduces space. Take a look at the following tables, which show the data after normalization: the store details are moved into a separate store table and referenced by Store_id.
Table after Normalization (Store table)

Store_id   Store_name   Location   Region
64         san          Mumbai     S

Table after Normalization (Sales table)

Product_id   Quantity   Value   Sales_date   Store_id
30           5          3.67    3-Aug-13     16
35           4          5.33    3-Sep-13     16
40           5          2.50    3-Sep-13     64
45           7          5.66    3-Sep-13     16
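A small Python sketch of the same normalization step (the denormalized input rows below are illustrative; only the store-64 details come from the table above, and the second product row is hypothetical):

```python
# Denormalized sales rows: the store columns are repeated on every row.
denormalized = [
    {"product_id": 40, "quantity": 5, "value": 2.50, "sales_date": "3-Sep-13",
     "store_id": 64, "store_name": "san", "location": "Mumbai", "region": "S"},
    {"product_id": 99, "quantity": 2, "value": 4.10, "sales_date": "4-Sep-13",
     "store_id": 64, "store_name": "san", "location": "Mumbai", "region": "S"},
]

# Normalization: collapse the repeated store columns into one lookup row
# and keep only store_id on each sales row.
stores, sales = {}, []
for row in denormalized:
    stores[row["store_id"]] = {k: row[k] for k in ("store_name", "location", "region")}
    sales.append({k: row[k] for k in ("product_id", "quantity", "value",
                                      "sales_date", "store_id")})

print(stores)  # one store row instead of two repeated copies
print(sales)   # narrow sales rows that reference the store by store_id
```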
Row Splitting
Row splitting leaves a one-to-one map between the resulting partitions: every row in one partition corresponds to exactly one row in the other. The motive of row splitting is to speed up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a
major join operation between two partitions.
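The sketch below illustrates row splitting in Python (which columns count as frequently used is a hypothetical choice made only for this example):

```python
# Row splitting: each wide row becomes two narrower rows that share the same
# key, so the two partitions stay in a one-to-one mapping.
wide_rows = [
    {"product_id": 30, "value": 3.67, "sales_date": "3-Aug-13",
     "long_description": "rarely queried text"},
    {"product_id": 35, "value": 5.33, "sales_date": "3-Sep-13",
     "long_description": "rarely queried text"},
]

hot_part = [{"product_id": r["product_id"], "value": r["value"],
             "sales_date": r["sales_date"]} for r in wide_rows]
cold_part = [{"product_id": r["product_id"],
              "long_description": r["long_description"]} for r in wide_rows]

# Most queries touch only the smaller hot_part; cold_part is joined back
# on product_id only when the long description is actually needed.
print(hot_part[0], cold_part[0])
```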
Consider the choice of partitioning key for the fact table. Two candidate keys are:
region
transaction_date
Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is
good enough because our requirements capture has shown that a vast majority of queries are
restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transaction from every
region will be in one partition. Now the user who wants to look at data within his own region
has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
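A minimal Python comparison of the two candidate keys (the regions, dates and values are made up for this sketch): partitioned by region, a regional query touches a single partition, whereas partitioned by transaction_date the same query has to visit every partition that holds rows for that region:

```python
from collections import defaultdict

# Illustrative transactions; field names follow the candidate keys above.
transactions = [
    {"region": "North", "transaction_date": "2013-08-01", "value": 10.0},
    {"region": "North", "transaction_date": "2013-09-01", "value": 20.0},
    {"region": "South", "transaction_date": "2013-09-01", "value": 30.0},
]

def partition_by(rows, key):
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

by_region = partition_by(transactions, "region")
by_date = partition_by(transactions, "transaction_date")

def partitions_scanned(parts, region):
    # Number of partitions a query restricted to one region must read.
    return sum(1 for p in parts.values() if any(r["region"] == region for r in p))

print(partitions_scanned(by_region, "North"))  # 1
print(partitions_scanned(by_date, "North"))    # 2
```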
Algorithm: K-Means
Input:
K: The number of clusters into which the dataset has to be divided
D: A dataset containing N objects
Output:
A set of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each remaining object to the cluster to which it is most similar, based on its distance from the cluster mean.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the newly assigned objects.
4. Repeat steps 2 and 3 until no reassignment occurs.
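A self-contained Python sketch of these steps for one-dimensional data (the function name is my own, and the initial centres are passed in explicitly so that a run can be reproduced):

```python
def kmeans_1d(data, initial_means, max_iters=100):
    """Minimal 1-D K-Means: returns (means, clusters) after convergence."""
    means = list(initial_means)
    clusters = []
    for _ in range(max_iters):
        # Step 2: (re)assign each object to the nearest cluster mean.
        new_clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            new_clusters[nearest].append(x)
        # Step 4: stop once the assignment no longer changes.
        if new_clusters == clusters:
            break
        clusters = new_clusters
        # Step 3: recalculate each cluster mean from its newly assigned objects.
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means, clusters
```

Running kmeans_1d on the age data in the example below, with 16 and 22 as the initial centres, reproduces the iterations shown there.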
Figure: K-Means clustering flowchart
Example: Suppose we want to group the visitors to a website using just their
age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. Hence, using the K-Means algorithm, we obtain the two clusters (16-29) and (36-66).
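The final figures above can be checked with a few lines of Python (using only the numbers already listed in the example):

```python
# Verify the final K-Means iteration from the example above.
c1 = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
c2 = [36, 41, 42, 43, 44, 45, 61, 62, 66]

m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
print(round(m1, 2), round(m2, 2))  # 20.5 48.89

# Reassigning every object to the nearer of the two means changes nothing,
# which is why the algorithm stops after iteration 4.
stable = all(abs(x - m1) <= abs(x - m2) for x in c1) and \
         all(abs(x - m2) < abs(x - m1) for x in c2)
print(stable)  # True
```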