
Partitioning Method (K-Means) in Data Mining
Partitioning Method:

This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters to be generated. Given a database D containing N objects, the partitioning method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). In this article, we will see the working of the K-Means algorithm in detail.

K-Means (A Centroid-Based Technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among the data objects within a cluster (intra-cluster) is high, while the similarity with data objects outside the cluster (inter-cluster) is low. The similarity of a cluster is determined with respect to the mean value of the cluster; K-Means is a squared-error based algorithm. At the start, K objects are chosen at random from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then calculated with the added data objects.

Partitioning is done to enhance performance and facilitate easy management of data.


Partitioning also helps in balancing the various requirements of the system. It optimizes hardware performance and simplifies the management of the data warehouse by dividing each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?


Partitioning is important for the following reasons −

 For easy management,
 To assist backup/recovery,
 To enhance performance.
For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of the fact table is very hard to manage as a single entity. Therefore, it needs partitioning.

To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces
the time to load and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only, put into a state where they cannot be modified, and backed up once. Thereafter only the current partition needs to be backed up.

To Enhance Performance
By partitioning the fact table into sets of data, query performance can be enhanced, because a query now scans only the partitions that are relevant. It does not have to scan the whole dataset.

Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments


In this partitioning strategy, the fact table is partitioned on the basis of time period. Each time period represents a significant retention period within the business. For example, if the user queries for month-to-date data, it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.
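
For illustration, here is a minimal Python sketch of how a loader might route each fact row to its monthly segment; the sales_fact_YYYY_MM naming scheme is an assumption, not something the text prescribes:

from datetime import date

def monthly_partition(sales_date: date) -> str:
    # Route each fact row to the segment for its month.
    # The table-naming scheme below is hypothetical.
    return f"sales_fact_{sales_date.year}_{sales_date.month:02d}"

print(monthly_partition(date(2013, 8, 3)))   # sales_fact_2013_08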

Partition by Time into Different-sized Segments


This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data.
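
A minimal sketch of this strategy, assuming (hypothetically) that data up to 90 days old counts as relatively current, and reusing the naming scheme above:

from datetime import date

def segment_for(sales_date: date, today: date) -> str:
    # Small monthly partitions for relatively current data,
    # one larger yearly partition for aged, inactive data.
    age_days = (today - sales_date).days
    if age_days <= 90:                       # assumed "current" window
        return f"sales_fact_{sales_date.year}_{sales_date.month:02d}"
    return f"sales_fact_{sales_date.year}"

print(segment_for(date(2013, 9, 3), today=date(2013, 10, 1)))   # sales_fact_2013_09
print(segment_for(date(2011, 2, 14), today=date(2013, 10, 1)))  # sales_fact_2011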

Points to Note
 The detailed information remains available online.
 The number of physical tables is kept relatively small, which reduces the
operating cost.
 This technique is suitable where a mix of dipping into recent history and data mining through the entire history is required.
 This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.
Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than time such as product
group, region, supplier, or any other dimension. Let's have an example.
Suppose a marketing function has been structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query information captured within its own region, it would prove more effective to partition the fact table into regional partitions. This will speed up the queries because they do not need to scan information that is not relevant.

Points to Note
 The query does not have to scan irrelevant data, which speeds up the query process.
 This technique is not appropriate where the dimension is likely to change in future. So, it is worth determining that the dimension will not change.
 If the dimension changes, then the entire fact table would have to be repartitioned.
Note − We recommend performing the partitioning only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.

Partition by Size of Table


When there is no clear basis for partitioning the fact table on any dimension, we should partition it on the basis of size. We can set a predetermined size as a critical point; when the table exceeds that size, a new table partition is created.

Points to Note
 This partitioning is complex to manage.
 It requires metadata to identify what data is stored in each partition.
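
As a rough in-memory illustration of both points (all names are hypothetical), the loader below watches the predetermined size, opens a new partition once the critical point is reached, and keeps metadata recording which partitions exist:

class SizeBasedPartitioner:
    # Start a new partition whenever the current one reaches the
    # predetermined size, and track each partition in metadata.
    def __init__(self, max_rows: int):
        self.max_rows = max_rows
        self.partitions = [[]]                 # each partition is a list of rows
        self.metadata = [{"name": "part_0"}]

    def insert(self, row):
        if len(self.partitions[-1]) >= self.max_rows:   # critical point reached
            self.partitions.append([])
            self.metadata.append({"name": f"part_{len(self.partitions) - 1}"})
        self.partitions[-1].append(row)

p = SizeBasedPartitioner(max_rows=2)
for i in range(5):
    p.insert({"id": i})
print([m["name"] for m in p.metadata])   # ['part_0', 'part_1', 'part_2']
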
Partitioning Dimensions
If a dimension contains a large number of entries, then it may be necessary to partition the dimension as well. Here we have to check the size of the dimension.
Consider a large dimension that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may become very large. This would definitely affect the response time.

Round Robin Partitions


In the round robin technique, when a new partition is needed, the old one is archived. Metadata is used to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data
warehouse.
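
A minimal sketch of the rotation, with hypothetical partition names; the metadata dictionary stands in for the catalog that points the user access tool at the correct partitions:

from collections import deque

class RoundRobinPartitions:
    # Keep a fixed window of live partitions; when a new partition
    # is needed, the oldest one is archived.
    def __init__(self, live_count: int):
        self.live_count = live_count
        self.live = deque()      # live partition names, oldest first
        self.archived = []

    def new_partition(self, name: str) -> None:
        if len(self.live) == self.live_count:
            self.archived.append(self.live.popleft())   # archive the old one
        self.live.append(name)

    def metadata(self) -> dict:
        return {"live": list(self.live), "archived": self.archived}

rr = RoundRobinPartitions(live_count=3)
for month in ("2013_07", "2013_08", "2013_09", "2013_10"):
    rr.new_partition(f"sales_fact_{month}")
print(rr.metadata())
# {'live': ['sales_fact_2013_08', 'sales_fact_2013_09', 'sales_fact_2013_10'],
#  'archived': ['sales_fact_2013_07']}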

Vertical Partition
Vertical partitioning splits the data vertically, i.e., by columns. It can be performed in the following two ways −

 Normalization
 Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, repeated rows are collapsed into a single row in a separate table, which reduces space. Take a look at the following tables that show how normalization is performed.
Table before Normalization

Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        Sunny       Bangalore  S
35          4    5.33   3-Sep-13    16        Sunny       Bangalore  S
40          5    2.50   3-Sep-13    64        San         Mumbai     W
45          7    5.66   3-Sep-13    16        Sunny       Bangalore  S

Table after Normalization

Store_id  Store_name  Location   Region
16        Sunny       Bangalore  S
64        San         Mumbai     W

Product_id  Qty  Value  sales_date  Store_id
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64
45          7    5.66   3-Sep-13    16
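
The same split expressed as a minimal Python sketch over the rows above, with plain dictionaries standing in for the tables:

wide_rows = [
    {"Product_id": 30, "Qty": 5, "Value": 3.67, "sales_date": "3-Aug-13",
     "Store_id": 16, "Store_name": "Sunny", "Location": "Bangalore", "Region": "S"},
    {"Product_id": 40, "Qty": 5, "Value": 2.50, "sales_date": "3-Sep-13",
     "Store_id": 64, "Store_name": "San", "Location": "Mumbai", "Region": "W"},
]

stores, sales = {}, []
for row in wide_rows:
    # Collapse the repeated store columns into one row per store ...
    stores[row["Store_id"]] = {k: row[k] for k in
                               ("Store_id", "Store_name", "Location", "Region")}
    # ... and keep only the Store_id foreign key in the fact rows.
    sales.append({k: row[k] for k in
                  ("Product_id", "Qty", "Value", "sales_date", "Store_id")})

print(list(stores.values()))   # the store dimension table
print(sales)                   # the slimmed-down fact table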

Row Splitting
Row splitting tends to leave a one-to-one map between the partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a
major join operation between two partitions.

Identify Key to Partition


It is crucial to choose the right partition key. Choosing the wrong partition key can mean the fact table has to be reorganized. Let's have an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be

 region
 transaction_date
Suppose the business is organized into 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transactions from every region will be in one partition. Now a user who wants to look at data within their own region has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
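
A small in-memory sketch of the difference, using hypothetical data: a query restricted to one region touches a single region partition, but every date partition:

from collections import defaultdict

transactions = [
    {"region": "north", "transaction_date": "2013-09-01"},
    {"region": "south", "transaction_date": "2013-09-01"},
    {"region": "north", "transaction_date": "2013-09-02"},
    {"region": "west",  "transaction_date": "2013-09-02"},
]

def partition_by(rows, key):
    parts = defaultdict(list)
    for r in rows:
        parts[r[key]].append(r)      # one partition per distinct key value
    return parts

def partitions_scanned(parts, region):
    # Number of partitions a region-restricted query must touch.
    return sum(any(r["region"] == region for r in p) for p in parts.values())

print(partitions_scanned(partition_by(transactions, "region"), "north"))            # 1
print(partitions_scanned(partition_by(transactions, "transaction_date"), "north"))  # 2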

Algorithm: K-Means
Input:
K: The number of clusters into which the dataset has to be divided
D: A dataset containing N objects

Output:
A set of K clusters

Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2 and 3 until the assignments no longer change.
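
A minimal Python sketch of these steps for one-dimensional data (as in the example below). The initial centres can be passed in explicitly so the worked example is reproducible; step 1 normally chooses them at random:

import random

def k_means_1d(data, k, centres=None, max_iter=100):
    # Step 1: randomly choose K objects as the initial cluster centres.
    if centres is None:
        centres = random.sample(data, k)
    clusters = None
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the nearest cluster mean.
        new_clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centres[i]))
            new_clusters[nearest].append(x)
        # Step 4: stop when the assignments no longer change.
        if new_clusters == clusters:
            break
        clusters = new_clusters
        # Step 3: recalculate each cluster mean from its assigned objects.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters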

Figure – K-Means clustering flowchart

Example: Suppose we want to group the visitors to a website using just their
age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66
Initial Clusters:

K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:

C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No change between Iterations 3 and 4, so we stop. Therefore, the K-Means algorithm yields the two clusters (16-29) and (36-66).
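
Running the sketch above on the same data with the same starting centres reproduces these clusters:

ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45,
        61, 62, 66]
centres, clusters = k_means_1d(ages, k=2, centres=[16, 22])
print([round(c, 2) for c in centres])   # [20.5, 48.89]
print(clusters[0])   # [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
print(clusters[1])   # [36, 41, 42, 43, 44, 45, 61, 62, 66]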
