clustering
- group similar data points together
- divide data into clusters (groups)
- k-means
- DBSCAN
--------------------------
row = record = tuple = observation = instance = datapoint
col1  col2  col3   group  note
12    24    20     G2     Dp1
14    10    20     G1     Dp2
17    18    25     G2     Dp3
17    24    10*    G1     Dp4, Cr1
11    10    10     G1
12    28    15     G1
18    24    10     G2
12    31    10     G2
15    12    35*    G2     Cr2
14    23    30     G1
19    50    10     G2
11    21    15     G1
k = 2
- divide the above records into k groups
1. pick any k rows at random as the initial centers: Cr1, Cr2
2. find the distance from Dp1 to Cr1
   and from Dp1 to Cr2
d1 = (12-17)**2 + (24-24)**2 + (20-10)**2 = 125
     Dp1 to Cr1 (center of G1)
d2 = (12-15)**2 + (24-12)**2 + (20-35)**2 = 378
     Dp1 to Cr2 (center of G2)
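distance in 1 dimension, between 5 and 9:
   |5 - 9| = 4
distance in 2 dimensions, between (4,5) and (7,9):
   x1,y1   x2,y2
   sqrt( (x1-x2)**2 + (y1-y2)**2 )
   = sqrt( (4-7)**2 + (5-9)**2 ) = sqrt(9 + 16) = 5
sqrt can be skipped when distances are only compared:
   if a > b is true (a and b not negative)
   then a**2 > b**2 is also true
   sqrt(5) > sqrt(3)
   5 > 3
   5**2 > 3**2
distance in 3 dimensions:
   x1,y1,z1   x2,y2,z2
   sqrt( (x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2 )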
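The distance arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the helper name `sq_dist` is made up for illustration, and `math.dist` is the standard-library Euclidean distance (Python 3.8+).

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance: sum of squared column differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

dp1 = (12, 24, 20)
cr1 = (17, 24, 10)
cr2 = (15, 12, 35)

d1 = sq_dist(dp1, cr1)   # 25 + 0 + 100  = 125
d2 = sq_dist(dp1, cr2)   # 9 + 144 + 225 = 378
print(d1, d2)

# Taking sqrt does not change which distance is smaller.
print(d1 < d2, math.sqrt(d1) < math.sqrt(d2))

# 2-D example from the notes: (4,5) to (7,9) is sqrt(9 + 16)
print(math.dist((4, 5), (7, 9)))
```

With these two centers d1 < d2, so by the numbers Dp1 sits closer to Cr1. The G2 label written next to Dp1 in the table seems to be illustrative rather than computed.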
3. assign Dp1 to the group whose center is closest
   (the one with the lowest distance)
4. repeat steps 2 and 3
   for all datapoints
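Steps 2 to 4 could be sketched like this in plain Python, assuming Cr1 and Cr2 are the starred rows and using squared distance for the comparison. The names `rows`, `nearest` and `labels` are hypothetical. The computed groups may differ from the G1/G2 labels written in the tables, which look hand-assigned.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two rows."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the center with the lowest squared distance to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

# The 12 rows of the table, in the same order.
rows = [
    (12, 24, 20), (14, 10, 20), (17, 18, 25), (17, 24, 10),
    (11, 10, 10), (12, 28, 15), (18, 24, 10), (12, 31, 10),
    (15, 12, 35), (14, 23, 30), (19, 50, 10), (11, 21, 15),
]
centers = [(17, 24, 10), (15, 12, 35)]   # Cr1, Cr2

# Steps 2-4: for every datapoint, find the closest center.
labels = [nearest(p, centers) for p in rows]   # 0 means G1, 1 means G2
for p, g in zip(rows, labels):
    print(p, "G%d" % (g + 1))
```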
col1  col2  col3   group  note
11    21    15     G1
14    10    20     G1     Dp2
17    24    10*    G1     Cr1, Dp4
11    10    10     G1
12    28    15     G1     NewCr1
14    23    30     G1

15    12    35*    G2     Cr2
19    50    10     G2
12    24    20     G2     Dp1
17    18    25     G2     Dp3
18    24    10     G2
12    31    10     G2     NewCr2
squared distance, column by column:
(18-12)**2 + (24-31)**2 + (10-11)**2
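5. find the new center of each group
   by taking the mean/avg of each column over the group's rows
for G1 => new center is
   col1: (11+14+17+11+12+14) / 6 = 79/6  = 13.17 (approx)
   col2: (21+10+24+10+28+23) / 6 = 116/6 = 19.33 (approx)
   col3: (15+20+10+10+15+30) / 6 = 100/6 = 16.67 (approx)
for G2 => new center is
   col1: (15+19+12+17+18+12) / 6 = 93/6  = 15.5
   col2: (12+50+24+18+24+31) / 6 = 159/6 = 26.5
   col3: (35+10+20+25+10+10) / 6 = 110/6 = 18.33 (approx)
(the new center is a computed point, it need not be one of the rows)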
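A small sketch of step 5, taking the G1 and G2 rows exactly as they are grouped in the table above. `mean_point` is a made-up helper name.

```python
def mean_point(points):
    """Column-wise mean of a list of equal-length rows."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

g1 = [(11, 21, 15), (14, 10, 20), (17, 24, 10),
      (11, 10, 10), (12, 28, 15), (14, 23, 30)]
g2 = [(15, 12, 35), (19, 50, 10), (12, 24, 20),
      (17, 18, 25), (18, 24, 10), (12, 31, 10)]

new_cr1 = mean_point(g1)
new_cr2 = mean_point(g2)
print([round(v, 2) for v in new_cr1])   # [13.17, 19.33, 16.67]
print([round(v, 2) for v in new_cr2])   # [15.5, 26.5, 18.33]
```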
6. repeat steps 2, 3, 4, 5
   again and again
   until the centers stop changing
   and no datapoint changes its group
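Putting steps 1 to 6 together, a minimal k-means loop might look like the sketch below. It is plain Python with no libraries and starts from Cr1 and Cr2 instead of a random pick so the run is repeatable. The function names, the `max_iter` cap and the choice to keep the old center when a group ends up empty are all assumptions, not part of the notes.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def mean_point(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(data, centers, max_iter=100):
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # steps 2-4: assign every datapoint to its closest center
        new_labels = [nearest(p, centers) for p in data]
        if new_labels == labels:
            break                      # step 6: nothing moved, so stop
        labels = new_labels
        # step 5: move each center to the mean of its group
        for i in range(len(centers)):
            members = [p for p, g in zip(data, labels) if g == i]
            if members:                # assumption: empty group keeps old center
                centers[i] = mean_point(members)
    return centers, labels

rows = [
    (12, 24, 20), (14, 10, 20), (17, 18, 25), (17, 24, 10),
    (11, 10, 10), (12, 28, 15), (18, 24, 10), (12, 31, 10),
    (15, 12, 35), (14, 23, 30), (19, 50, 10), (11, 21, 15),
]
final_centers, final_labels = kmeans(rows, [(17, 24, 10), (15, 12, 35)])
print(final_centers)
print(final_labels)
```

Checking that the labels stopped changing is enough here: if no datapoint changes group, the means, and so the centers, cannot change either.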
=============== ============== K-MEANS ============== =================
N = 100 data points
d1 d2 d3 d4 d5 ... d100
k = 3 (number of clusters)
pick k centroids at random:
d1 d2 d3
dist(d4, d1) = 5
dist(d4, d2) = 7
dist(d4, d3) = 3   (lowest)
d4 goes to the cluster of d3
dist(d5, d1) = 3   (lowest)
dist(d5, d2) = 7
dist(d5, d3) = 5
d5 goes to the cluster of d1
d6
...
d100
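The "lowest distance wins" rule for d4 and d5 can be written as a tiny lookup. The distances are the ones given above. The dictionary layout and variable names are only an assumption for illustration.

```python
# distance from each datapoint to each of the 3 centroids
dist_to_centroid = {
    "d4": {"d1": 5, "d2": 7, "d3": 3},
    "d5": {"d1": 3, "d2": 7, "d3": 5},
}

# each datapoint goes to the centroid with the smallest distance
assignment = {
    point: min(dists, key=dists.get)
    for point, dists in dist_to_centroid.items()
}
print(assignment)   # {'d4': 'd3', 'd5': 'd1'}
```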
-----------------------------------
c1             c2             c3
d1 d3* d9..    d2 d5 d7..     d4 d6 d8 ...
calc mean of the c1 data points -> new centroid of c1
(same for c2 and c3, then reassign and repeat until nothing changes)