The K-Means clustering algorithm divides a given dataset into *k* subsets (clusters) whose members have similar properties.

To begin with, let’s say that we have the dataset below, containing 200 two-dimensional points, and we want to partition it into *k* smaller sets of points that are close to each other.

1.97; 2.77

4.40; 2.67

2.93; 0.96

2.24; 3.53

0.99; 2.61

4.47; 3.42

0.07; 1.12

2.76; 4.26

2.26; 2.01

3.31; 2.97

3.71; 4.33

0.80; 3.35

1.40; 0.76

1.91; 0.72

1.48; 0.84

0.74; 1.55

0.49; 0.94

4.43; 0.69

2.40; 1.71

3.39; 2.97

1.87; 1.43

4.23; 1.55

0.94; 3.97

3.19; 4.12

1.49; 3.45

4.30; 1.55

3.32; 3.93

3.89; 1.09

3.48; 3.68

0.28; 1.64

4.33; 0.68

3.24; 1.40

3.97; 3.39

1.17; 1.91

2.23; 4.45

0.86; 2.99

4.33; 1.79

0.77; 0.99

2.58; 3.82

2.80; 1.76

2.47; 1.80

0.58; 4.33

3.81; 2.52

4.37; 1.14

3.53; 0.77

0.56; 3.65

0.61; 2.16

1.19; 3.85

2.05; 0.08

3.95; 2.24

6.05; 2.22

9.13; 1.76

6.94; 1.49

8.76; 4.12

6.62; 0.53

6.09; 1.79

9.53; 3.96

5.82; 2.46

7.64; 0.97

8.59; 3.54

9.40; 0.26

7.00; 1.97

6.86; 3.65

9.20; 4.22

7.48; 0.17

8.69; 0.60

5.68; 1.33

7.71; 2.07

6.19; 3.96

8.69; 1.49

7.82; 1.73

9.44; 0.91

6.30; 0.16

9.51; 3.85

7.75; 1.55

7.55; 1.55

6.94; 2.65

8.66; 4.15

6.47; 2.87

7.72; 2.66

9.65; 3.07

6.67; 4.16

9.04; 1.70

8.27; 0.15

6.21; 0.23

5.58; 0.34

5.83; 0.02

9.41; 0.43

7.17; 4.44

7.98; 3.11

9.44; 4.26

8.00; 3.03

7.33; 1.80

9.50; 4.09

6.69; 3.67

9.12; 3.65

8.96; 3.89

9.80; 0.44

5.73; 0.66

8.73; 4.41

6.82; 6.92

7.06; 8.26

9.30; 7.09

5.52; 6.04

9.07; 9.42

8.71; 6.89

6.74; 7.62

7.62; 8.23

5.78; 9.56

7.16; 6.51

5.64; 8.43

8.32; 6.31

6.76; 9.75

6.78; 5.79

8.96; 8.01

7.10; 9.67

9.24; 8.42

9.66; 7.31

8.76; 6.78

7.46; 8.12

7.14; 7.62

7.94; 6.23

8.14; 7.13

9.47; 9.50

9.10; 7.53

9.65; 5.56

8.23; 6.44

6.36; 9.26

9.79; 7.01

8.39; 7.47

5.86; 5.82

6.37; 5.74

6.36; 6.71

8.23; 7.68

8.18; 6.78

8.87; 7.43

9.37; 6.68

8.76; 6.33

9.95; 8.92

8.45; 6.49

9.38; 9.34

7.05; 6.58

5.77; 9.24

5.89; 7.31

7.08; 8.75

9.20; 6.88

9.55; 8.85

8.69; 6.65

7.53; 9.91

6.11; 7.26

0.21; 6.45

2.91; 8.78

0.33; 6.85

1.24; 9.32

4.37; 8.82

0.63; 7.03

3.54; 9.06

1.73; 9.55

0.31; 9.69

0.77; 5.52

3.68; 9.47

4.22; 5.54

1.33; 9.94

2.94; 7.20

3.27; 8.79

2.05; 8.44

3.32; 6.14

0.10; 9.08

0.75; 6.86

0.50; 9.76

2.51; 9.95

2.38; 8.28

4.14; 8.96

1.68; 7.75

2.44; 9.32

0.90; 7.75

1.00; 7.99

0.01; 6.06

2.86; 8.60

1.63; 9.43

1.27; 5.73

2.85; 8.59

1.15; 5.81

3.55; 6.09

0.63; 7.43

2.33; 9.54

1.01; 6.77

0.63; 8.32

1.23; 8.21

4.40; 9.19

0.29; 7.63

1.90; 5.69

0.89; 5.84

3.74; 7.20

0.63; 6.72

2.09; 7.46

4.43; 8.78

0.95; 6.84

3.08; 8.69

0.10; 9.49

After clustering, groups should look like this:

As can be seen in the plot, the groups and the points at their centers form a *Voronoi diagram*.

Partitioning can be done with the K-Means Clustering algorithm:

**IN:** an *n*-dimensional dataset of real values and an integer *k* greater than 1, representing the number of clusters (with *k* = 1 the dataset is already partitioned into one set).

**OUT:** *k* clusters.

1. Create *k* *n*-dimensional points, called centroids, and initialize them using one of the initialization methods.
2. For each point in the dataset, find the centroid closest to it and put the point in the cluster that centroid represents.
3. For each cluster, calculate its mean point and assign that value to the corresponding centroid.
4. Check whether the centroids have moved (have different values than in the previous iteration).
5. If they have moved, go to step 2. If not, end the algorithm.
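The steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the names `k_means` and `mean_point` and the `max_iterations` safety limit are made up for this example, Euclidean distance is assumed, the initial centroids are passed in by the caller (step 1), and an empty cluster simply keeps its previous centroid.

```python
import math


def mean_point(points):
    """Return the component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters, iterations) for the given starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]  # step 1 is done by the caller
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for iteration in range(1, max_iterations + 1):
        # Step 2: put every point into the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[closest].append(p)
        # Step 3: move each centroid to the mean point of its cluster.
        # Assumption: an empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Steps 4 and 5: stop when no centroid has moved, otherwise repeat.
        if new_centroids == centroids:
            return centroids, clusters, iteration
        centroids = new_centroids
    return centroids, clusters, max_iterations


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters, iterations = k_means(points, [(0, 0), (10, 10)])
print(clusters)    # two groups of three points each
print(iterations)  # 2: one pass moves the centroids, the second confirms they are stable
```

Passing the starting centroids in as an argument keeps the clustering loop independent of the initialization method, so different methods can be compared on the same dataset.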

Before using this algorithm, we should first explain some things:

- **Cluster** – a set of points close to each other (having similar values).
- **Centroid** – the point that is the mean of all points of the corresponding cluster (the point in its center).

K-Means is one of the simplest *clustering* algorithms, yet it is very powerful.

**Distance**

The distance between two points can be calculated in different ways. The most commonly used one is the Minkowski distance of order 2, which is simply the Euclidean distance. The squared Euclidean distance (the Euclidean distance without the square root) is also often used. For experimental purposes, try the Minkowski distance with different orders:

**Distance Order 1 – Taxicab Distance:**

$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$

**Distance Order 2 – Euclidean Distance:**

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

**Distance Order x:**

$$d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^x \right)^{1/x}$$

where *p* and *q* are points in *n*-dimensional space.
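As a small sketch, the Minkowski distance of any order can be written as one Python function. The name `minkowski_distance` and its `order` parameter are chosen for this example only.

```python
def minkowski_distance(p, q, order=2):
    """Minkowski distance between points p and q of the same dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)


print(minkowski_distance((0, 0), (3, 4), order=1))  # 7.0 (taxicab)
print(minkowski_distance((0, 0), (3, 4), order=2))  # 5.0 (Euclidean)
```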

### Initialization Methods

In all initialization methods it is important to make sure that centroids are not very close to each other, otherwise there might be one or more **empty clusters**.

- **Predefined Points** – centroids are predefined points (chosen by running other initialization methods or from observation). This works best for constant, known datasets.
- **Random Points** – centroid values are assigned randomly from a chosen interval; the better the interval is chosen, the faster the algorithm finishes. It can be slightly improved by taking the minimal value in the dataset as the beginning of the interval and the maximal value as its end.
- **Random Partition** – points from the dataset are randomly assigned to clusters.
- **Forgy** – centroids are randomly chosen from the points in the dataset. It is important to ensure that no point from the dataset is assigned to two or more centroids at the same time. For example, the dataset can be divided into *k* smaller sets, with one point selected from each.
- **Biased Forgy** – similar to **Forgy**, except that only the *n* points farthest from the dataset center (mean point) can be selected as centroids.
- **Centroids Of Random Sub-samples** – each centroid is calculated by taking *n* points from the dataset and calculating their mean. It can be very ineffective (the resulting points can be very close to each other) unless the points are taken carefully, for example from pre-selected groups.
- **Farthest Points Simple Selection** – the first centroid is randomly selected from the dataset, the second is the point farthest from the first centroid, and every next one is chosen as the point farthest from all previous centroids (measured to the closest of the previous centroids).
- **Farthest Points Running Selection** – in the beginning, centroids are taken from the first points in the dataset; then they are reassigned to other points from the dataset lying as far from each other as possible. The goal is to have the centroids placed as far from each other as possible.
- **K-Means++** – the first centroid is randomly selected from the dataset. The second is selected randomly from the dataset with a probability proportional to the squared Euclidean distance to the first centroid: the farther the point, the greater its chance of being selected. Every next centroid is chosen similarly, with a greater probability for points farther from the closest of the previous centroids.
- **Group Representative Points** – before assigning centroid values, calculate the mean point of the dataset, find the dataset point closest to it, and assign its value to the first centroid. Every next centroid is chosen from the dataset points so that it is close to a group of points that are not yet close to any of the previous centroids (using squared Euclidean distance).
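Two of the methods above, **Forgy** and **K-Means++**, can be sketched as follows. This is only an illustration under a few assumptions: the function names are invented for this example, points are tuples, the dataset contains at least *k* distinct points, and a `random.Random` instance is passed in so that runs can be reproduced.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def forgy(points, k, rng):
    """Forgy: pick k different positions in the dataset as initial centroids."""
    return rng.sample(points, k)


def k_means_plus_plus(points, k, rng):
    """K-Means++: prefer points far from the centroids chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight of a point = squared distance to its closest centroid.
        # Already chosen points have weight 0, so they are never picked again.
        # random.choices raises ValueError if every weight is 0, which happens
        # only when there are fewer distinct points than requested centroids.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


rng = random.Random(42)  # fixed seed, only to make the example repeatable
data = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.5), (9.3, 0.1)]
print(forgy(data, 3, rng))
print(k_means_plus_plus(data, 3, rng))
```

Either function returns a list of starting centroids that can be handed to any K-Means loop.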

Note that initialization methods that involve a random element will produce different results each time the algorithm runs. It is best to try different initialization methods on the same dataset and then pick the one that works best.

### Determining K

There are three basic ways to determine the number of clusters (*k*):

- **K Needed** – the dataset needs to be divided into *k* clusters because we want to have exactly *k* clusters.
- **Observation** – after observing the dataset, *k* is believed to be the right number of clusters to divide it into.
- **Elbow Method** – described below.

#### Elbow Method

Suppose that we want to determine the best *k* from a chosen interval. Each time the same dataset is clustered with a different *k*, the *Sum Of Square Errors* (*SSE*) is calculated using this algorithm:

1. For each cluster, calculate the sum of distances between each point in it and the centroid assigned to it.
2. Square each of the calculated sums.
3. Add the squared sums together.

After calculating the *SSE*, *k* should be chosen so that adding another cluster does not greatly improve (or worsen) it. In simple words, the ideal *k*, as the name of the method suggests, is the one whose point (*k*, *SSE*) lies on the elbow of the curve (see the example below).
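The calculation described above can be sketched as a short function. It follows the procedure exactly as stated in this article (the summed distances of each cluster are squared and then added), which is not the same as the more common definition that sums the squared distance of every point. The function name is made up for this example, and Euclidean distance via `math.dist` is an assumption.

```python
import math


def sum_of_square_errors(clusters, centroids):
    """SSE as described in the article: square each cluster's summed distances."""
    total = 0.0
    for cluster, centroid in zip(clusters, centroids):
        # Step 1: sum of distances between the cluster's points and its centroid.
        cluster_sum = sum(math.dist(point, centroid) for point in cluster)
        # Steps 2 and 3: square the sum and add it to the running total.
        total += cluster_sum ** 2
    return total


clusters = [[(0, 0), (2, 0)], [(5, 5)]]
centroids = [(1, 0), (5, 5)]
print(sum_of_square_errors(clusters, centroids))  # 4.0: (1 + 1) squared, plus 0
```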

**For Example:**

For the dataset presented at the beginning of the article, the *SSE* was calculated for *k* in the range [2, 9]. It was calculated 5 times for each *k* (not counting unsuccessful runs that produced empty clusters) and averaged. **Forgy** was used as the initialization method; because it is random, averaging over 5 runs gives a more accurate result.


| k | Average SSE |
|---|---|
| 2 | 197518.9633510644 |
| 3 | 81437.48287746631 |
| 4 | 30924.831291238181 |
| 5 | 24388.24127790040 |
| 6 | 19080.48185258347 |
| 7 | 15324.033991149317 |
| 8 | 10569.739302834429 |
| 9 | 9224.895428024421 |

*k = 4* is optimal for this dataset. Choosing a lower *k* greatly increases the error, while choosing a bigger *k* does not significantly lower it. It is also worth mentioning that the higher *k* is, the greater the probability of getting empty clusters.
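The elbow can also be located numerically from the averaged values above. The sketch below uses one possible heuristic, chosen for this example and not taken from the article: for every *k*, divide the error reduction gained by reaching *k* by the reduction gained by adding one more cluster, and pick the *k* where that ratio is largest. The function name `elbow_k` is hypothetical.

```python
# Average SSE values copied from the table above.
average_sse = {
    2: 197518.9633510644,
    3: 81437.48287746631,
    4: 30924.831291238181,
    5: 24388.24127790040,
    6: 19080.48185258347,
    7: 15324.033991149317,
    8: 10569.739302834429,
    9: 9224.895428024421,
}


def elbow_k(sse_by_k):
    """Return the k with the largest ratio of gain before k to gain after k."""
    ks = sorted(sse_by_k)
    best_k, best_ratio = None, 0.0
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_reaching_k = sse_by_k[prev_k] - sse_by_k[k]
        gain_after_k = sse_by_k[k] - sse_by_k[next_k]
        if gain_after_k <= 0:
            continue  # no further improvement; skipped to avoid dividing by zero
        ratio = gain_reaching_k / gain_after_k
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


print(elbow_k(average_sse))  # 4
```

For these values the ratio is about 7.7 at *k* = 4 and below 4 for every other *k*, which agrees with the choice made by reading the elbow by eye.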

### Algorithm Step By Step

Suppose we have chosen **Forgy** as the initialization method, *k* to be 4, and the distance to be calculated as the Euclidean distance.

Points were divided as expected, forming 4 groups of exactly 50 points each, with the points in each group close to each other. Over 20 runs it took 7.8 iterations on average to cluster the dataset; the minimum was 4 and the maximum was 13. This variation is due to the fact that the **Forgy** initialization method works on a random basis.

### Algorithm Applications

The algorithm can be used for:

- Finding trends in data – clusters represent similarities,
- Reducing color palette of images,
- Sorting items,
- More.