K-Means Clustering


How it works:

  1. K centroids are created by selecting K random data points from the dataset as the initial centroids (K is a predefined number).
  2. The algorithm assigns each data point to the nearest centroid based on a distance metric (usually Euclidean distance).
  3. The mean of the points in each cluster is computed and used as that cluster's new centroid (the new centroid is not necessarily an actual data point).
  4. Steps 2 and 3 are repeated until a stopping criterion is met, e.g. the sum of distances between data points and their centroids (the intra-cluster distance) stops decreasing, the maximum number of iterations is reached, or the centroid values no longer change. A minimal sketch of these steps follows.
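Below is a minimal NumPy sketch of the four steps above (the function name `kmeans` and its parameters are illustrative; for simplicity it assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centroid = mean of the points assigned to each cluster
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```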

Elbow Method (to choose the best value of K)

One way to choose the best value of K is the elbow method:

  1. Run the K-means algorithm for K = 1 to K = N.
  2. For each value of K, calculate the WCSS (Within-Cluster Sum of Squares): the sum of the squared distances between each data point and the centroid of its cluster.
  3. As K increases, the WCSS decreases (it is largest when K equals 1).
  4. Plot WCSS against K: the curve drops rapidly at first and then flattens out, creating an elbow shape. From the elbow onward, the graph moves almost parallel to the X-axis. The K value at the elbow is the optimal number of clusters. A short sketch of this procedure follows.
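One way to compute the curve is with scikit-learn, whose `KMeans.inertia_` attribute is exactly the WCSS (the data array `X` and the candidate range 1..10 are assumptions for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)  # candidate values of K (assumed range)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares

plt.plot(ks, wcss, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()  # look for the elbow in this curve
```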

Downsides

  1. The number of clusters must be chosen in advance (this decision seriously affects the result).
  2. K-means tends to produce clusters of uniform size (roughly the same number of observations per cluster), even when the underlying data is not grouped that way.
  3. Spherical limitation: K-means assumes the data in each cluster lies within a sphere around its centroid, so elongated or irregularly shaped clusters are handled poorly.

When any of the assumptions above is violated, K-means can behave in unintuitive ways, as the sketch below illustrates.
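A small demonstration of the spherical limitation, using scikit-learn's `make_moons` to generate two crescent-shaped (non-spherical) clusters; the parameter values are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving crescents: clearly two clusters, but not spherical
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Comparing labels with y_true shows K-means cuts the plane with a
# roughly straight boundary, splitting each crescent in half instead
# of recovering the two moon-shaped clusters.
print((labels == y_true).mean())  # agreement well below 1.0 (label order may flip)
```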

