How it works:
1. K centroids are created by selecting K random data points as the initial centroids (K is a predefined number).
2. Each data point is assigned to the nearest centroid based on a distance metric (usually Euclidean distance).
3. K-means takes the mean of the data points in each cluster and uses that mean as the new centroid (which is not necessarily located at a data point).
4. Steps 2 and 3 are repeated until a stopping criterion is met, e.g. the sum of distances between data points and their centroids (the intra-cluster distance) is minimized, a maximum number of iterations is reached, or the centroid values no longer change (see the sketch after this list).
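A minimal sketch of these steps in plain NumPy (the function name `kmeans` and its parameters are illustrative, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (empty clusters are not handled here, for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```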
Elbow Method (to choose the best number of K)
One way to choose the best number of clusters K is the elbow method:
- run K-means for K = 1 to K = N
- for each value of K, calculate the WCSS (Within-Cluster Sum of Squares): the sum of squared distances between each data point and the centroid of its cluster
- as K increases, the WCSS decreases (the WCSS is largest when K = 1)
- when we plot WCSS against K, the curve drops rapidly at first and then flattens, creating an elbow shape; from that point on, the graph moves almost parallel to the X-axis. The K value at the elbow is the optimal K value, i.e. the optimal number of clusters (see the sketch after this list)
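A hedged sketch of the elbow method using scikit-learn, which exposes the WCSS of a fitted model as `inertia_` (the toy blob dataset here is just for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# X: (n_samples, n_features) array; a toy blob dataset as an example
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this K

# Plot WCSS against K and look for the elbow
plt.plot(ks, wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```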
Downsides
- you need to define the number of clusters in advance, and this choice strongly affects the result
- K-means tends to produce clusters of uniform size (each cluster containing a roughly equal number of observations), even when the data is not distributed that way
- spherical limitation: K-means assumes each cluster's points lie within a sphere around its centroid
When any of these assumptions is violated, K-means can behave in unintuitive ways, as the sketch below illustrates.
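As an illustration of the spherical limitation, here is a hypothetical sketch: K-means on scikit-learn's two-moons dataset, where the two clusters are well separated but crescent-shaped, so even with the correct K = 2 the algorithm splits each moon:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters: well separated, but not spherical
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# Even with the "correct" K=2, K-means carves the plane into two
# roughly spherical regions, cutting each moon in half
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Adjusted Rand Index: 1.0 means perfect recovery of the true clusters;
# K-means scores well below 1.0 here because the spherical assumption fails
print("ARI vs. true moons:", adjusted_rand_score(y_true, labels))
```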