Data Mining With K-Means Clustering

The k-means algorithm is one of the simplest clustering techniques, and it is commonly used in medical imaging, biometrics, and related fields. The advantage of k-means clustering is that it tells you about your data (it is an unsupervised technique) rather than requiring you to instruct the algorithm about the data at the start (as a supervised technique would). It is sometimes referred to as Lloyd’s algorithm, particularly in computer science circles, because the standard algorithm was first proposed by Stuart Lloyd in 1957. The term “k-means” was coined in 1967 by James MacQueen.

How the K-Means Algorithm Functions

The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to the cluster whose mean is nearest to that observation, then recomputes each cluster’s mean from the observations now assigned to it. These two steps repeat until the assignments stop changing.
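The assign-and-recompute loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the random choice of starting means, the `seed` and `max_iterations` parameters, and the sample observations are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups.

    Returns (means, assignments), where assignments[i] is the index
    of the cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Assumption: start from k observations picked at random.
    means = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each observation joins the nearest mean.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clusters have converged
        assignments = new_assignments
        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = mean_point(members)
    return means, assignments


# Hypothetical observations forming two visibly separate groups.
observations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
                (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, labels = k_means(observations, k=2)
print(means)
print(labels)
```

On this small sample the first three observations should end up sharing one label and the last three the other. Which group receives label 0 depends on the random starting means.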

Choosing the Number of Clusters

One of the main disadvantages of k-means clustering is that you must specify the number of clusters as an input to the algorithm. As designed, the algorithm cannot determine the appropriate number of clusters and depends upon the user to identify this in advance. For example, if you had a group of people to be clustered based upon binary gender identity as male or female, calling the k-means algorithm with the input k=3 would force the people into three clusters when only two, or an input of k=2, would provide a more natural fit. Similarly, if a group of individuals was easily clustered based upon home state and you called the k-means algorithm with the input k=20, the results might be too generalized to be effective. For this reason, it’s often a good idea to experiment with different values of k to identify the value that best suits your data. You also may wish to explore the use of other data mining algorithms in your quest for machine-learned knowledge.
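One common way to experiment with different values of k is to run the algorithm for each candidate and compare how tightly the resulting clusters fit, measured here as the total squared distance from each observation to its cluster mean. The sketch below is only an illustration: the function names and sample data are hypothetical, and it swaps in a deterministic "farthest-first" choice of starting means (instead of a random one) so that repeated runs give the same numbers.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Assumption: deterministic starting means. Begin with the first
    observation, then repeatedly add the observation that lies
    farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers),
        ))
    return centers


def within_cluster_ss(points, k, max_iterations=100):
    """Run k-means and return the within-cluster sum of squares."""
    means = farthest_first_centers(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assign each observation to its nearest mean.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Recompute each mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return sum(squared_distance(p, means[a])
               for p, a in zip(points, assignments))


# Hypothetical observations forming three visibly separate groups.
observations = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
                (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
                (9.0, 1.0), (9.2, 0.8), (8.8, 1.2)]

for k in range(1, 6):
    print(k, within_cluster_ss(observations, k))
```

The total always shrinks as k grows, so the smallest number is not the goal. A reasonable heuristic is to look for the value of k after which the improvement becomes marginal; on this sample that happens at k=3, which matches the three groups in the data.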
