Clustering, in data science, is a form of unsupervised learning: a type of machine learning where there are no predefined classes of data. Data classification, as I covered in my previous post, relies on knowing the data features and their classes. The "safe" or "risky" classes of data are determined by a set of values in the associated data features. In unsupervised learning, the class labels are unknown. This is what makes data clustering a complex topic; it is an active area of research and development in data science, and a key foundation in the field of artificial intelligence.
The general idea of data clustering, as the name suggests, is to cluster (or group) data based on similar characteristics, so that data points in the same cluster not only share common traits but are also dissimilar to points in other clusters. The outcome of this grouping is a set of data classes, a process called "feature learning": the machine learns from the data set without a known or predefined class, that is, unsupervised.
Once the feature is learned, or the classes of data identified, data classification algorithms can then be applied. Clustering can therefore be a first step in machine learning: where the data and its structure are unknown, a set of computational algorithms can surface patterns for further processing such as pattern identification and classification.
As you might have imagined, this is quite a complex subject with multi-dimensional facets, requiring many algorithms and a great deal of computational processing. However, the intuition is straightforward; the complexity lies in the details of the various clustering methods and approaches, both technical and theoretical. One such method is K-means clustering, a widely used approach that serves as a foundation to clustering.
First, K stands for (notice I did not use "means", as in K means this… I used "stands" instead to avoid confusion) a predefined variable, set by a clustering algorithm or by a data programmer. It represents the number of partitions, or clusters, we want for a data set with unknown classes. The "means" in K-means is exactly what it says: means, or the averages of the feature values. K, for example, can be 2 (a randomly selected value, or one chosen via an algorithmic process), which means that the entire data set will be clustered into two groups once the computation is done. Next, K mean values are randomly selected, in this case 2 random means. Each mean value represents the center of a cluster. It is incorrect initially, given that the values were selected at random, but it is corrected by iterating through the entire data set. Each scan of the data updates the mean values, and this iterate-and-update process continues until the mean values no longer change, which means that the centroids of the data set have been found.
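To make this intuition concrete, here is a minimal sketch of a single assign-and-update pass for K = 2. The one-dimensional data points and the two initial means below are made up for illustration:

```python
# One assign-and-update pass of K-means with K = 2.
# The data points and initial means are made-up example values.
data = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
means = [1.0, 9.0]  # two arbitrarily chosen initial means

# Assign each point to its nearest mean (by squared distance).
clusters = {0: [], 1: []}
for x in data:
    nearest = min(range(len(means)), key=lambda i: (x - means[i]) ** 2)
    clusters[nearest].append(x)

# Update each mean to the average of its assigned points.
means = [sum(pts) / len(pts) for pts in clusters.values()]
print(means)  # updated cluster centers after one iteration
```

Repeating the assign-and-update pass until the means stop moving is the whole algorithm; with this tiny data set the means settle near the two obvious groups.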
The K-Means algorithm for partitioning a data set is as follows.
1. Arbitrarily choose k, the number of clusters you want out of the data set.
2. Arbitrarily choose an initial mean value for each of the k clusters. This benefits from some prior knowledge of the data set, such as statistical values like its overall mean or its range.
3. Assign each data object to a cluster by computing its distance to each centroid (the arbitrarily chosen mean value of each k) and picking the nearest one.
The assignment for a data point x can be written as argmin over ci in C of dist(ci, x)², where dist() is the Euclidean distance between ci and x: x is the data point and ci is the k-mean value (centroid). The distance is squared, which exaggerates larger distances; in 5th grade speak, squaring makes the values more distinguishable, for computational purposes or for visualization purposes when presented in a graph. Summed over all points, these squared distances form the quantity the algorithm drives down. argmin selects the centroid ci, from the set of centroids C, that minimizes this squared distance.
4. Update the mean value of each cluster by calculating the mean of all data points currently assigned to it.
Here Si is the set of data points assigned to centroid ci; the new centroid is the mean of those points, that is, the sum of xi over all elements in Si, divided by the size of Si.
5. Repeat steps 3 and 4 until the cluster mean values no longer change and the data point assignments no longer change. This is what is called the convergence of the algorithm.
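The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation (real libraries add smarter initialization and other refinements); the function name and the sample points are my own for this example:

```python
import random

def kmeans(data, k, max_iters=100, seed=0):
    """Plain-Python K-means on points given as tuples of floats."""
    rng = random.Random(seed)
    # Steps 1-2: choose k and pick k initial means (here, random data points).
    centroids = rng.sample(data, k)
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid
        # using squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for x in data:
            d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(x)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c, pts in zip(centroids, clusters):
            if pts:
                new_centroids.append(tuple(sum(v) / len(pts) for v in zip(*pts)))
            else:
                new_centroids.append(c)  # keep an empty cluster's centroid as-is
        # Step 5: stop when the centroids no longer change (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Example: two well-separated groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0),
          (8.0, 8.0), (9.0, 10.0), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

On data this cleanly separated, the algorithm converges in a few iterations to the two obvious groups regardless of which points it samples as initial means.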
The algorithm steps I have outlined above are a much simplified version of what you'd find in textbooks and online resources. Wikipedia (https://en.wikipedia.org/wiki/K-means_clustering), for example, defines it as a two-step process with the details specified mathematically.
The idea behind unsupervised machine learning, as with the way humans learn, is to first discover patterns in the data, whether through similarities, categorization, or outliers. Indeed, data clustering is a deep subject, and there are many more aspects to it. K-means clustering, for example, is just one of many clustering approaches: it is a partitioning approach, while others are hierarchical, density-based, or grid-based methods. The approach in this article is the organization of data by partitioning and clustering in a way that yields meaningful associations. Granted, this article is an over-simplification of data clustering, but it serves to demystify the idea behind unsupervised learning, with K-means serving as an important foundation for machine learning.
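As a final illustration of the hand-off from clustering to classification mentioned earlier: once the centroids have been learned, a new data point can be labeled by its nearest centroid. This is a tiny hypothetical sketch; the cluster names and centroid coordinates below are assumed values, not the output of a real run:

```python
# Hypothetical centroids, assumed to have been learned by K-means earlier.
centroids = {"cluster A": (1.8, 2.3), "cluster B": (8.5, 9.0)}

def classify(point):
    """Label a new point with the name of its nearest learned centroid."""
    return min(
        centroids,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(point, centroids[name])),
    )

print(classify((2.0, 3.0)))  # → cluster A
print(classify((9.0, 8.0)))  # → cluster B
```

This nearest-centroid rule is about the simplest classifier one can build on top of clustering output, which is exactly the sense in which clustering serves as a first step toward classification.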