The K-Centroids Diagnostic tool helps the user assess the appropriate number of clusters to specify, given the data and the selected clustering algorithm (K-Means, K-Medians, or Neural Gas). The tool is graphical, and is based on calculating two different statistics over bootstrap replicate samples of the original data for a range of clustering solutions that differ in the number of clusters specified. The motivation behind this approach is that if the records in a database truly fall into a set of stable clusters, then different random samples of those records should yield approximately the same set of clusters across the bootstrap replicates, except for small differences due to random sampling variability and to the randomness in the starting centroids, which the general K-Centroids algorithm generates by selecting K points at random. The two measures examined are the adjusted Rand index and the Calinski-Harabasz index (also known as the variance ratio criterion and the pseudo-F statistic).
The adjusted Rand index provides a measure of similarity between two different clustering solutions, taking a maximum value of one when the two clustering solutions perfectly overlap.* The index can be used to determine both the relative and absolute reproducibility of a clustering solution by comparing pairs of solutions, where each pair is based on a different sample of customer data. The greater the overlap between pairs of solutions, the greater the reproducibility of the cluster structure.
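As an illustration of how the adjusted Rand index behaves, the following is a minimal Python sketch of the standard pair-counting formula. It is not the tool's own implementation (the tool runs in R), and the function name `adjusted_rand_index` is a hypothetical name chosen for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two cluster assignments of the same records.

    Illustrative sketch only; the function name is hypothetical.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same records")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two records: trivially identical

    # Pairs of records placed together in both solutions (contingency cells),
    # and pairs placed together within each solution on its own.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = 0.5 * (together_a + together_b)         # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case: both solutions are trivial partitions
    return (together_both - expected) / (maximum - expected)


# Same grouping with different label names: perfect overlap.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that cut across each other: worse than chance agreement.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index depends only on which records are grouped together, not on the cluster labels themselves, which is why it suits comparisons between independently fitted solutions.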
The Calinski-Harabasz index is based on the weighted ratio of the between cluster sum of squares (the measure of cluster separation) to the within cluster sum of squares (the measure of how tightly packed the points are within a cluster). Ideally, the clusters should be well separated, so the between cluster sum of squares value should be large, but points within a cluster should be as close as possible to one another, resulting in smaller values of the within cluster sum of squares measure. Since the Calinski-Harabasz index is a ratio, with the between cluster sum of squares in the numerator and the within cluster sum of squares in the denominator, cluster solutions with larger values of the index correspond to "better" solutions than cluster solutions with smaller values.
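The ratio described above can be sketched in a few lines of Python. This example assumes the usual textbook weighting, in which the between cluster sum of squares is divided by (k - 1) and the within cluster sum of squares by (n - k), with squared Euclidean distances to cluster means. The function name `calinski_harabasz` is hypothetical, and the tool's own R implementation may differ in detail.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index for points (sequences of floats) and cluster labels.

    Illustrative sketch only; assumes squared Euclidean distance to cluster means.
    """
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and more records than clusters")

    dim = len(points[0])

    def centroid(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # separation: cluster means versus the overall mean
    within = 0.0   # tightness: points versus their own cluster mean
    for rows in clusters.values():
        center = centroid(rows)
        between += len(rows) * squared_distance(center, overall)
        within += sum(squared_distance(row, center) for row in rows)

    if within == 0.0:
        return float("inf")  # every point coincides with its cluster mean
    return (between / (k - 1)) / (within / (n - k))


# Two tight, well separated groups on a line.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
print(calinski_harabasz(points, [0, 0, 1, 1]))  # 50.0
```

In this small example the between cluster sum of squares is 100 and the within cluster sum of squares is 4, so the index is (100 / 1) / (4 / 2) = 50.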
The output of the tool is information about the distribution of the two statistics for differing numbers of clusters across the bootstrap replicates. The information is conveyed via two box and whisker plots (one each for the adjusted Rand index and the Calinski-Harabasz index) and summary statistics for the two measures. The preferred number of clusters based on each measure is the one with the highest mean and median among the solutions compared. In addition, it is desirable that the dispersion in the calculated statistics across the bootstrap replicates not be too large.
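The selection rule above can be illustrated with a small Python sketch. The numbers below are made-up placeholder values, not output from the tool, and the helper names `summarize` and `preferred_k` are hypothetical. The sketch ranks by median and uses the mean to break ties, which is one assumption about how to combine the two criteria. When the mean and median disagree, the box and whisker plots are the better guide.

```python
from statistics import mean, median, quantiles


def summarize(values):
    """Mean, median, and interquartile range of one statistic across replicates."""
    q1, _, q3 = quantiles(values, n=4)
    return {"mean": mean(values), "median": median(values), "iqr": q3 - q1}


def preferred_k(stats_by_k):
    """Number of clusters with the highest median, using the mean to break ties."""
    return max(stats_by_k, key=lambda k: (median(stats_by_k[k]), mean(stats_by_k[k])))


# Hypothetical adjusted Rand index values per number of clusters (illustration only).
rand_by_k = {
    2: [0.50, 0.60, 0.70, 0.55],
    3: [0.80, 0.90, 0.85, 0.88],
    4: [0.40, 0.90, 0.50, 0.45],
}

for k, values in rand_by_k.items():
    print(k, summarize(values))
print("preferred number of clusters:", preferred_k(rand_by_k))
```

In this made-up data, three clusters has both the highest center and a narrow spread, while four clusters shows a wide spread, which is the kind of dispersion the tool's plots are meant to expose.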
This tool can be very computationally intensive. The intensity depends on the number of records used in the calculation (which can be altered via the subset expression option), the number of different clustering solutions examined (determined by the range between the minimum and maximum number of clusters), the number of bootstrap replicates, and the number of different starting seeds used for each cluster solution (the number of starting seeds option). Reducing the number of bootstrap replicates greatly reduces the amount of computer time needed, but at a large cost in precision. For actual analysis, it is recommended that the user never use fewer than 100 bootstrap replicates, and use more if possible.
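To get a rough feel for how these settings multiply, the sketch below counts clustering fits under a simplifying assumption: one fit per number of clusters, per bootstrap replicate, per starting seed. This is an assumption for illustration, not a description of the tool's internals, and the function name `estimated_fits` is hypothetical. The actual run time also grows with the number of records.

```python
def estimated_fits(min_clusters, max_clusters, bootstrap_replicates, starting_seeds):
    """Rough count of clustering fits, assuming one fit per (k, replicate, seed)."""
    if max_clusters < min_clusters:
        raise ValueError("max_clusters must be at least min_clusters")
    solutions = max_clusters - min_clusters + 1  # cluster counts examined, inclusive
    return solutions * bootstrap_replicates * starting_seeds


# Examining 2 through 6 clusters with 100 replicates and 3 starting seeds.
print(estimated_fits(2, 6, 100, 3))  # 1500
```

Under this assumption, halving the number of replicates halves the work, which is why that setting has such a large effect on run time.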
This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.
An Alteryx data stream.
*en.wikipedia.org/wiki/Rand_index