K-Centroids Diagnostics Tool

The K-Centroids Diagnostics tool helps the user assess the appropriate number of clusters to specify, given the data and the selected clustering algorithm (K-Means, K-Medians, or Neural Gas). The tool is graphical and is based on calculating two statistics over bootstrap replicate samples of the original data, for a range of clustering solutions that differ in the number of clusters specified. The motivation behind this approach is that if the records in a database truly fall into a set of stable clusters, then different random samples of those records should produce approximately the same set of clusters across the bootstrap replicates. Only small differences should remain, due both to random sampling variability and to the randomness in the general K-Centroids algorithm, which generates its starting set of centroids by selecting K points at random. The two measures examined are the adjusted Rand index and the Calinski–Harabasz index (also known as the variance ratio criterion or the pseudo-F statistic).
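
The resampling step described above can be sketched in a few lines of Python. This is only an illustration, not the tool's own R code: the function name bootstrap_pairs, the toy records, and the choice to draw two samples per replicate (so that a pair of clustering solutions can later be compared) are all assumptions.

```python
import random

def bootstrap_pairs(records, n_replicates, seed=0):
    """Yield (sample_a, sample_b) pairs of bootstrap samples.

    Each sample has the same size as `records` and is drawn with
    replacement, so some records repeat and others are left out.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(records)
    for _ in range(n_replicates):
        sample_a = rng.choices(records, k=n)
        sample_b = rng.choices(records, k=n)
        yield sample_a, sample_b

# Toy two-field records, invented for illustration only.
records = [(1.0, 2.0), (1.2, 1.9), (8.0, 8.1), (7.9, 8.3), (4.0, 0.5)]
pairs = list(bootstrap_pairs(records, n_replicates=3))
```

In a full analysis, each sample in a pair would be clustered for every candidate number of clusters, and the resulting solutions would then be scored with the two indices.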

The adjusted Rand index provides a measure of similarity between two different clustering solutions, taking a maximum value of one when the two clustering solutions perfectly overlap.* The index can be used to determine both the relative and absolute reproducibility of a clustering solution by comparing pairs of solutions, where each pair is based on a different sample of customer data. The greater the overlap between pairs of solutions, the greater the reproducibility of the cluster structure.
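
As a rough illustration of how the index behaves, the following Python sketch computes the adjusted Rand index from two lists of cluster labels for the same records, using the standard pair-counting formula. The function name adjusted_rand_index is illustrative, and the code is not taken from the tool itself.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same records."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same records")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two records: nothing to disagree on

    # Count pairs of records placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (maximum - expected)

# Same grouping with swapped label names: perfect overlap.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that cut across each other: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is adjusted for chance, values near zero indicate agreement no better than random labeling, and negative values are possible.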

The Calinski–Harabasz index is based on the weighted ratio of the between cluster sum of squares (the measure of cluster separation) to the within cluster sum of squares (the measure of how tightly packed the points are within a cluster). Ideally, the clusters should be well separated, so the between cluster sum of squares value should be large, but points within a cluster should be as close as possible to one another, resulting in smaller values of the within cluster sum of squares measure. Since the Calinski–Harabasz index is a ratio, with the between cluster sum of squares in the numerator and the within cluster sum of squares in the denominator, cluster solutions with larger values of the index correspond to “better” solutions than cluster solutions with smaller values.
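
The ratio can be made concrete with a short Python sketch. It assumes the usual definition of the index, with the between cluster sum of squares divided by (k - 1) and the within cluster sum of squares divided by (n - k), and it uses cluster means and squared Euclidean distance. The function name and the toy points are illustrative only.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index for points (tuples of numbers) and their labels."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    dim = len(points[0])

    def centroid(rows):
        return [sum(row[j] for row in rows) / len(rows) for j in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # cluster separation
    within = 0.0   # cluster tightness
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)

    if within == 0:
        return float("inf")  # every point sits exactly on its cluster mean
    return (between / (k - 1)) / (within / (n - k))

# Four toy points forming two tight, well separated groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # groups match the geometry
bad = calinski_harabasz(points, [0, 1, 0, 1])   # groups cut across it
```

Here the labeling that matches the geometry scores far higher than the one that cuts across it, which is the behavior the index is meant to reward.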

The output of the tool is information about the distribution of the two statistics for differing numbers of clusters across the bootstrap replicates. The information is conveyed via two box and whisker plots (one each for the adjusted Rand index and the Calinski–Harabasz index) and summary statistics for the two measures. For each measure, the preferred number of clusters is the one with the highest mean and median among the solutions compared. In addition, it is desirable that the dispersion in the calculated statistics across the bootstrap replicates not be too large.
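
The selection rule can be illustrated with a small Python sketch that summarizes one index per number of clusters. The index values below are made up for illustration and are not output from the tool; the function name summarize is likewise hypothetical.

```python
from statistics import mean, median, stdev

def summarize(index_by_k):
    """Return {k: (mean, median, standard deviation)} across replicates."""
    return {
        k: (mean(values), median(values), stdev(values))
        for k, values in index_by_k.items()
    }

# Made-up adjusted Rand index values, three replicates per cluster count.
example = {
    2: [0.50, 0.60, 0.70],
    3: [0.80, 0.90, 0.85],
    4: [0.60, 0.40, 0.90],
}
summary = summarize(example)
best_by_mean = max(summary, key=lambda k: summary[k][0])
best_by_median = max(summary, key=lambda k: summary[k][1])
```

In this invented example, three clusters has both the highest mean and the highest median, and also a smaller spread than four clusters. When the mean and median point to different cluster counts, the box and whisker plots are the better guide.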

This tool can be very computationally intensive. The intensity depends on the number of records used in the calculation (which can be altered via the subset expression option), the number of different clustering solutions examined (determined by the range between the minimum and maximum number of clusters), the number of bootstrap replicates, and the number of different starting seeds used for each cluster solution (the number of starting seeds option). Reducing the number of bootstrap replicates greatly reduces the amount of computer time needed, but at a large cost in precision. For actual analysis, it is recommended that the user never use fewer than 100 bootstrap replicates, and use more if possible.
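
A back-of-the-envelope way to see how these settings combine is to count clustering runs. The Python sketch below assumes one run per cluster count, bootstrap replicate, and starting seed. That is a simplification: the tool's actual workload is not documented here, and the cost of each run also grows with the number of records. The function name estimated_runs is hypothetical.

```python
def estimated_runs(min_clusters, max_clusters, replicates, seeds):
    """Rough count of clustering runs under the stated assumption."""
    cluster_counts = max_clusters - min_clusters + 1
    return cluster_counts * replicates * seeds

# Cluster counts 2 through 6, 100 replicates, 3 starting seeds.
full = estimated_runs(2, 6, 100, 3)   # 1500
# Halving the replicates halves the estimate.
half = estimated_runs(2, 6, 50, 3)    # 750
```

Because the factors multiply, narrowing the range of cluster counts or lowering the number of starting seeds reduces the run time in the same proportion as lowering the number of replicates.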

This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R Tool.

Configure the tool

  1. Fields (select two or more): Select the numeric fields to be used in constructing the cluster solution.
  2. Standardize the fields...: Select this option to standardize the variables using either a z-score or a unit interval standardization. Clustering solutions are very sensitive to the scaling of the data, particularly if one field is on a very different scale than another, so scaling the data is something that should be considered.
    • The z-score transformation involves subtracting the mean value of each field from the values of the field and then dividing by the standard deviation of the field. This results in a new field that has a mean of zero and a standard deviation of one.
    • The unit interval transformation involves subtracting the minimum value of a field from the field values and then dividing by the difference between the maximum and minimum value of the field. This results in a new field that has values that range from zero to one.
  3. Clustering method: Choose one of K-Means, K-Medians, or Neural Gas.
  4. Minimum number of clusters: Select the minimum number of clusters to consider in the solution.
  5. Maximum number of clusters: Select the maximum number of clusters to consider in the solution.
  6. Bootstrap replicates: The number of bootstrap replicates to use for calculating the two indices. Possible values are between 50 and 200.
  7. Number of starting seeds: K-Centroids methods start by taking randomly selected points as the initial centroids. The final solution determined by each of the methods can be influenced by the initial points. If multiple starting seeds are used, the best solution out of the set of solutions is kept as the final solution.
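
The two standardization options described under "Standardize the fields" can be sketched in Python as follows. The function names are illustrative, and the use of the sample standard deviation for the z-score is an assumption, since the text does not say which form the tool uses. Both functions fail on a field whose values are all identical, because the divisor is zero.

```python
from statistics import mean, stdev

def z_score(values):
    """Subtract the field mean, then divide by the field standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption)
    return [(v - center) / spread for v in values]

def unit_interval(values):
    """Subtract the field minimum, then divide by (maximum - minimum)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# One toy field on a large scale.
income = [10.0, 20.0, 30.0]
income_z = z_score(income)           # mean zero, standard deviation one
income_unit = unit_interval(income)  # values between zero and one
```

Applying the same transformation to every selected field puts the fields on a comparable scale before distances between records are computed.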

*en.wikipedia.org/wiki/Rand_index