K-Centroids represents a class of algorithms for partitioning cluster analysis. These methods take the records in a database and divide (partition) them into the "best" K groups based on some criterion. Nearly all partitioning cluster analysis methods base cluster membership on the proximity of each record to one of K points (or "centroids") in the data. The objective of these clustering algorithms is to find, for a pre-specified number of clusters, the centroid locations that optimize some criterion with respect to the distance between each cluster's centroid and the points assigned to that cluster. The specific algorithms differ from one another in both the criterion used to define a cluster centroid and the distance measure used to define the proximity of a point to its cluster's centroid.
Three specific types of K-Centroids cluster analysis can be carried out with this tool: K-Means, K-Medians, and Neural Gas clustering. K-Means uses the mean value of the fields for the points in a cluster to define a centroid, and Euclidean distance is used to measure a point's proximity to a centroid.* K-Medians uses the median value of the fields for the points in a cluster to define a centroid, and Manhattan (also called city-block) distance is used to measure proximity.** Neural Gas clustering is similar to K-Means in that it uses the Euclidean distance between a point and the centroids to assign that point to a particular cluster.*** However, the method differs from K-Means in how the cluster centroids are calculated: the location of a cluster's centroid is a weighted average of all data points. Points assigned to the cluster for which the centroid is being constructed (the focal cluster) receive the greatest weight, points from the cluster most distant from the focal cluster receive the lowest weight, and the weights given to points in intermediate clusters decrease as the distance between the focal cluster and the cluster to which a point is assigned increases.
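To make the difference between K-Means and K-Medians concrete, here is a minimal sketch of the two centroid definitions and the two distance measures described above. It is written in Python for illustration, not in the R code the tool itself runs, and the function names and sample data are hypothetical. The Neural Gas weighting is not shown.

```python
from statistics import mean, median

def euclidean(point, centroid):
    # Straight-line distance, as used by K-Means.
    return sum((a - b) ** 2 for a, b in zip(point, centroid)) ** 0.5

def manhattan(point, centroid):
    # City-block distance, as used by K-Medians.
    return sum(abs(a - b) for a, b in zip(point, centroid))

def mean_centroid(points):
    # Field-by-field mean of the points in a cluster.
    return [mean(field) for field in zip(*points)]

def median_centroid(points):
    # Field-by-field median of the points in a cluster.
    return [median(field) for field in zip(*points)]

# Hypothetical cluster of three records with two fields each.
cluster = [(1.0, 2.0), (2.0, 4.0), (9.0, 6.0)]
print(mean_centroid(cluster))    # [4.0, 4.0]
print(median_centroid(cluster))  # [2.0, 4.0]
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))  # 7.0
```

Note how the outlying value 9.0 pulls the mean centroid away from the other two points, while the median centroid stays with them.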
This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.
Input
An Alteryx data stream.
Configuration Properties
Solution name: Each cluster solution needs to be given a name so it can be identified later. Solution names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed. Because R is case sensitive, names that differ only in case are treated as different names.
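The naming rule can be expressed as a simple pattern check. The following Python sketch is only an illustration of the rule as stated above: the regular expression and the function name are assumptions, not the validation the tool actually performs.

```python
import re

# Assumed pattern: one letter first, then any mix of letters, digits, "." and "_".
SOLUTION_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._]*")

def is_valid_solution_name(name):
    # fullmatch requires the whole name to fit the pattern.
    return SOLUTION_NAME.fullmatch(name) is not None

print(is_valid_solution_name("Cluster_1.v2"))  # True
print(is_valid_solution_name("1st_solution"))  # False: starts with a digit
print(is_valid_solution_name("my-solution"))   # False: hyphen is not allowed
```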
Fields (select one or more): Select the numeric fields to be used in constructing the cluster solution.
Standardize the fields...: Selecting this option gives the user the choice of standardizing the variables using either a z-score or a unit interval standardization.
The z-score transformation involves subtracting the mean value of each field from the values of the field and then dividing by the standard deviation of the field. This results in a new field that has a mean of zero and a standard deviation of one.
The unit interval transformation involves subtracting the minimum value of a field from the field values and then dividing by the difference between the maximum and minimum value of the field. This results in a new field that has values that range from zero to one. Clustering solutions are very sensitive to the scaling of the data, particularly if one field is on a very different scale than another. As a result, scaling the data is something that should be considered.
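Both transformations are short enough to sketch directly. The Python below is an illustration only: the function names and sample values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form the tool uses.

```python
from statistics import mean, stdev

def z_score(values):
    # Subtract the field mean, then divide by the field standard deviation.
    # Assumption: sample standard deviation (n - 1 denominator).
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

def unit_interval(values):
    # Subtract the field minimum, then divide by the field range.
    # A constant field (max == min) would divide by zero and needs special handling.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical field on a large scale.
income = [20000.0, 40000.0, 60000.0]
print(z_score(income))        # [-1.0, 0.0, 1.0]
print(unit_interval(income))  # [0.0, 0.5, 1.0]
```

After either transformation, a field measured in tens of thousands no longer dominates the distance calculation over a field measured in single digits.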
Number of clusters: Select the number of clusters in the solution.
Number of starting seeds: K-Centroids methods start by taking randomly selected points as the initial centroids. The final solution determined by each of the methods can be influenced by the initial points. If multiple starting seeds are used, the best solution out of the set of solutions is kept as the final solution.
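The effect of multiple starting seeds can be sketched with a small K-Means loop that is run once per seed, keeping the run with the lowest cost. This is a simplified Python illustration, not the tool's implementation. The function names are hypothetical, and using the total squared distance from each point to its centroid as the measure of "best" is an assumption.

```python
import random

def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def k_means(points, k, seed, max_iter=100):
    # Start from k randomly selected points as the initial centroids.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the solution is stable
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(f) / len(members) for f in zip(*members)]
    # Assumed quality measure: total squared distance to assigned centroids.
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return cost, centroids, assignment

def best_of_seeds(points, k, n_seeds):
    # One run per seed; keep the run with the lowest cost.
    runs = [k_means(points, k, seed) for seed in range(n_seeds)]
    return min(runs, key=lambda run: run[0])

# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cost, centroids, assignment = best_of_seeds(points, k=2, n_seeds=5)
print(cost, assignment)
```

More seeds cost more computation but reduce the chance that a poor set of initial centroids determines the final solution.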
Plot Options
Plot points: If checked, all points in the data will be plotted, and represented by the cluster number each point is assigned to in the solution.
Plot centroids: If checked, cluster centroids will be plotted, each represented by the number of the cluster for which it is the centroid.
The highest number of dimensions to include in biplots: A biplot is a means of visualizing a clustering solution (via principal components) in a lower-dimensional space. The visualization is done two dimensions at a time. This option sets the upper limit of the dimensions to use in the visualization. For example, if this parameter is set to "3", then biplots will include the first and second, first and third, and second and third principal components in three separate figures.
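The set of biplots produced for a given setting is every pair of principal components up to the chosen limit. A short Python sketch (the function name is hypothetical) enumerates the pairs:

```python
from itertools import combinations

def biplot_pairs(max_dims):
    # Every unordered pair of principal components from 1 up to max_dims.
    return list(combinations(range(1, max_dims + 1), 2))

print(biplot_pairs(3))       # [(1, 2), (1, 3), (2, 3)]
print(len(biplot_pairs(4)))  # 6
```

The number of figures grows as n(n - 1)/2, so a setting of 4 yields six biplots and a setting of 5 yields ten.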
Graphics Options
Plot size: Specify the width and height dimensions of the resulting plot, using either inches or centimeters.
Graph resolution: The resolution (in dots per inch) of any plot(s) produced by the macro. The choices are:
1x (96 dpi)
2x (192 dpi)
3x (288 dpi)
The 1x resolution is best for reports intended to be viewed exclusively on a computer screen (e.g., HTML reports), while 3x resolution will be best for PDF files or formats intended to be printed. In some cases, 3x resolution is set to 576 dpi to improve plot clarity.
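Plot size and graph resolution together determine the pixel dimensions of the output image. The Python sketch below shows the arithmetic. The function name and the 5.5-inch example size are hypothetical, and rounding to the nearest pixel is an assumption.

```python
def plot_pixels(width, height, dpi, units="in"):
    # Convert centimeters to inches first (2.54 cm per inch).
    if units == "cm":
        width, height = width / 2.54, height / 2.54
    # Pixels = inches * dots per inch, rounded to whole pixels (assumption).
    return round(width * dpi), round(height * dpi)

print(plot_pixels(5.5, 5.5, 96))   # (528, 528)   1x
print(plot_pixels(5.5, 5.5, 288))  # (1584, 1584) 3x
```

The same physical plot size therefore produces an image with nine times as many pixels at 3x as at 1x, which is why the higher settings suit print and PDF output.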
Base font size (points): The point size of the base font used to produce the title and labels of the plot(s) to be produced. The plotting functions will expand the size of the plot title to be larger than the base font automatically.
Output
There are 2 output streams:
O Output: consists of a table of the serialized model with model name and the size of the object.
R Output: consists of the report snippets generated by the K-Centroids Cluster Analysis Tool: a statistical summary and cluster solution plots.