K-Centroids Cluster Analysis Tool

K-Centroids represent a class of algorithms for partitioning cluster analysis. These methods divide (partition) the records in a database into the “best” K groups based on some criterion. Nearly all partitioning cluster analysis methods base cluster membership on the proximity of each record to one of K points (or “centroids”) in the data. For a pre-specified number of clusters, the objective is to find the centroid locations that optimize some criterion with respect to the distance between each cluster’s centroid and the points assigned to that cluster. The specific algorithms differ from one another in both the criterion used to define a cluster centroid and the distance measure used to define the proximity of a point to its cluster’s centroid.

Three specific types of K-Centroids cluster analysis can be carried out with this tool: K-Means, K-Medians, and Neural Gas clustering. K-Means uses the mean value of the fields for the points in a cluster to define a centroid, and Euclidean distance is used to measure a point’s proximity to a centroid.* K-Medians uses the median value of the fields for the points in a cluster to define a centroid, and Manhattan (also called city-block) distance is used to measure proximity.** Neural Gas clustering is similar to K-Means in that it uses the Euclidean distance between a point and the centroids to assign that point to a particular cluster.*** However, the method differs from K-Means in how the cluster centroids are calculated. The location of a cluster’s centroid is a weighted average of all data points. Points assigned to the cluster for which the centroid is being constructed (the focal cluster) receive the greatest weight, points from the cluster most distant from the focal cluster receive the lowest weight, and the weights given to points in intermediate clusters decrease as the distance between the focal cluster and the cluster to which a point is assigned increases.
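
As a rough illustration of the differences described above, the following Python sketch shows one assignment step and one centroid update for K-Means and K-Medians, plus a batch-style Neural Gas update. It is not the code this tool runs (the tool relies on R). The function names, the NumPy dependency, the weight exp(-rank / lam), and the lam parameter are all assumptions made for illustration. In particular, the Neural Gas function ranks centroids by their distance to each point, which is a common textbook formulation and only approximates the cluster-level description above.

```python
import numpy as np

def assign(points, centroids, metric="euclidean"):
    """Return, for each point, the index of its nearest centroid."""
    diff = points[:, None, :] - centroids[None, :, :]
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=2))
    else:  # "manhattan" (city-block)
        dist = np.abs(diff).sum(axis=2)
    return dist.argmin(axis=1)

def update(points, labels, k, method="kmeans"):
    """Field-wise mean (K-Means) or median (K-Medians) of each cluster.

    Assumes every cluster has at least one point assigned to it.
    """
    center = np.mean if method == "kmeans" else np.median
    return np.array([center(points[labels == j], axis=0) for j in range(k)])

def neural_gas_update(points, centroids, lam=1.0):
    """Hypothetical batch Neural Gas step: every point contributes to every
    centroid, with weight exp(-rank / lam), where rank 0 is the centroid
    closest to that point."""
    diff = points[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    ranks = dist.argsort(axis=1).argsort(axis=1)  # per-point rank of each centroid
    weights = np.exp(-ranks / lam)                # shape (n_points, k)
    return (weights.T @ points) / weights.sum(axis=0)[:, None]

# Made-up example data: three points near the origin, two far away.
points = np.array([[0.0, 0.0], [1.0, 0.0], [8.0, 0.0], [20.0, 20.0], [22.0, 20.0]])
start = np.array([[0.0, 0.0], [20.0, 20.0]])

labels = assign(points, start)
print(update(points, labels, k=2, method="kmeans"))
print(update(points, assign(points, start, "manhattan"), k=2, method="kmedians"))
print(neural_gas_update(points, start, lam=0.5))
```

In this made-up data the first cluster contains the x values 0, 1, and 8, so its K-Means centroid sits at x = 3 while its K-Medians centroid sits at x = 1, which shows how the median is less affected by an outlying point. As lam approaches zero the Neural Gas sketch gives almost all weight to the nearest centroid and approaches the K-Means update.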

This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R Tool.

Configure the tool

Use the Configuration tab to set the controls for the cluster analysis.

  1. Solution name: Each cluster solution needs to be given a name so it can be identified later. Solution names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case sensitive.
  2. Fields (select two or more): Select the numeric fields to be used in constructing the cluster solution.
  3. Standardize the fields...: Select this option to standardize the fields using either a z-score or a unit interval standardization. Clustering solutions are very sensitive to the scaling of the data, particularly if one field is on a very different scale than another, so scaling the data should be considered.
    • The z-score transformation subtracts the mean value of each field from the values of the field and then divides by the standard deviation of the field. This results in a new field that has a mean of zero and a standard deviation of one.
    • The unit interval transformation subtracts the minimum value of a field from the field values and then divides by the difference between the maximum and minimum values of the field. This results in a new field that has values that range from zero to one.
  4. Clustering method: Choose one of K-Means, K-Medians, or Neural Gas.
  5. Number of clusters: Select the number of clusters in the solution.
  6. Number of starting seeds: K-Centroids methods start by taking randomly selected points as the initial centroids. The final solution determined by each of the methods can be influenced by the initial points. If multiple starting seeds are used, the best solution out of the set of solutions is kept as the final solution.
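
The two standardization options in item 3 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's own code. The function names are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

def z_score(values):
    """Subtract each field's mean and divide by its standard deviation.

    Assumption: the sample standard deviation (ddof=1) is used.
    A constant field would cause a division by zero.
    """
    x = np.asarray(values, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

def unit_interval(values):
    """Rescale each field to the range [0, 1] using its minimum and maximum."""
    x = np.asarray(values, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

# Two made-up fields on very different scales (one row per record).
data = [[1, 100], [2, 200], [3, 300]]
print(z_score(data))
print(unit_interval(data))
```

After either transformation the two fields in this example become identical columns, so neither field dominates a distance calculation simply because of its units.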

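The effect of the number of starting seeds (item 6) can be sketched as follows: run a plain K-Means from several random starting points and keep the run with the lowest cost. This is a minimal Python sketch under stated assumptions, not the tool's implementation. The function names are hypothetical, and using the within-cluster sum of squared distances as the measure of the "best" solution is an assumption.

```python
import numpy as np

def kmeans_once(points, k, rng, n_iter=50):
    """One K-Means run from k randomly chosen records; returns (cost, centroids, labels)."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Keep the old centroid if a cluster happens to be empty.
        new = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    cost = sq_dist[np.arange(len(points)), labels].sum()
    return cost, centroids, labels

def best_of_seeds(points, k, n_seeds, seed=0):
    """Repeat the run n_seeds times and keep the solution with the lowest cost."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_seeds)]
    return min(runs, key=lambda run: run[0])

# Made-up data: two small groups of records.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
cost, centroids, labels = best_of_seeds(points, k=2, n_seeds=5)
print(cost, labels)
```

Because the best run is selected from all runs, adding seeds can only keep the cost the same or lower it, at the price of more computation.
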
Use the Plot Options tab to set the controls for the plot.

  1. Plot points: If checked, all points in the data will be plotted, and represented by the cluster number each point is assigned to in the solution.
  2. Plot centroids: If checked, cluster centroids will be plotted, each represented by the number of its cluster.
  3. The highest number of dimensions to include in biplots: A biplot is a means of visualizing a clustering solution (via principal components) in a lower-dimensional space. The visualization is done two dimensions at a time. This option sets the upper limit on the number of dimensions to use in the visualization. For example, if this parameter is set to "3", then biplots will include the first and second, first and third, and second and third principal components in three separate figures.
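
To make the biplot option concrete, the sketch below lists the pairs of principal components that would be plotted for a given upper limit, and computes principal component scores with a singular value decomposition. It is a Python illustration only. The function names are hypothetical, and the tool's actual plotting code is not shown here.

```python
from itertools import combinations

import numpy as np

def biplot_pairs(max_dims):
    """All pairs of principal components (1-based) up to max_dims."""
    return list(combinations(range(1, max_dims + 1), 2))

def pc_scores(data, n_components):
    """Project centered data onto its first n_components principal components."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

print(biplot_pairs(3))  # [(1, 2), (1, 3), (2, 3)]
```

Each pair corresponds to one figure, so an upper limit of 3 yields three figures and an upper limit of 4 yields six.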

Use the Graphics Options tab to set the controls for the output.

  • Plot size: Select inches or centimeters for the size of the graph.
  • Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.

  • Base font size (points): Select the size of the font in the graph.
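
The trade-off between plot size and resolution can be estimated with simple arithmetic. The Python sketch below assumes that the pixel dimensions of the output equal the size in inches multiplied by the dots per inch, with centimeters converted at 2.54 cm per inch. The function name and its parameters are hypothetical and are not part of the tool.

```python
def pixel_size(width, height, units="in", scale=1):
    """Approximate pixel dimensions for a plot.

    scale is 1, 2, or 3, matching the 1x (96 dpi), 2x (192 dpi), and
    3x (288 dpi) choices. Assumes pixels = inches * dpi.
    """
    dpi = 96 * scale
    if units == "cm":
        width, height = width / 2.54, height / 2.54
    return round(width * dpi), round(height * dpi)

print(pixel_size(5, 5, units="in", scale=2))  # (960, 960)
```

Under this assumption, going from 1x to 3x triples each pixel dimension and so produces roughly nine times as many pixels for the same physical size.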

Output

Connect a Browse tool to each output anchor to view results.

  • O anchor: A table containing the serialized model, with the model name and the size of the object.
  • R anchor: The report snippets generated by the K-Centroids Cluster Analysis Tool: a statistical summary and cluster solution plots.

*en.wikipedia.org/wiki/K-means_clustering
**en.wikipedia.org/wiki/K-medians_clustering
***en.wikipedia.org/wiki/Neural_gas