K-Centroids Cluster Analysis Tool

K-Centroids represents a class of algorithms for partitioning cluster analysis. These methods take the records in a database and divide (partition) them into the “best” K groups according to some criterion. Nearly all partitioning cluster analysis methods base cluster membership on the proximity of each record to one of K points (or “centroids”) in the data. For a pre-specified number of clusters, the objective is to find the centroid locations that optimize some criterion defined on the distances between each cluster’s centroid and the points assigned to that cluster. The specific algorithms differ from one another in both the criterion used to define a cluster centroid and the distance measure used to define the proximity of a point to its cluster’s centroid.

Three specific types of K-Centroids cluster analysis can be carried out with this tool: K-Means, K-Medians, and Neural Gas clustering. K-Means uses the mean value of the fields for the points in a cluster to define a centroid, and Euclidean distance is used to measure a point’s proximity to a centroid.* K-Medians uses the median value of the fields for the points in a cluster to define a centroid, and Manhattan (also called city-block) distance is used to measure proximity.** Neural Gas clustering is similar to K-Means in that it uses the Euclidean distance between a point and the centroids to assign that point to a particular cluster.*** However, the method differs from K-Means in how the cluster centroids are calculated: the centroid of a cluster is a weighted average of all data points. Points assigned to the cluster whose centroid is being constructed (the focal cluster) receive the greatest weight, points from the cluster most distant from the focal cluster receive the lowest weight, and the weights given to points in intermediate clusters decrease as the distance between the focal cluster and the cluster to which a point is assigned increases.
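
The differences between the three methods can be sketched in a few lines of Python. This is an illustration only, not the tool's R implementation: the function names, the sample data, and the exponential rank weighting with the hypothetical decay parameter `lam` are assumptions. The Neural Gas step ranks centroids by their distance from each point (a common batch formulation), which gives the same qualitative pattern as the description above: nearby points weigh heavily, distant points weigh little.

```python
import math
from statistics import mean, median

def euclidean(p, q):
    """Straight-line distance (used by K-Means and Neural Gas)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """City-block distance (used by K-Medians)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(point, centroids, dist):
    """Index of the centroid closest to `point` under `dist`."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))

def update_centroid(members, center):
    """Apply `center` (mean or median) field by field to a cluster's points."""
    return [center(field) for field in zip(*members)]

def neural_gas_update(points, centroids, lam=1.0):
    """One batch step: each centroid becomes a rank-weighted average of ALL points.

    Assumption: weight = exp(-rank / lam), where rank 0 is the centroid
    closest to the point. `lam` is a hypothetical name for the decay rate.
    """
    k, dims = len(centroids), len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    totals = [0.0] * k
    for p in points:
        order = sorted(range(k), key=lambda j: euclidean(p, centroids[j]))
        for rank, j in enumerate(order):
            w = math.exp(-rank / lam)
            totals[j] += w
            for d in range(dims):
                sums[j][d] += w * p[d]
    return [[s / totals[j] for s in sums[j]] for j in range(k)]

# Made-up data: the third point is an outlier within the first cluster.
points = [[1.0, 1.0], [2.0, 1.0], [9.0, 1.0], [10.0, 10.0], [11.0, 10.0]]
centroids = [[1.0, 1.0], [10.0, 10.0]]

labels = [nearest(p, centroids, euclidean) for p in points]
cluster0 = [p for p, lab in zip(points, labels) if lab == 0]

mean_centroid = update_centroid(cluster0, mean)      # K-Means style
median_centroid = update_centroid(cluster0, median)  # K-Medians style
print(mean_centroid)    # [4.0, 1.0]
print(median_centroid)  # [2.0, 1.0]
```

In this example the outlier at x = 9 pulls the mean-based centroid toward it, while the median-based centroid stays with the bulk of the cluster. As `lam` approaches zero, the Neural Gas sketch reduces to the K-Means update, because only the closest centroid receives meaningful weight from each point.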

This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.

Input

An Alteryx data stream.

Configuration Properties

  1. Solution name: Each cluster solution needs to be given a name so it can be identified later. Solution names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case sensitive.
  2. Fields (select one or more): Select the numeric fields to be used in constructing the cluster solution.
  3. Standardize the fields...: Selecting this option gives the user the choice of standardizing the variables using either a z-score or a unit interval standardization.
  4. Clustering method: Choose one of K-Means, K-Medians, or Neural Gas.
  5. Number of clusters: Select the number of clusters in the solution.
  6. Number of starting seeds: K-Centroids methods start by taking randomly selected points as the initial centroids. The final solution determined by each of the methods can be influenced by the initial points. If multiple starting seeds are used, the best solution out of the set of solutions is kept as the final solution.
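
Two of the options above, standardization and multiple starting seeds, can be illustrated with a short Python sketch. It is not the tool's implementation: the function names are hypothetical, the z-score assumes the sample standard deviation, "best" is assumed to mean the lowest total squared distance from points to their centroids, and a plain K-Means loop stands in for whichever method is selected.

```python
import random
from statistics import mean, stdev

def z_score(column):
    """Rescale to mean 0 and standard deviation 1 (fails on a constant column)."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

def unit_interval(column):
    """Rescale to the range [0, 1] (fails on a constant column)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, iterations=20):
    """One K-Means run from randomly chosen starting points; returns (cost, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[j].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    cost = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return cost, centroids

def best_of_seeds(points, k, n_seeds):
    """Run once per seed and keep the solution with the lowest cost."""
    runs = [kmeans_once(points, k, random.Random(seed)) for seed in range(n_seeds)]
    return min(runs, key=lambda run: run[0])

print(z_score([2.0, 4.0, 6.0]))
print(unit_interval([2.0, 4.0, 6.0]))

# Made-up data: two tight groups, one near x = 0 and one near x = 10.
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
cost, centers = best_of_seeds(data, k=2, n_seeds=10)
print(cost, centers)
```

With this data, a run whose two starting points both fall in the same group converges to a poor solution that splits the data by the second field instead of the first. Keeping the best of several seeds makes that outcome much less likely.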

Plot Options

  1. Plot points: If checked, all points in the data will be plotted, and represented by the cluster number each point is assigned to in the solution.
  2. Plot centroids: If checked, cluster centroids will be plotted, each represented by the number of the cluster for which it is the centroid.
  3. The highest number of dimensions to include in biplots: A biplot is a means of visualizing a clustering solution (via principal components) in a lower-dimensional space. The plotting is done two dimensions at a time. This option sets the upper limit on the number of principal components to use in the visualization. For example, if this parameter is set to "3", then biplots will include the first and second, first and third, and second and third principal components in three separate figures.
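
The pairing rule for biplots can be enumerated directly. The helper name `biplot_pairs` is hypothetical and only illustrates which principal-component pairs a given setting would produce:

```python
from itertools import combinations

def biplot_pairs(max_dimension):
    """All pairs of principal components up to `max_dimension`, one biplot per pair."""
    return list(combinations(range(1, max_dimension + 1), 2))

print(biplot_pairs(3))       # [(1, 2), (1, 3), (2, 3)]
print(len(biplot_pairs(4)))  # 6
```

The number of figures grows as n(n - 1)/2, so a setting of 4 yields six biplots and a setting of 5 yields ten.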

Graphics Options

  1. Plot size: Specify the width and height dimensions of the resulting plot, using either inches or centimeters.
  2. Graph resolution: The resolution (in dots per inch) of any plot(s) produced by the macro. The 1x resolution is best for reports intended to be viewed exclusively on a computer screen (e.g., HTML reports), while 3x resolution is best for PDF files or formats intended to be printed. In some cases, 3x resolution is set to 576 dpi to improve plot clarity.
  3. Base font size (points): The point size of the base font used to produce the title and labels of the plot(s). The plotting functions automatically make the plot title larger than the base font.

Output

There are 2 output streams:

*en.wikipedia.org/wiki/K-means_clustering
**en.wikipedia.org/wiki/K-medians_clustering
***en.wikipedia.org/wiki/Neural_gas