Find Nearest Neighbors Tool

The Find Nearest Neighbors tool finds the selected number of nearest neighbors in the "data" stream for each record in the "query" stream, based on Euclidean distance. The tool offers a choice of algorithms for finding the nearest neighbors; these differ in their speed and possible accuracy. The default is the KD-Tree algorithm, which generally offers a good combination of speed and accuracy. In addition, the calculations can be based on either the original data or standardized data, using either a z-score standardization (which results in all fields having a mean of zero and a standard deviation of one) or a unit-interval transformation (in which the values of each field range from zero to one). Some form of field standardization is recommended with this tool, since Euclidean distance calculations are very sensitive to differences in field scales (e.g., untransformed household income and age data have very different levels and ranges). Given the nature of this method, only numeric fields can be used as inputs. The tool makes use of the R FNN package.

This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.

Inputs

Two Alteryx data streams. The right stream is the "query" stream, which contains the rows for which the selected number of nearest neighbors are found in the left stream (the "data" stream).

Configuration Properties

Unique key field: A unique key is needed for this tool in order to identify the relationships between records in the query and data streams.
Fields (select two or more): Select the numeric fields to be used in calculating the distances between records.
Standardize the fields...: Selecting this option gives the user the choice of standardizing the fields using either a z-score or a unit-interval standardization.

The z-score transformation involves subtracting the mean value of each field from the values of the field and then dividing by the standard deviation of the field. This results in a new field that has a mean of zero and a standard deviation of one.
The unit interval transformation involves subtracting the minimum value of a field from the field values and then dividing by the difference between the maximum and minimum value of the field. This results in a new field that has values that range from zero to one. K nearest neighbor calculations are very sensitive to the scaling of the data, particularly if one field is on a very different scale than another. As a result, scaling the data is something that should be considered.
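
The two transformations described above can be sketched in a few lines of Python. This is only an illustration, not the tool's own implementation: the function names and the sample income and age values are hypothetical, and the sketch assumes the sample standard deviation (dividing by n - 1), which the description above does not specify.

```python
from statistics import mean, stdev


def z_score(values):
    """Subtract the field mean, then divide by the field standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - m) / s for v in values]


def unit_interval(values):
    """Subtract the field minimum, then divide by (maximum - minimum)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical household income and age values on very different scales.
income = [30000.0, 50000.0, 70000.0, 90000.0]
age = [25.0, 35.0, 45.0, 55.0]

income_z = z_score(income)       # mean of zero, standard deviation of one
age_unit = unit_interval(age)    # values range from zero to one
```

Both functions divide by zero if a field is constant, so a field with a single repeated value would need separate handling.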

Number of near neighbors to find: The default and minimum number is one (the nearest) near neighbor. The maximum is one hundred.
The algorithm to use for finding the nearest neighbors: Choose one of Cover Tree*, KD-Tree**, VR (the method used by Venables and Ripley, 2002***), CR (a version of the VR algorithm based on a modified distance measure), or Linear search (which involves calculating the distance from each point in the query stream to all the points in the data stream). The methods differ in their computation time and accuracy. The default algorithm is the KD-Tree, which generally has both good computation time and accuracy. Linear search is guaranteed to find the true nearest neighbors, but has a very high computation cost.
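
As an illustration of the Linear search option, the following Python sketch computes the Euclidean distance from each query point to every data point and keeps the closest ones. It is not the FNN package's code; the function name, the dictionary layout keyed by unique key, and the sample points are all assumptions made for the example.

```python
import math


def linear_search_knn(data, query, k):
    """For each query key, return the k nearest data records as
    (data_key, distance) pairs, nearest first."""
    result = {}
    for q_key, q_point in query.items():
        # Distance from this query point to every point in the data stream.
        dists = [(math.dist(q_point, d_point), d_key)
                 for d_key, d_point in data.items()]
        dists.sort()  # ascending by distance
        result[q_key] = [(d_key, dist) for dist, d_key in dists[:k]]
    return result


# Hypothetical records: unique key -> tuple of numeric field values.
data = {"A": (3.0, 4.0), "B": (1.0, 0.0), "C": (6.0, 8.0)}
query = {"Q1": (0.0, 0.0)}

neighbors = linear_search_knn(data, query, k=2)
# neighbors["Q1"] is [("B", 1.0), ("A", 5.0)]
```

Every query record is compared with every data record, so the work grows with the product of the two stream sizes, which is the high computation cost mentioned above.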

Outputs

N Output: A table that gives, for each point in the query stream (identified by the unique key of the query record), the unique key of and the distance to each of the desired number of near neighbors. If the desired number of near neighbors is two, and the unique key field name is ID, then this output data stream will have the fields ID, ID_1 (the unique key for the closest near neighbor), Dist_1 (the Euclidean distance to the closest near neighbor), ID_2 (the unique key for the second closest near neighbor), and Dist_2 (the Euclidean distance to the second closest near neighbor).
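
The field naming of the N Output can be illustrated with a short Python sketch that builds one output row from a list of neighbors. The helper name and the sample keys and distances are hypothetical; only the field names ID, ID_1, Dist_1, ID_2, and Dist_2 come from the description above.

```python
def n_output_row(key_name, query_key, neighbors):
    """Build one N Output row: the query key, followed by a key field and a
    distance field for each neighbor, nearest first."""
    row = {key_name: query_key}
    for rank, (neighbor_key, distance) in enumerate(neighbors, start=1):
        row[f"{key_name}_{rank}"] = neighbor_key
        row[f"Dist_{rank}"] = distance
    return row


# Hypothetical query record "Q1" with its two nearest neighbors.
row = n_output_row("ID", "Q1", [("B", 1.0), ("A", 5.0)])
# row has the fields ID, ID_1, Dist_1, ID_2, Dist_2, in that order
```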
M Output: Provides the unique key field, the standardized data values, and an indicator (the __Type__ field) of whether a record is in the data or query streams for all records from both the data and query streams.

*en.wikipedia.org/wiki/Cover_tree
**en.wikipedia.org/wiki/K-d_tree
***Venables, W. N. and Ripley, B. D. (2002), Modern Applied Statistics with S, 4th ed., Springer, Berlin.