Find Nearest Neighbors Tool
The Find Nearest Neighbors tool finds the selected number of nearest neighbors in the "data" stream for each record in the "query" stream, based on Euclidean distance. The tool offers a choice of algorithms for finding the nearest neighbors that differ in their speed and possible accuracy. The default is the KD-Tree algorithm, which generally offers a good combination of speed and accuracy. In addition, you can base the calculations on either the original data or standardized data, using either a z-score standardization (which results in all fields having a mean of 0 and a standard deviation of 1) or a unit-interval transformation (in which the values of each field range from 0 to 1).
Some form of field standardization is recommended with this tool, because Euclidean distance calculations are very sensitive to differences in field scales (for example, untransformed household income and age data have very different levels and ranges). Because the method relies on distance calculations, only numeric fields can be used as inputs. The tool makes use of the R FNN package.
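To illustrate the scale sensitivity described above, here is a minimal Python sketch. It is not the tool's implementation (the tool uses the R FNN package); the records, values, and variable names are hypothetical and chosen only to show how an unscaled income field can dominate the distance.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: (household income in dollars, age in years)
query = (50_000, 30)
similar_age = (52_000, 31)     # $2,000 more income, 1 year older
similar_income = (50_500, 70)  # $500 more income, 40 years older

d_age = euclidean(query, similar_age)
d_income = euclidean(query, similar_income)

# On the raw data the 70-year-old is the "nearer" record, because the
# income differences are numerically far larger than the age differences.
print(d_age, d_income)
```

With untransformed fields, the 40-year age gap barely affects the result, which is why standardizing the fields first is usually worth considering.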
This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool. Go to Download and Use Predictive Tools for more information.
The tool accepts 2 Alteryx data streams:
- D anchor: Accepts the "data" stream. For each record in the query stream (Q input), the tool finds the selected number of nearest neighbors in the data stream.
- Q anchor: Accepts the "query" stream.
Configure the Tool
- Unique key field: A unique key is needed for this tool in order to identify the relationships between records in the query and data streams.
- Fields (select two or more): Select the numeric fields to use in calculating the distances between records.
- Standardize the fields...: Select this option to standardize the fields using either a z-score or unit-interval standardization. K nearest neighbor calculations are very sensitive to the scaling of the data, particularly if one field is on a very different scale than another. As a result, consider scaling the data.
- z-score standardization: The z-score transformation involves subtracting the mean value of each field from the values of the field and then dividing by the standard deviation of the field. This results in a new field that has a mean of zero and a standard deviation of one.
- Unit-interval standardization: The unit-interval transformation involves subtracting the minimum value of a field from the field values and then dividing by the difference between the maximum and minimum values of the field. This results in a new field that has values that range from zero to one.
- The number of near neighbors to find: The default (and minimum) is 1, which returns only the nearest neighbor. The maximum is 100.
- The algorithm to use for finding the nearest neighbors: The methods differ in their computation time and accuracy. The default algorithm is the KD-Tree, which generally has both good computation time and accuracy. Linear search is guaranteed to find the true nearest neighbors but has a very high computation cost. Choose one of...
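The two standardization options and the linear search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, the z-score uses the sample standard deviation (the tool's exact choice is not stated here), and a field with a single constant value would cause a division by zero in either transformation.

```python
import math
import statistics

def z_score(values):
    """Subtract the field mean, then divide by the field standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]

def unit_interval(values):
    """Subtract the field minimum, then divide by (maximum - minimum)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def linear_search(query_point, data_points, k):
    """Exhaustively compare the query point with every data point and
    return (index, distance) pairs for the k nearest, closest first."""
    ranked = sorted(
        (math.dist(query_point, p), i) for i, p in enumerate(data_points)
    )
    return [(i, d) for d, i in ranked[:k]]

ages = [20, 30, 40]
z = z_score(ages)        # [-1.0, 0.0, 1.0]
u = unit_interval(ages)  # [0.0, 0.5, 1.0]

nearest = linear_search((0, 0), [(3, 4), (1, 0), (0, 2)], k=2)
print(z, u, nearest)
```

The linear search computes one distance per data record for every query record, which is why it always finds the true nearest neighbors but becomes expensive on large data streams.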
View the Output
- N anchor: Consists of a table that gives, for each point in the query stream (identified by the unique key of the query record), the unique key value of and the distance to each of the desired number of near neighbors. If the desired number of near neighbors is 2, and the unique key field name is ID, then this output data stream has the fields ID, ID_1 (the unique key for the closest near neighbor), Dist_1 (the Euclidean distance to the closest near neighbor), ID_2 (the unique key for the second closest near neighbor), and Dist_2 (the Euclidean distance to the second closest near neighbor).
- M anchor: For all records from both the data and query streams, provides the unique key field, the standardized data values, and an indicator (the __Type__ field) of whether the record is in the data or query stream.
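The field layout of the N anchor described above can be sketched in Python. The helper name, key values, and distances here are hypothetical; only the field naming pattern (ID, ID_1, Dist_1, ID_2, Dist_2) comes from the description of the N anchor.

```python
def neighbor_row(key_field, query_key, neighbors):
    """Build one N-anchor-style record.

    neighbors is a list of (data_key, distance) pairs, nearest first.
    """
    row = {key_field: query_key}
    for rank, (data_key, distance) in enumerate(neighbors, start=1):
        row[f"{key_field}_{rank}"] = data_key  # key of the rank-th neighbor
        row[f"Dist_{rank}"] = distance         # distance to that neighbor
    return row

# Hypothetical query record "Q1" with its two nearest data records.
row = neighbor_row("ID", "Q1", [("D7", 0.4), ("D2", 1.3)])
print(list(row))  # ['ID', 'ID_1', 'Dist_1', 'ID_2', 'Dist_2']
```

Each additional requested neighbor adds one more key field and one more distance field to every row.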