The Spline Model tool provides the multivariate adaptive regression splines (MARS) algorithm of Friedman.* This method is a modern statistical learning model that (1) self-determines which subset of fields best predicts a target field of interest; (2) can capture highly nonlinear relationships and interactions between fields; and (3) can automatically address a broad range of regression and classification problems in a way that is transparent to the user. The user can do as little as specify a target field and a set of predictor fields, but advanced users can fine-tune the tool extensively.
The tool is applicable to a wide range of problems, such as classification, count data, and continuous target regression problems. The method develops a model in two steps. The first step, known as the forward pass, is similar to the recursive partitioning algorithm used in the Decision Tree tool: it determines the variables that matter most in predicting the target and finds appropriate "split points" (known as "knots") in those variables. However, unlike a decision tree, which uses discrete jumps, the method fits a line (called a term) between adjacent knots. This results in the construction of a piecewise linear function for each variable that can closely approximate any relationship between the target and the predictor variables. The second step, known as the backward or pruning pass, removes some of the knots in the variables (elongating the line segments in the remaining terms) to reduce the chance that the model overfits the estimation sample and captures sample noise rather than the underlying signal.
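To make the idea of knots and terms concrete, the following Python sketch searches for a single knot in one predictor by fitting a pair of hinge functions at each candidate location and keeping the knot with the smallest residual sum of squares. It is a simplified illustration only, not the tool's implementation: the function names (`hinge`, `solve`, `fit_with_knot`) and the toy data are assumptions made for this example, and a real forward pass also handles multiple predictors, interactions, and many knots.

```python
def hinge(x, knot, direction=+1):
    """Hinge basis function: max(0, x - knot) if direction is +1, max(0, knot - x) if -1."""
    return max(0.0, direction * (x - knot))


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        partial = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - partial) / m[r][r]
    return coef


def fit_with_knot(xs, ys, knot):
    """Least-squares fit of y ~ b0 + b1*max(0, x-knot) + b2*max(0, knot-x); returns (coef, rss)."""
    rows = [[1.0, hinge(x, knot, +1), hinge(x, knot, -1)] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coef = solve(xtx, xty)
    rss = sum((y - sum(c * v for c, v in zip(coef, r))) ** 2 for r, y in zip(rows, ys))
    return coef, rss


# Toy data (hypothetical): the slope changes at x = 3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 + 1.5 * hinge(x, 3.0, +1) + 0.5 * hinge(x, 3.0, -1) for x in xs]

# Only interior values are candidates, so both hinge columns are non-zero.
candidates = sorted(xs)[1:-1]
fits = {knot: fit_with_knot(xs, ys, knot) for knot in candidates}
best_knot = min(fits, key=lambda knot: fits[knot][1])
best_coef, best_rss = fits[best_knot]
```

Because each term is a line segment rather than a constant, the fitted function changes slope at the knot instead of jumping, which is the contrast with a decision tree described above.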
This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.
An Alteryx data stream that includes a target field of interest along with one or more possible predictor fields.
Model name: Each model needs to be given a name so it can later be identified. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case sensitive.
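The naming rule can be checked ahead of time. The Python sketch below encodes the rule exactly as stated above in a regular expression; the pattern and the helper name `is_valid_model_name` are assumptions for illustration and are not taken from the tool itself.

```python
import re

# Assumed pattern derived from the stated rule: a leading letter, then
# letters, digits, periods, or underscores.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name: str) -> bool:
    """Return True if the whole name matches the stated naming rule."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


examples = ["Spline_Model.1", "1st_model", "my model", "my-model"]
results = {name: is_valid_model_name(name) for name in examples}
```

Because R is case sensitive, names such as `Spline_Model` and `spline_model` would refer to different models.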
Select the target field: Select the field from the data stream you want to predict.
Select the predictor fields: Choose the fields from the data stream you believe "cause" changes in the value of the target variable.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
Include effects plots?: If checked, then effects plots will be produced that graphically show the relationship between the predictor variable and the target at fixed levels (the median for numeric predictors, the first level for factors) of other predictor fields. There are options to display only the fields that have a main effect on the target, only the two-way interaction effects between fields using a perspective plot, or both the main effects and the two-way interactions.
Specify target type and the GLM family: There are five types of target fields supported:
Each type of target field can have one or more possible associated distribution functions (which is related to the measure the algorithm is attempting to minimize).
Scale the target variable: If the target variable is a continuous variable, and this option is selected, then it will be subjected to a z-score (mean zero, standard deviation of one) transformation to help with numeric stability in the forward pass (first stage) of the algorithm.
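As a rough illustration of the z-score transformation, the Python sketch below centers a target on its mean and divides by its standard deviation. The helper name `z_score` and the sample values are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the tool's exact computation is not documented here.

```python
from statistics import mean, stdev


def z_score(values):
    """Return (scaled values, mean, standard deviation) for a numeric target."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1); an assumption
    scaled = [(v - mu) / sigma for v in values]
    return scaled, mu, sigma


target = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical target values
scaled, mu, sigma = z_score(target)

# Predictions made on the scaled target can be mapped back to original units.
restored = [s * sigma + mu for s in scaled]
```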
The maximum number of knots or determine automatically (Auto): This option controls the number of possible knots (splits) in the predictor fields in the forward pass (first stage) of the algorithm. If "Auto" is selected, the number of knots is calculated based on the number of predictor fields. The actual number of knots in the forward pass will often be less than the maximum allowed.
Interaction depth: The level of interaction between predictor fields.
Penalty per term or knot: The function to be optimized contains a penalty component to decrease the possibility that the final model overfits the estimation sample data. The default is a value of 2 for a main effects only model, and 3 if two-way or higher interactions are allowed. A value of -1 results in no penalty being applied for knots or terms, while a value of 0 applies the default penalty only to terms.
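The effect of the penalty can be sketched with the generalized cross-validation (GCV) criterion as it is commonly stated for MARS (for example, in the Wikipedia article cited below). The Python function is an illustration under that assumption, with hypothetical names and numbers; it is not extracted from the tool.

```python
def gcv(rss, n_obs, n_terms, penalty):
    """GCV as commonly stated for MARS (assumed here to match the tool's criterion).

    effective parameters = n_terms + penalty * (n_terms - 1) / 2,
    where (n_terms - 1) / 2 approximates the number of knots.
    """
    if penalty == -1:
        # No penalty for terms or knots: GCV reduces to the mean squared residual.
        return rss / n_obs
    n_knots = (n_terms - 1) / 2
    effective_params = n_terms + penalty * n_knots
    return rss / (n_obs * (1 - effective_params / n_obs) ** 2)


# Hypothetical model: residual sum of squares 100 on 50 records with 5 terms.
scores = {p: gcv(100.0, 50, 5, p) for p in (-1, 0, 2, 3)}
```

With the residual sum of squares held fixed, a larger penalty gives a larger GCV value, so a model must fit noticeably better to justify extra knots.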
The minimal improvement in R-Squared needed to add an additional knot: The higher the value of this term, the greater the improvement in R-Squared must be for the model to add a knot.
The minimum distance between knots: If 0 is selected, the minimum distance allowed is calculated based on a formula. A value of 1 allows any value of a predictor variable to be a knot (which only works well if the data has very little noise). Otherwise, provide a numeric value between 0 and 1, which gives the distance as a proportion of the range of a predictor variable.
New variable penalty: The additional penalty term appended to the objective function for the addition of a new variable to the model. The default is 0 (none), and this value can range from 0 to 0.5. As with the penalty per knot or term, the purpose is to control for overfitting.
The maximum number of parent terms considered at each step in the forward pass: This option limits the number of parent terms considered at each step of the forward pass, which can speed up execution. A special value of 0 places no limit on the number of terms, while a number greater than 0 specifies the maximum number of terms. The default is 20 terms; common values are 20, 10, and 5.
The fast MARS aging coefficient: See Section 3.1 of Friedman (1993) for an explanation of this parameter.***
Perform a cross validation analysis: This option allows for a cross validation assessment as to whether sufficient pruning has taken place relative to the generalized cross-validation method used by the algorithm. If this option is selected, then the user can specify the number of separate cross validation runs, the number of folds in each cross validation run, whether the cross validation samples are stratified to have a comparable mix of responses for a categorical target (e.g., a comparable number of "yes" and "no" responses for a binary categorical variable), and the random seed value for the random numbers generated to create the samples.
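The idea behind stratified folds with a fixed random seed can be sketched in Python as follows. The function `stratified_folds` and the example labels are hypothetical; the sketch shows the general technique only and does not reproduce how the tool builds its samples.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed):
    """Assign record indices to folds so each fold has a similar class mix."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Hypothetical binary target: 12 "yes" and 8 "no" responses, split into 4 folds.
labels = ["yes"] * 12 + ["no"] * 8
folds = stratified_folds(labels, n_folds=4, seed=1)
fold_mix = [
    (sum(labels[i] == "yes" for i in fold), sum(labels[i] == "no" for i in fold))
    for fold in folds
]
```

Repeating the split with different seeds corresponds to running several separate cross validation runs, each with its own set of folds.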
The pruning method: The choices are "Backward elimination", "None", "Exhaustive search", "Forward selection", and "Sequential replacement".
The maximum number of terms in the pruned model: If 0 is selected (the default), all terms that remain after the other pruning criteria are applied are used in the final model. Otherwise, only the most important terms, up to the selected number, are retained in the final model.
Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
O Output: Consists of a table of the serialized model with its model name.
R Output: Consists of the report snippets generated by the Spline Model tool: a basic model summary, a Variable Importance Plot (which indicates the relative importance of the different predictor fields), a Basic Model Diagnostics Plot, and (optionally) the Effects Plots.
*https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines
**Friedman, Jerome H., "Multivariate Adaptive Regression Splines", Stanford University, August 1990
©2018 Alteryx, Inc., all rights reserved. Allocate®, Alteryx®, Guzzler®, and Solocast® are registered trademarks of Alteryx, Inc.