Spline Model Tool

The Spline Model tool provides the multivariate adaptive regression splines (MARS) algorithm of Friedman.* This method is a modern statistical learning model that: (1) self-determines which subset of fields best predicts a target field of interest; (2) can capture highly nonlinear relationships and interactions between fields; and (3) can automatically address a broad range of regression and classification problems in a way that can be transparent to the user. The user can do as little as specify a target field and a set of predictor fields, but advanced users can fine-tune the tool extensively.

The tool is applicable to a wide range of problems, such as classification, count data, and continuous target regression problems. The method uses a two-step approach to develop a model. In the first step (known as the forward pass, which is similar to the recursive partitioning algorithm used in the Decision Tree tool), the variables that matter most in predicting the target are determined, and appropriate "split points" (known as "knots") in the variables are found. However, unlike in a decision tree, a line (called a term) is fit between adjacent knots rather than using discrete jumps. This results in the construction of a piecewise linear function for each variable that can closely approximate any relationship between the target and the predictor variables. The second step (known as the backward or pruning pass) removes some of the knots in the variables (elongating the line segments in the remaining terms) to reduce the chance that the model overfits the estimation sample and captures its noise rather than the underlying signal.
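The piecewise linear functions described above are usually built from "hinge" terms in the MARS literature, each of which is zero on one side of a knot and linear on the other. The following is a minimal sketch of that idea, not the tool's implementation; the function names, the knot, and the coefficients are hypothetical.

```python
def hinge(x, knot, direction=+1):
    """Return max(0, x - knot) for direction=+1, or max(0, knot - x) for direction=-1."""
    return max(0.0, direction * (x - knot))


def piecewise_linear(x, intercept, terms):
    """Evaluate intercept + sum(coef * hinge) over (coef, knot, direction) terms."""
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)


# Hypothetical fitted model with a single knot at x = 10:
# flat below the knot, rising with slope 2 above it.
terms = [(2.0, 10.0, +1)]
values = [piecewise_linear(x, 5.0, terms) for x in (0.0, 10.0, 15.0)]
print(values)  # [5.0, 5.0, 15.0]
```

Removing a knot in the pruning pass corresponds to dropping one of these terms, which lengthens the straight segment that remains.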

This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.

Input

An Alteryx data stream that includes a target field of interest along with one or more possible predictor fields.

Configuration Properties

Required Parameters

• Model name: Each model needs to be given a name so it can later be identified. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case sensitive.
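The naming rule above can be checked before a workflow is run. This is a small sketch under the assumption that "letters" means ASCII letters; the function name is hypothetical and is not part of the tool.

```python
import re

# Assumed pattern derived from the rule above: a leading letter, then
# letters, digits, periods, or underscores.
_MODEL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name):
    """Return True if `name` satisfies the naming rule described above."""
    return _MODEL_NAME.fullmatch(name) is not None


print(is_valid_model_name("Spline_1"))   # True
print(is_valid_model_name("1st model"))  # False
```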

• Select the target field: Select the field from the data stream you want to predict.

• Select the predictor fields: Choose the fields from the data stream you believe "cause" changes in the value of the target variable.

Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.

• Include effects plots?: If checked, then effects plots will be produced that graphically show the relationship between the predictor variable and the target at fixed levels (the median for numeric predictors, the first level for factors) of other predictor fields. There are options to display only the fields that have a main effect on the target, only the two-way interaction effects between fields using a perspective plot, or both the main effects and the two-way interactions.
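To illustrate what "fixed levels" means for a main-effect plot, the sketch below builds the grid of input rows over which a model would be evaluated: one field varies while every other field is held at its median (numeric) or first level (categorical). The helper name and sample data are hypothetical, and taking the first level in sorted order is an assumption about how levels are ordered.

```python
from statistics import median


def effect_grid(rows, vary, n_points=5):
    """Build rows that vary one numeric field while holding the others fixed.

    Numeric fields are held at their median; other fields are held at their
    first level in sorted order (an assumption about level ordering).
    """
    fixed = {}
    for field in rows[0]:
        column = [row[field] for row in rows]
        if all(isinstance(v, (int, float)) for v in column):
            fixed[field] = median(column)
        else:
            fixed[field] = sorted(column)[0]
    low = min(row[vary] for row in rows)
    high = max(row[vary] for row in rows)
    step = (high - low) / (n_points - 1)
    return [{**fixed, vary: low + i * step} for i in range(n_points)]


# Hypothetical estimation sample
sample = [
    {"age": 20, "income": 30.0, "region": "west"},
    {"age": 40, "income": 50.0, "region": "east"},
    {"age": 60, "income": 90.0, "region": "north"},
]
grid = effect_grid(sample, vary="age", n_points=3)
```

Predicting the target for each row of `grid` and plotting the result against `age` would give a main-effect curve of the kind described above.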

Model Customization (optional)

• Specify target type and the GLM family: There are five types of target fields supported:

• Continuous targets (e.g., numeric targets where any given unique value has a relatively small percentage of the total instances)
• Gamma distributed targets (strictly positive numeric targets that have a high percentage of fairly low response values, but a small percentage of relatively high values)
• "Count" targets (e.g., integer-valued targets for which most unique values have a fairly high percentage of the total instances, such as the number of visits to a doctor's office a person makes in a year)
• Binary categorical targets (e.g., target fields of the "yes/no" variety)
• Multinomial categorical targets (e.g., target fields with a limited number of discrete outcomes, such as "A", "B", or "C")

Each type of target field can have one or more possible associated distribution functions (which are related to the measure the algorithm attempts to minimize).

• Continuous targets can have either no explicit distribution or Gaussian (i.e., Normal) distribution.
• In the case of a Gamma distributed target, the choice is the link function to use (the relationship between the means of the distribution and linear predictor).
• Count (integer) targets minimize a loss function based on the Poisson distribution, and use either a log (preferred) or identity link function.
• Binary categorical targets can use a logit (also used in classical logistic regression), a probit, or a complementary log-log link function.
• A multinomial categorical response is treated in a nonstandard way. Specifically, instead of estimating a true multinomial model, a set of binary models (using a logit link function) is estimated. For instance, if the possible responses are "A", "B", or "C", what is estimated is a model of: "A" against any other choice, "B" against any other choice, and "C" against any other choice.

• Scale the target variable: If the target variable is a continuous variable, and this option is selected, then it will be subjected to a z-score (mean zero, standard deviation of one) transformation to help with numeric stability in the forward pass (first stage) of the algorithm.
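Two of the target transformations described above are easy to show directly: recoding a multinomial target into one binary target per class, and z-score scaling of a continuous target. This is an illustrative sketch with hypothetical function names; the use of the sample standard deviation is an assumption, since the source does not say which form the tool uses.

```python
from statistics import mean, stdev


def one_vs_rest_targets(labels):
    """Map each class to a 0/1 target: 1 for that class, 0 for any other."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def z_score(values):
    """Rescale values to mean zero and standard deviation one."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumed)
    return [(v - m) / s for v in values]


targets = one_vs_rest_targets(["A", "B", "C", "A"])
# targets["A"] marks "A" against any other choice, and so on for "B" and "C".
scaled = z_score([1.0, 2.0, 3.0])
```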

• The maximum number of knots or determine automatically (Auto): This option controls the number of possible knots (splits) in the predictor fields in the forward pass (first stage) of the algorithm. If "Auto" is selected, the number of knots is calculated based on the number of predictor fields. The actual number of knots in the forward pass will often be less than the maximum allowed.

• Interaction depth: The level of interaction between predictor fields.

• In the case of two predictor fields that have a two-way interaction with one another, then the effect that one predictor has on the target depends on the level of the second predictor.
• In case of a three-way interaction, then the effect of a predictor field on a target will depend on the values of two other predictor fields.
• Up to five-way interactions (an interaction depth of 5) can be specified. The default value of this parameter is set to 1 (an implicit assumption of no interactions between predictor fields). Increasing the interaction depth can greatly increase model run-time.

• Penalty per term or knot: The function to be optimized contains a penalty component to decrease the possibility that the final model overfits the estimation sample data. The default is a value of 2 for a main-effects-only model, and 3 if two-way or higher interactions are allowed. A value of -1 results in no penalty for knots or terms being applied, while a value of 0 applies the default penalty only to terms.
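The effect of the penalty can be sketched with the generalized cross-validation (GCV) form commonly documented for MARS implementations such as the R earth package. Whether the tool computes exactly this quantity is an assumption, and the function and argument names are hypothetical.

```python
def gcv(rss, n_obs, n_terms, penalty=2.0):
    """Generalized cross-validation score under an assumed earth-style formula.

    penalty = -1 : no penalty, so the score is rss / n_obs
    penalty =  0 : only terms are penalized, not knots
    """
    if penalty == -1:
        return rss / n_obs
    # Effective parameters: one per term plus `penalty` per knot, with the
    # knot count approximated as (n_terms - 1) / 2.
    effective = n_terms + penalty * (n_terms - 1) / 2.0
    return rss / (n_obs * (1.0 - effective / n_obs) ** 2)


# The same fit looks worse as the penalty grows (hypothetical numbers).
scores = [gcv(100.0, 50, 5, p) for p in (-1, 0, 2, 3)]
```

Because a larger penalty inflates the score of models with many terms, the pruning pass tends to keep fewer terms as the penalty increases.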

• The minimal improvement in R-Squared needed to add an additional knot: The higher this value, the larger the improvement in R-Squared must be for a knot to be added to the model.
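As a rough illustration of this stopping rule, the sketch below counts how many forward-pass additions survive a given R-Squared threshold. The function name, the 0.001 default, and the residual sums of squares are hypothetical and are not taken from the tool.

```python
def forward_pass_length(rss_by_step, threshold=0.001):
    """Return how many additions are kept before the gain in R-Squared
    drops below `threshold`.

    rss_by_step[0] is the residual sum of squares of the intercept-only
    model; later entries are the values after each additional step.
    """
    total = rss_by_step[0]
    kept = 0
    for previous, current in zip(rss_by_step, rss_by_step[1:]):
        gain = (previous - current) / total  # change in R-Squared
        if gain < threshold:
            break
        kept += 1
    return kept


# Hypothetical residual sums of squares after each forward-pass step
history = [100.0, 60.0, 40.0, 39.95, 30.0]
kept = forward_pass_length(history, threshold=0.001)
```

With these numbers the third addition improves R-Squared by only about 0.0005, so the pass stops after two additions even though a later step would have helped more.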

• The minimum distance between knots: If 0 is selected, the minimum distance allowed is calculated based on a formula. A value of 1 allows any value of a predictor variable to be a knot (which only works well if the data has very little noise). Otherwise, provide a numeric value between 0 and 1, which gives the distance as a fraction of the range of a predictor variable.

• New variable penalty: The additional penalty term appended to the objective function for the addition of a new variable to the model. The default is 0 (none), and this value can range from 0 to 0.5. As with the penalty per knot or term, the purpose is to control for overfitting.

• The maximum number of parent terms considered at each step in the forward pass: This option limits the number of parent terms considered at each step of the forward pass, which can speed up execution. A special value of 0 places no limit on the number of terms, while a number greater than 0 specifies the maximum. The default is 20; common values are 20, 10, and 5.

• The fast MARS aging coefficient: See Section 3.1 of Friedman (1993) for an explanation of this parameter.***

• Perform a cross validation analysis: This option allows for a cross validation assessment as to whether sufficient pruning has taken place relative to the generalized cross-validation method used by the algorithm. If this option is selected, then the user can specify the number of separate cross validation runs, the number of folds in each cross validation run, whether the cross validation samples are stratified to have a comparable mix of responses for a categorical target (e.g., a comparable number of "yes" and "no" responses for a binary categorical variable), and the random seed value for the random numbers generated to create the samples.
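Stratification can be pictured as dealing the rows of each response class across the folds in turn, so that every fold receives a comparable mix. The following is a minimal sketch of that idea with a seeded random generator; it is not the tool's sampling code, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed):
    """Assign each row index to a fold so every fold has a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled rows of each class across folds in turn.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


labels = ["yes"] * 6 + ["no"] * 4
folds = stratified_folds(labels, n_folds=2, seed=1)
```

Reusing the same seed reproduces the same fold assignment, which is the role of the random seed value described above.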

• The pruning method: The choices are "Backward elimination", "None", "Exhaustive search", "Forward selection", and "Sequential replacement".

• Backward elimination (the default) begins with all of the knots and terms found in the forward pass, and then removes the least predictive term first (making appropriate adjustments to the affected remaining terms), and then compares the effect on the generalized cross-validation measure relative to the full model. If the generalized cross-validation measure is not improved by removing a term, the model created after the forward pass is returned. If there is an improvement in the generalized cross-validation measure this term is removed from the model, and the process is repeated for the remaining terms. If at any point removing a term does not improve the generalized cross-validation measure relative to the model created in the last iteration, the process is terminated.
• If the choice is none, all terms found in the forward pass are used in the final model.
• In exhaustive search, all combinations of the terms found in the forward search step are examined, but at a very high computational cost.
• In forward selection, all terms except the intercept are removed, and then the best term of those found in the forward pass is determined and included in the model (assuming it improves the generalized cross-validation measure relative to an intercept-only model). This process is continued until no additional term can be added that improves the generalized cross-validation measure.
• In sequential replacement, a solution with a given number of terms has one term replaced by each of the other remaining terms found in the forward pass that are not already included in the set of terms in the pruning pass. If a new term is found that improves the generalized cross-validation measure relative to the original term, the original term is replaced by the new term.

• The maximum number of terms in the pruned model: If 0 is selected (the default), all terms that remain after the other criteria used in the pruning pass are applied are used in the final model. Otherwise, only the most important terms, up to the selected number, are retained in the final model.
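The backward elimination procedure described above can be sketched as a greedy loop. This follows the description in this section rather than any particular package's implementation; the function names are hypothetical, the intercept is assumed to be implicit, and the scores in the usage example are made up for illustration.

```python
def backward_eliminate(terms, score):
    """Greedy backward pruning as described above.

    `terms` is a list of term names (the intercept is assumed to be implicit
    and is never removed); `score(subset)` returns the generalized
    cross-validation measure for a tuple of terms, lower being better.
    """
    current = tuple(terms)
    best = score(current)
    while current:
        # Try dropping each term and keep the single best candidate.
        candidates = [tuple(t for t in current if t != drop) for drop in current]
        candidate = min(candidates, key=score)
        candidate_score = score(candidate)
        if candidate_score >= best:
            break  # no removal improves the measure, so stop
        current, best = candidate, candidate_score
    return list(current)


# Hypothetical scores for every subset of three forward-pass terms
toy_scores = {
    frozenset({"h1", "h2", "h3"}): 10.0,
    frozenset({"h1", "h2"}): 8.0,
    frozenset({"h1", "h3"}): 9.0,
    frozenset({"h2", "h3"}): 11.0,
    frozenset({"h1"}): 8.5,
    frozenset({"h2"}): 12.0,
    frozenset({"h3"}): 13.0,
    frozenset(): 20.0,
}
pruned = backward_eliminate(["h1", "h2", "h3"], lambda s: toy_scores[frozenset(s)])
print(pruned)  # ['h1', 'h2']
```

Forward selection runs the same kind of loop in the opposite direction, starting from no terms and adding the best one at each step.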

Graphics Options

• Plot size: Select inches or centimeters for the size of the graph.
• Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.

• Base font size (points): Select the size of the font in the graph.
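The trade-off between resolution and file size follows from simple arithmetic: the pixel dimensions of a plot are its size in inches multiplied by the dots per inch. The helper below is a hypothetical illustration of that calculation and is not part of the tool.

```python
CM_PER_INCH = 2.54


def plot_pixels(width, height, units="inches", dpi=96):
    """Convert a plot size in inches or centimeters to whole pixels."""
    if units == "centimeters":
        width, height = width / CM_PER_INCH, height / CM_PER_INCH
    elif units != "inches":
        raise ValueError("units must be 'inches' or 'centimeters'")
    return round(width * dpi), round(height * dpi)


print(plot_pixels(5, 4, dpi=96))   # (480, 384)
print(plot_pixels(5, 4, dpi=288))  # (1440, 1152)
```

Going from 1x to 3x triples each dimension, so the image holds nine times as many pixels.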

Output

• O Output: Consists of a table of the serialized model with its model name.

• R Output: Consists of the report snippets generated by the Spline Model tool: a basic model summary, a Variable Importance Plot (which indicates the relative importance of the different predictor fields), a Basic Model Diagnostics Plot, and (optionally) the Effects Plots.