The Stepwise tool determines the best predictor variables to include in a model out of a larger set of potential predictor variables for linear, logistic, and other traditional regression models.
There are two basic approaches to stepwise regression*. The first, backward selection, starts from a model that includes all variables thought to potentially influence the target variable, and then removes the least important variable based on a goodness-of-fit measure that adjusts for the number of variables in the model. Further variables are removed in subsequent backward steps until no removal improves the adjusted fit measure. The second, forward selection, starts with a model that includes only a constant, and then adds the one variable from the set of potential variables that provides the greatest improvement in the adjusted fit measure. Further variables are added in subsequent forward steps until no addition improves the adjusted fit measure. In backward selection, a variable that has been removed never re-enters the model, while in forward selection a variable that has been added is never removed. A hybrid approach starts with a large ("maximal") initial model and a first backward step, and then evaluates both forward and backward moves in each subsequent step.
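The search procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name `stepwise_search`, the `toy_score` function, and the numbers in `TOY_GAIN` are all hypothetical, and the score stands in for any adjusted fit measure where lower is better.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Model = FrozenSet[str]


def stepwise_search(
    maximal: Iterable[str],
    score: Callable[[Model], float],
    direction: str = "backward",
) -> Tuple[Model, float, List[Tuple[Model, float]]]:
    """Greedy search from the maximal model; a lower score is better.

    direction="backward": only removals are considered.
    direction="both": removals and re-additions are considered. The first
    step is always a removal, because nothing is left to add at the start.
    """
    if direction not in ("backward", "both"):
        raise ValueError("direction must be 'backward' or 'both'")
    candidates = frozenset(maximal)
    current = candidates
    best = score(current)
    path = [(current, best)]
    while True:
        # Every model reachable by dropping one variable...
        moves = [current - {v} for v in current]
        # ...and, in the hybrid case, by re-adding one dropped variable.
        if direction == "both":
            moves += [current | {v} for v in candidates - current]
        if not moves:
            break
        # Pick the best neighbour; ties are broken by variable names.
        challenger = min(moves, key=lambda m: (score(m), sorted(m)))
        challenger_score = score(challenger)
        if challenger_score >= best:
            break  # no move improves the adjusted fit measure
        current, best = challenger, challenger_score
        path.append((current, best))
    return current, best, path


# Hypothetical fit gains per variable, made up for illustration only.
TOY_GAIN = {"x1": 30.0, "x2": 10.0, "x3": 1.0, "x4": 0.5}


def toy_score(model: Model) -> float:
    """Lack of fit plus a penalty of 2 per variable (assumed, AIC-like)."""
    lack_of_fit = 100.0 - sum(TOY_GAIN[v] for v in model)
    return lack_of_fit + 2.0 * len(model)


final, final_score, path = stepwise_search(TOY_GAIN, toy_score, direction="both")
print(sorted(final), final_score)  # ['x1', 'x2'] 64.0
```

In this toy example the weak variables `x4` and `x3` are dropped in turn, and the search stops when neither removing nor re-adding a variable lowers the score.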
The Alteryx R-based Stepwise tool supports both backward variable selection and mixed backward and forward variable selection. To use the tool, first create a "maximal" regression model that includes all of the variables you believe could matter, and then use the Stepwise tool to determine which of these variables should be removed based on an adjusted fit measure. Two adjusted fit measures are available: the Akaike information criterion** (AIC) and the Bayesian information criterion*** (BIC). The two measures are similar, but the BIC places a larger penalty on the number of variables included in the model, which typically results in a final model with fewer variables than when the AIC is used.
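To make the difference between the two measures concrete, the sketch below uses their standard definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of records, and L the maximized likelihood. The log-likelihood values and model sizes are invented for illustration; they are not output from the tool.

```python
import math


def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: each parameter costs 2."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion: each parameter costs ln(n)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical comparison on n = 100 records (numbers are made up):
# the larger model fits slightly better but uses two more parameters.
n = 100
small = {"log_likelihood": -120.0, "k": 3}
large = {"log_likelihood": -117.5, "k": 5}

aic_prefers_large = aic(**large) < aic(**small)
bic_prefers_large = bic(n=n, **large) < bic(n=n, **small)
print(aic_prefers_large, bic_prefers_large)  # True False
```

Because ln(n) exceeds 2 once n is 8 or more, the BIC penalty per variable is the larger of the two for all but very small datasets, which is why it tends to select the smaller model here.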
If the input data comes from a regular Alteryx data stream, the applicable open source R function is used for model estimation. If the input comes from either an XDF Output or XDF Input tool, the appropriate Revo ScaleR function is used instead. The advantage of the Revo ScaleR based function is that it can analyze much larger (out of memory) datasets, but it cannot create some of the model diagnostic output that is available with the open source R functions.
This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.
The output stream from a Count Regression, Linear Regression or Logistic Regression tool used to create the "maximal" model. The stream can be entered into either side of the tool.
The same Alteryx data stream or XDF metadata stream that was used to create the "maximal" model. The stream can be entered into either side of the tool.
The name of the new model: This is the best model found using the stepwise variable selection based on the search direction and selection criteria chosen. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters (such as spaces) are allowed, and R is case sensitive.
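The naming rule above can be checked before a workflow is run. The following is a small sketch, assuming the rule is exactly as stated (ASCII letters, digits, period, and underscore, starting with a letter); the function name `is_valid_model_name` is hypothetical and not part of Alteryx.

```python
import re

# First character a letter, then any mix of letters, digits, "." and "_".
_MODEL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name: str) -> bool:
    """Return True if the name follows the model naming rule."""
    return _MODEL_NAME.fullmatch(name) is not None


print(is_valid_model_name("Stepwise_Model.1"))  # True
print(is_valid_model_name("1st_model"))         # False: starts with a digit
print(is_valid_model_name("my model"))          # False: contains a space
```

Since R is case sensitive, `Stepwise_Model` and `stepwise_model` both pass this check but refer to different models.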
Search direction: Whether the search uses both backward and forward steps (beginning with a backward step) or only backward steps.
Adjusted fit measure: The criterion used to compare different models and select the best model. The choices provided are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
O Output: Consists of a table of the serialized model with model name and the size of the object.
R Output: Consists of the report snippets generated by the Stepwise tool: a statistical summary, Type II Analysis of Deviance or ANOVA tests, and basic diagnostic plots. The Type II Analysis of Deviance or ANOVA table and the basic diagnostic plots are not produced when the data input comes from an XDF Output or XDF Input tool.
*https://en.wikipedia.org/wiki/Stepwise_regression
**https://en.wikipedia.org/wiki/Akaike_information_criterion
***https://en.wikipedia.org/wiki/Bayesian_information_criterion
©2017 Alteryx, Inc., all rights reserved. Allocate®, Alteryx®, Guzzler®, and Solocast® are registered trademarks of Alteryx, Inc.