Forest Model Tool
The Forest Model tool builds an ensemble of decision tree models that together predict a target variable based on one or more predictor variables. Each tree is constructed from a random sample of the original data, a procedure known as bootstrapping. In addition, only a limited number of variables is considered at each tree split; that number is set either automatically by R or by the user. See Random Forest.
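The two sources of randomness described above can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation (the tool calls R). The function names, the sample data, and the fallback of floor(sqrt(p)) variables per split are assumptions: that fallback is the R randomForest default for classification, and the source does not say which rule the tool applies.

```python
import math
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement: one bootstrap replicate."""
    return [rng.choice(rows) for _ in rows]


def candidate_variables(predictors, rng, mtry=None):
    """Pick the predictors that may be considered at a single split.

    When mtry is None, fall back to floor(sqrt(p)). This fallback is an
    assumption made for illustration.
    """
    if mtry is None:
        mtry = max(1, math.floor(math.sqrt(len(predictors))))
    return rng.sample(predictors, mtry)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten data records
predictors = ["age", "income", "region", "tenure"]  # hypothetical fields

sample = bootstrap_sample(rows, rng)
subset = candidate_variables(predictors, rng)
print(len(sample), len(subset))
```

Because records are drawn with replacement, some appear more than once in a replicate and others not at all, which is what makes each tree in the forest different.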
This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R Tool. See Download and Use Predictive Tools.
Connect an Alteryx data stream or XDF metadata stream that includes a target field of interest along with one or more possible predictor fields.
If the input data is from an Alteryx data stream, then the open source R randomForest function (from the randomForest package) is used for model estimation.
If the input data comes from either an XDF Output Tool or XDF Input Tool, then the RevoScaleR rxDForest function is used for model estimation. The RevoScaleR-based function can analyze much larger (out-of-memory) datasets, but it incurs additional overhead to create an XDF file, and its algorithm makes more passes over the data to create each tree in the ensemble, so it is much slower than the open source randomForest function. As a result, reducing the number of trees in the ensemble from the default of 500 is highly recommended.
- Model name: Type a name for the model to identify the model when it is referenced in other tools. Model names must start with a letter and may contain letters, numbers, and the special characters period (.) and underscore (_). No other special characters are allowed, and R is case sensitive.
- Select the target variable: Select the data to be predicted. A target variable is also known as a response or dependent variable.
- Select the predictor variables: Select the data to use to influence the value of the target variable. A predictor variable is also known as a feature or an independent variable. Any number of predictor variables can be selected, but the target variable should not also be a predictor variable. Each categorical predictor variable can have a maximum of 32 classes.
- Number of trees to use: Select the number of tree models to include in the forest. The default is 500 based on the finding of Breiman. Decrease the value with an XDF metadata stream if the length of model runtime is a concern.
- Select a specific number of variables to select between at each split: Select the number of variables to be considered at each split.
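The model naming rule above (start with a letter, then letters, numbers, periods, or underscores) can be written as a regular expression. The sketch below is a hypothetical pre-check, not part of the tool; the pattern and function name are assumptions derived only from the rule as stated.

```python
import re

# A letter first, then any mix of letters, digits, "." and "_".
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name):
    """Return True if the whole name satisfies the stated naming rule."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Forest_1.v2", "1forest", "my model", "forest-model"]:
    print(candidate, is_valid_model_name(candidate))
```

Because R is case sensitive, names such as Forest and forest pass this check but refer to different models.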
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
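Both the 32-class limit on categorical predictors and the warning about identifier columns can be checked before the data reaches the tool. The following is a rough heuristic sketch, not a feature of the tool: the function, the field names, and the rule "a text or integer column whose values are all distinct is probably an identifier" are assumptions, and a legitimate predictor could occasionally trip that rule.

```python
MAX_CLASSES = 32  # per-predictor class limit stated in the parameter list


def check_predictors(columns):
    """Return {field name: warning} for fields that look problematic.

    columns maps a field name to the list of values in that field.
    """
    warnings = {}
    for name, values in columns.items():
        distinct = len(set(values))
        all_text = all(isinstance(v, str) for v in values)
        all_ints = all(
            isinstance(v, int) and not isinstance(v, bool) for v in values
        )
        # Heuristic (assumption): all-distinct text or integer values
        # suggest a key rather than a predictor.
        if (all_text or all_ints) and len(values) > 1 and distinct == len(values):
            warnings[name] = "possible unique identifier"
        elif all_text and distinct > MAX_CLASSES:
            warnings[name] = "more than 32 classes"
    return warnings


data = {
    "customer_id": [101, 102, 103, 104],  # hypothetical surrogate key
    "region": ["N", "S", "N", "E"],
    "spend": [10.5, 20.0, 10.5, 7.25],
}
print(check_predictors(data))
```

Floating-point fields are skipped by the identifier heuristic because continuous measurements are often all distinct without being keys.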
Click Model Customization to modify the model settings.
- Directly limit the overall size of each model tree
- The total allowable nodes in a tree: Select for models that use the open source randomForest model.
- The maximum depth of a model tree: Select for models that use the RevoScaleR rxDForest function.
- The minimum number of records allowed in a tree node: Select a value to control the size of the smallest allowed terminal node in each ensemble tree. Increasing this number will reduce the total number of nodes in each tree.
- Select the records for the creation of each model with replacement: Select to control whether the bootstrap replicates are drawn from the full estimation sample with or without replacement.
- The percentage of the data records to sample from to create each tree: Select to control whether all or only part of the full estimation sample will be used for forming each bootstrap replicate.
- Plot size: Select inches or centimeters for the size of the graph.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
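The two sampling options in the list above (with or without replacement, and the percentage of records to sample) interact as in the following sketch. It is a minimal Python illustration under stated assumptions, not the code the tool runs: the function name, parameter names, and rounding rule are hypothetical.

```python
import random


def draw_replicate(records, rng, with_replacement=True, percent=100.0):
    """Draw the records used to grow one tree.

    percent controls how much of the estimation sample is drawn;
    with_replacement controls whether a record can be drawn twice.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    n = max(1, round(len(records) * percent / 100.0))
    if with_replacement:
        return rng.choices(records, k=n)  # duplicates possible
    return rng.sample(records, n)  # every drawn record is distinct


rng = random.Random(7)
records = list(range(1000))  # stand-in for 1000 data records

half = draw_replicate(records, rng, with_replacement=False, percent=50)
full = draw_replicate(records, rng, with_replacement=True, percent=100)
print(len(half), len(set(half)), len(full), len(set(full)))
```

Sampling 100 percent without replacement simply reorders the full sample, so every tree would see the same records; that setting only adds variety when the percentage is below 100.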
Connect a Browse tool to each output anchor to view results.
- O anchor: Displays the model name and size of the object in the Results window.
- R anchor: Displays a report of the model that includes a summary and plots.
The Forest Model tool supports Microsoft SQL Server 2016 in-database processing. See In-Database Overview for more information about in-database support and tools.
When a Forest Model tool is placed on the canvas with another In-DB tool, the tool automatically changes to the In-DB version. To change the version of the tool, right-click the tool, point to Choose Tool Version, and click a different version of the tool. See Predictive Analytics for more about predictive in-database support.
Connect an in-database data stream that includes a target field of interest along with one or more possible predictor fields.
If the input is from a SQL Server or Teradata in-database data stream, then the Microsoft Machine Learning Server rxDForest function (from the RevoScaleR package) is used for model estimation. This allows the processing to be done on the database server, as long as both the local machine and the server have been configured with Microsoft Machine Learning Server, and can result in a significant improvement in performance.
- Model name: Each model needs a name so it can be identified later. Either provide a name or have one generated automatically. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case sensitive.
- Select the target variable: Select the field from the data stream you want to predict.
- Select the predictor variables: Choose the fields from the data stream you believe "cause" changes in the value of the target variable.
- Number of trees to use: Select the number of tree models to include in the forest. The default is 500 based on the finding of Breiman. Decrease the value with an XDF metadata stream if the length of model runtime is a concern.
- Select a specific number of variables to select between at each split: Select the number of variables to be considered at each split.
- Use sampling weights for model estimation: Select the check box, then select a weight field from the data stream to estimate a model that uses sampling weights. If a field is used as both a predictor and the weight variable, the weight variable appears in the model call in the output with the string "Right_" prepended to it.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
- Directly limit the overall size of each model tree
- The total allowable nodes in a tree: Select for models that use the open source R randomForest model.
- The maximum depth of a model tree: Select for models that use the RevoScaleR rxDForest function.
- The minimum number of records allowed in a tree node: Select a value to control the size of the smallest allowed terminal node in each ensemble tree. Increasing this number will reduce the total number of nodes in each tree.
- Select the records for the creation of each model with replacement: Select to control whether the bootstrap replicates are drawn from the full estimation sample with or without replacement.
- The percentage of the data records to sample from to create each tree: Select to control whether all or only part of the full estimation sample will be used for forming each bootstrap replicate.
- Plot size: Select inches or centimeters for the size of the graph.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
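The effect of the plot size and graph resolution options on the output image can be estimated with simple arithmetic: pixels equal inches multiplied by dots per inch. The sketch below assumes that relationship holds for the tool's plots; the function name and the example dimensions are hypothetical, and only the three dpi values come from the option list.

```python
CM_PER_INCH = 2.54
DPI = {"1x": 96, "2x": 192, "3x": 288}  # resolutions listed in the options


def pixel_size(width, height, units="inches", resolution="1x"):
    """Return the (width, height) of the rendered graph in pixels."""
    if units == "centimeters":
        width, height = width / CM_PER_INCH, height / CM_PER_INCH
    elif units != "inches":
        raise ValueError("units must be 'inches' or 'centimeters'")
    dpi = DPI[resolution]
    return round(width * dpi), round(height * dpi)


print(pixel_size(5.5, 5.5, "inches", "1x"))
print(pixel_size(5, 5, "inches", "3x"))
```

Moving from 1x to 3x triples each dimension, so the image holds nine times as many pixels, which is why higher resolutions produce noticeably larger files.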
Connect a Browse tool to each output anchor to view results.
- O anchor: Displays the model name and size of the object in the Results window.
- R anchor: Displays a report of the model that includes a summary and plots.