Decision Tree Tool
The Decision Tree tool creates a set of if-then split rules to optimize model creation criteria based on Decision Tree Learning methods. Rule formation is based on the target field type:
- If the target field is a member of a category set, a classification tree is constructed.
- If the target field is a continuous variable, a regression tree is constructed.
Use the Decision Tree tool when the target field is predicted using one or more predictor fields, as in a classification problem or a continuous-target regression problem.
This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R Tool.
The Decision Tree tool requires an input with:
- A target field of interest
- One or more predictor fields
The packages used in model estimation vary based on the input data stream.
- An Alteryx data stream uses the open source R rpart function.
- An XDF metadata stream, coming from either an XDF Input Tool or XDF Output Tool, uses the RevoScaleR rxDTree function.
- Data from an SQL Server in-database data stream uses the RevoScaleR rxBTrees function.
- A Microsoft Machine Learning Server installation leverages the RevoScaleR rxBTrees function for data in SQL Server or Teradata databases. This requires the local machine and the server to be configured with Microsoft Machine Learning Server, which allows processing to occur on the database server and results in a significant performance improvement.
RevoScaleR capabilities
Compared to the open source R functions, the RevoScaleR-based function can analyze much larger datasets. However, the RevoScaleR-based function must create an XDF file, which increases the overhead cost, uses an algorithm that makes more passes over the data, increasing runtime, and cannot create some model diagnostic outputs.
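For orientation, the sketch below shows how the two back ends might be called directly in R. The data frame `dat`, the XDF file `dat.xdf`, and the field names are hypothetical, and the rxDTree call assumes a RevoScaleR (Microsoft Machine Learning Server) installation.

```r
# Minimal sketch: the same classification tree fit with the open source
# back end (rpart) and the RevoScaleR back end (rxDTree).
library(rpart)

fit_rpart <- rpart(Churn ~ Tenure + MonthlyCharges, data = dat,
                   method = "class")

# RevoScaleR reads the data from an XDF file and bins numeric predictors,
# trading some precision and diagnostics for the ability to handle much
# larger datasets. Requires Microsoft Machine Learning Server.
# fit_xdf <- RevoScaleR::rxDTree(Churn ~ Tenure + MonthlyCharges,
#                                data = "dat.xdf")
```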
These options are required to generate a decision tree model.
- Type model name: A name for the model that can be referenced by other tools. The model name or prefix must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case sensitive.
- Select target variable: The data field to be predicted, also known as a response or dependent variable.
- Select predictor variables: The data fields used to influence the value of the target variable, also known as features or independent variables. At least one predictor field is required, but there is no upper limit on the number of predictor fields selected. Because the target variable itself should not be used in calculating its own value, do not include the target field among the predictor fields.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
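To make the target and predictor selections concrete, here is a minimal sketch of the formula they translate into, assuming a hypothetical data frame `dat` whose `CustomerID` column is a unique identifier:

```r
# Minimal sketch: the selections above expressed as an R formula.
library(rpart)

target     <- "Churn"
predictors <- setdiff(names(dat), c(target, "CustomerID"))  # drop the unique ID

# Builds a formula such as Churn ~ Tenure + Region + MonthlyCharges;
# the object name plays the role of the model name.
Churn_Tree <- rpart(reformulate(predictors, response = target), data = dat)
```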
Click Customize to adjust additional settings.
Model: The options that change how the model evaluates data and is built.
Choose algorithm: Select the rpart function or the C5.0 function.
rpart: An algorithm based on the work of Breiman, Friedman, Olshen, and Stone; considered the standard. Use rpart if you are creating a regression model or if you need a pruning plot.
Model Type and Sampling Weights: Controls for the type of model based on the target variable and the handling of sampling weights.
- Model Type: The type of model used to predict the target variable.
- Auto: The model type is automatically selected based on the target variable type.
- Classification: The model predicts a discrete text value of a category or group.
- Regression: The model predicts continuous numeric values.
- Use sampling weights in model estimation: An option that allows you to select a field that indicates the importance placed on each record; each record is weighted accordingly when the model is estimated.
If a field is used as both a predictor and a sample weight, the output weight variable field is prepended with “Right_”.
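A minimal sketch of how a sampling-weight field enters the estimation, assuming a hypothetical data frame `dat` with a numeric `Weight` column:

```r
# Minimal sketch: each record's contribution is scaled by its weight.
library(rpart)

wt_fit <- rpart(Churn ~ Tenure + MonthlyCharges,
                data    = dat,
                weights = Weight,     # importance placed on each record
                method  = "class")
```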
Splitting Criteria and Surrogates: Controls for how the model determines a split and how surrogates are used in assessing data patterns.
- The splitting criteria to use: Select the criterion the model uses to determine when a tree should be split.
- Gini coefficient
- Information index
- Use surrogates to: Select the method for using surrogates in the splitting process. Surrogates are variables related to the primary variable that are used to determine the split outcome for a record with missing information.
- Omit observations with missing value for primary split rule: The record missing the candidate variable is not considered in determining the split.
- Split records missing the candidate variable: All records missing the candidate variable are distributed evenly on the split.
- Send observation in majority direction if all surrogates are missing: All records missing the candidate variable are pushed to the side of the split that contains more records.
- Select best surrogate split using: Select the criteria for choosing the best variable to split on from a set of possible variables.
- Number of correct classifications for a candidate variable: Chooses the variable to split on based on the total number of records that are correctly classified.
- Percentage of correct classifications for a candidate variable: Chooses the variable to split on based on the percentage of records that are correctly classified.
When a Regression model is used, the splitting criterion is always least squares.
If no splitting criterion is selected, the Gini coefficient is used.
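For the rpart algorithm, these choices correspond to the `parms` and surrogate arguments of the underlying function. A minimal sketch with hypothetical data:

```r
# Minimal sketch: splitting criterion and surrogate handling in rpart.
library(rpart)

surr_fit <- rpart(
  Churn ~ Tenure + Region + MonthlyCharges,
  data    = dat,
  method  = "class",
  parms   = list(split = "information"),  # "gini" (default) or "information"
  control = rpart.control(
    usesurrogate   = 2,  # 0 = omit record, 1 = split on surrogates, 2 = majority direction
    surrogatestyle = 0   # 0 = number correct, 1 = percent correct over non-missing values
  )
)
```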
HyperParameters: Controls that govern how the tree is grown and pruned.
- The minimum number of records needed to allow for a split: Set the number of records that must exist before a split occurs. If there are fewer records than the minimum number, then no further splits are allowed.
- The allowed minimum number of records in a terminal node: Set the minimum number of records that must be in a terminal node. A lower number increases the potential number of final terminal nodes at the end of the tree.
- The number of folds to use in the cross-validation to prune the tree: Set the number of groups (N) the data should be divided into when testing the model. The default is 10, but other common values are 5 and 20. A higher number of folds gives more reliable estimates but may take longer to process. When the tree is pruned by using a complexity parameter, cross-validation determines how many splits, or branches, are in the tree: N - 1 of the folds are used to create a model, and the remaining fold is used as a holdout sample to determine the number of branches that best fits it, in order to avoid overfitting.
- The maximum allowed depth of any node in the final tree: Set the number of levels of branches allowed from the root node to the most distant node from the root to limit the overall size of the tree.
- The maximum number of bins to use for each numeric variable: Enter the number of bins to use for each variable. By default, the value is calculated based on the minimum number of records needed to allow for a split.
- Set complexity parameter: A value that controls the size of the decision tree. A smaller value results in more branches in the tree, and a larger value results in fewer branches. If a complexity parameter is not selected, the parameter is determined based on cross-validation.
XDF metadata stream only
This option only applies when the input into the tool is an XDF metadata stream. The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal-interval binning process to reduce computational complexity.
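A minimal sketch of these hyperparameters as they are passed to the underlying functions; the values are illustrative, the data is hypothetical, and the rxDTree call assumes a RevoScaleR installation:

```r
# Minimal sketch: hyperparameters as rpart.control arguments (Alteryx
# data streams) and, for XDF inputs, the bin cap on RevoScaleR::rxDTree.
library(rpart)

ctl <- rpart.control(
  minsplit  = 20,    # minimum records needed to attempt a split
  minbucket = 7,     # minimum records allowed in a terminal node
  xval      = 10,    # cross-validation folds used to prune the tree
  maxdepth  = 30,    # maximum depth from the root to the deepest node
  cp        = 0.01   # complexity parameter; smaller values grow more branches
)
hp_fit <- rpart(Churn ~ Tenure + MonthlyCharges, data = dat,
                method = "class", control = ctl)

# XDF metadata stream only: cap the number of bins per numeric variable.
# xdf_fit <- RevoScaleR::rxDTree(Churn ~ Tenure + MonthlyCharges,
#                                data = "dat.xdf", maxNumBins = 200)
```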
C5.0: An algorithm based on the work of Quinlan; use C5.0 if your data is sorted into one of a small number of mutually exclusive classes. Properties that may be relevant to the class assignment are provided, although some data may have unknown or non-applicable values.
Structural Options: Controls for the model's structure. By default, the model is structured as a decision tree.
- Decompose tree into rule-based model: Change the structure of the output algorithm from a decision tree into a collection of unordered, simple if-then rules.
- Threshold number of bands to group rules into: Select to group the rules into bands, where the number set is the band threshold.
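With the open source C50 package, these structural options correspond to the `rules` argument and the `bands` control. A minimal sketch with hypothetical data (the target must be a categorical factor):

```r
# Minimal sketch: decompose the tree into rules and group them into bands.
library(C50)

rule_fit <- C5.0(
  Churn ~ Tenure + Region + MonthlyCharges,
  data    = dat,
  rules   = TRUE,                     # rule-based model instead of a tree
  control = C5.0Control(bands = 10)   # group the rules into 10 bands
)
```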
Detailed Options: Controls for the model's splits and features.
- Model should evaluate groups of discrete predictors for splits: Group categorical predictor variables together. Select to reduce overfitting when there are important discrete attributes that have more than four or five values.
- Use predictor winnowing (i.e. feature selection): Select to simplify the model by attempting to exclude non-useful predictors.
- Prune tree: Select to simplify the tree to reduce overfitting by removing tree splits.
- Evaluate advanced splits in the data: Select to perform evaluations with secondary variables to confirm what branch is the most accurate prediction.
- Use stopping method for boosting: Select to evaluate if boosting iterations are becoming ineffective and, if so, stop boosting.
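The mapping below onto `C5.0Control` arguments in the open source C50 package is an assumption based on the option descriptions; the data is hypothetical.

```r
# Minimal sketch: detailed options expressed as C5.0Control arguments.
library(C50)

detail_fit <- C5.0(
  Churn ~ Tenure + Region + MonthlyCharges,
  data    = dat,
  control = C5.0Control(
    subset          = TRUE,   # evaluate groups of discrete predictors for splits
    winnow          = TRUE,   # predictor winnowing (feature selection)
    noGlobalPruning = FALSE,  # FALSE = prune the tree
    fuzzyThreshold  = TRUE,   # evaluate advanced splits in the data
    earlyStopping   = TRUE    # stop boosting when iterations become ineffective
  )
)
```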
Numerical Hyperparameters: Numeric controls that govern how the model is built.
- Select number of boosting iterations: Select the number of boosting iterations; select 1 to use a single model without boosting.
- Select confidence factor: This is the analog of rpart’s complexity parameter.
- Select number of samples that must be in at least 2 splits: A larger number gives a smaller, simpler tree.
- Percent of data held from training for model evaluation: Select the portion of the data held out of model training. Use the default value 0 to use all of the data to train the model. Select a larger value to hold out that percentage of the data from training and use it to evaluate model accuracy.
- Select random seed for algorithm: Select the value of the seed. The value must be a positive integer.
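A minimal sketch of the numeric hyperparameters with the C50 package; the values are illustrative only, and `sample` follows the C50 convention of giving the proportion of data randomly drawn for training.

```r
# Minimal sketch: numeric hyperparameters for a boosted C5.0 model.
library(C50)

num_fit <- C5.0(
  Churn ~ Tenure + Region + MonthlyCharges,
  data    = dat,
  trials  = 10,              # boosting iterations; 1 fits a single model
  control = C5.0Control(
    CF       = 0.25,         # confidence factor, the analog of rpart's cp
    minCases = 2,            # samples that must fall in at least 2 splits
    sample   = 0.90,         # train on a random 90%; the rest checks accuracy
    seed     = 1234          # positive integer seed for reproducibility
  )
)
```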
Cross-Validation: Controls for customizing a method of validation with efficient use of available information.
- Use cross-validation to determine estimates of model quality: Select to perform cross-validation to obtain various model quality metrics and graphs. Some metrics and graphs are displayed in the R output, and others are displayed in the I output.
- Number of cross-validation folds: The number of subsamples the data is divided into for validation or training. A higher number of folds results in more robust estimates of model quality, but fewer folds make the tool run faster.
- Number of cross-validation trials: The number of times the cross-validation procedure is repeated. The folds are selected differently in each trial, and the results are averaged across all the trials. A higher number of trials results in more robust estimates of model quality, but fewer trials make the tool run faster.
- Set seed for external cross-validation: A value that determines the sequence of draws for random sampling. Setting a seed causes the same records to be chosen each time, even though the method of selection is random and not data-dependent.
- Select value of random seed for cross-validation: Select the value of the seed. The value must be a positive integer.
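The sketch below illustrates the general idea of repeated k-fold cross-validation with a fixed seed; it is a generic illustration built around an rpart model with hypothetical data, not the tool's internal implementation.

```r
# Minimal sketch: repeated k-fold cross-validation of a decision tree.
library(rpart)

set.seed(42)          # seed so the same records land in the same folds each run
folds  <- 5
trials <- 3
acc <- numeric(0)

for (t in seq_len(trials)) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(dat)))  # new folds per trial
  for (k in seq_len(folds)) {
    fit  <- rpart(Churn ~ Tenure + MonthlyCharges,
                  data = dat[fold_id != k, ], method = "class")
    pred <- predict(fit, newdata = dat[fold_id == k, ], type = "class")
    acc  <- c(acc, mean(pred == dat$Churn[fold_id == k]))
  }
}
mean(acc)   # model-quality estimate averaged over all folds and trials
```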
Plots: Select and configure what graphs appear in the output report.
Display static report: Select to display a summary report of the model from the R output anchor. Selected by default.
Tree Plot: A graph of decision tree variables and branches.
Display tree plot: Click to include a graph of decision tree variables and branches in the model report output.
- Uniform branch distances: Select whether the tree branches are displayed with uniform length or with length proportional to the relative importance of a split in predicting the target.
- Leaf summary: Determine what is displayed on the final leaf nodes in the tree plot. Select Counts to display the number of records, or Proportions to display the percentage of total records.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
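These options map naturally onto the base R plotting functions for rpart objects and a graphics device; a minimal sketch assuming an existing fit `my_tree` (the file name and sizes are illustrative):

```r
# Minimal sketch: tree plot with uniform branches and counts in the leaves,
# written to a PNG whose size, resolution, and base font are configurable.
library(rpart)

png("tree_plot.png", width = 6, height = 4, units = "in",
    res = 192, pointsize = 10)          # 2x resolution, 10-point base font
plot(my_tree, uniform = TRUE)           # uniform branch distances
text(my_tree, use.n = TRUE)             # leaf summary: show record counts
dev.off()
```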
Prune Plot: A simplified graph of the decision tree.
Display prune plot: Click to include a simplified graph of the decision tree in the model report output.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): Set the size of the font in the graph.
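For rpart models, the prune plot corresponds to the complexity-parameter table; a minimal sketch assuming an existing fit `my_tree`:

```r
# Minimal sketch: pruning diagnostics and an optional prune step.
library(rpart)

plotcp(my_tree)    # cross-validated error versus complexity parameter
printcp(my_tree)   # the same table in text form

# Optionally prune back to the cp with the lowest cross-validated error.
best_cp <- my_tree$cptable[which.min(my_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(my_tree, cp = best_cp)
```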
The Decision Tree tool supports Microsoft SQL Server 2016 and Teradata in-database processing. See In-Database Overview for more information about in-database support and tools.
When a Decision Tree tool is placed on the canvas with another In-DB tool, the tool automatically changes to the In-DB version. To change the version of the tool, right-click the tool, point to Choose Tool Version, and click a different version of the tool. See Predictive Analytics for more about predictive in-database support.
- Model name: Each model needs to be given a name so it can later be identified.
- A specific model name: Enter the model name you want to use. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case sensitive.
- Automatically generate a model name: Designer automatically generates a model name that meets the required parameters.
- Select the target variable: Select the field from the data stream you want to predict.
- Select the predictor variables: Choose the fields from the data stream you believe "cause" changes in the value of the target variable.
- Use sampling weights in model estimation (Optional): Select to choose a field from the input data stream to use as a sampling weight.
- Select the sampling weight field: Select a weight field from the data stream to estimate a model that uses sampling weights. A field can be used as both a predictor and the weight variable.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
The weight variable appears in the model call in the output with the string "Right_" prepended to it.
- Model type: Select what type of model is going to be used.
- Classification: A model to predict a categorical target. If using a classification model, also select the splitting criteria.
- Gini coefficient
- Entropy-based Information index
- Regression: A model to predict a continuous numeric target.
- The minimum number of records needed to allow for a split: If, along a set of branches of the tree, there are fewer records than the selected minimum number, then no further splits are allowed.
- Complexity parameter: This parameter controls how splits are carried out (i.e., the number of branches in the tree). The value should be under 1, and the smaller the value, the more branches in the final tree. A value of "Auto" or omitting a value will result in the "best" complexity parameter being selected based on cross-validation.
- The allowed minimum number of records in a terminal node: The smallest number of records that must be contained in a terminal node. Decreasing this number increases the potential number of final terminal nodes.
- Surrogate use: This group of options controls how records with missing data in the predictor variables at a particular split are handled. The first choice is to omit (remove) a record with a missing value of the variable used in the split. The second is to use "surrogate" splits, in which the direction a record is sent is based on alternative splits on one or more other variables with nearly the same results. The third choice is to send the observation in the majority direction at the split.
- Omit an observation with a missing value for the primary split rule
- Use surrogates in order to split records missing the candidate variable
- If all surrogates are missing, send the observation in the majority direction
- Select the best surrogate split using either:
- The total number of correct classifications for a potential candidate variable
- The percentage correct calculated over the non-missing values of a candidate variable
- The number of folds to use in the cross validation to prune the tree: When the tree is pruned through the use of a complexity parameter, cross validation is used to determine how many splits, and thus branches, are in the tree. N - 1 of the folds are used to create a model, and the Nth fold is used as a holdout sample to determine the number of branches that best fits it, in order to avoid overfitting. You can set the number of groups (N) into which the data is divided. The default is 10, but other common values are 5 and 20.
- The maximum allowed depth of any node in the final tree: This option limits the overall size of the tree by indicating how many levels are allowed from the root node to the most distant node from the root.
- The maximum number of bins to use for each numeric variable: The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal-interval binning process to reduce computational complexity. The default value is calculated from the minimum number of records needed to allow for a split, but the value can be set manually. This option only applies when the input into the tool is an XDF metadata stream.
- Tree plot: This set of options controls a number of options associated with plotting a decision tree.
- Leaf summary: Controls whether counts or proportions are printed in the final leaf nodes in the tree plot.
- Counts: Display the number of records in each final leaf node.
- Proportions: Display the percentage of total records in each final leaf node.
- Uniform branch distances: Controls whether the drawn tree branches are of uniform length or have lengths that reflect the relative importance of a split in predicting the target.
- Pruning plot: This option allows you to set the size, resolution, and base font of the pruning plot in an analogous way to the tree plot.
- Plot size: Set the dimensions of the output tree plot.
- Inches: Set the Width and Height of the plot.
- Centimeters: Set the Width and Height of the plot.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): The font size in points.
Connect a Browse tool to each output anchor to view results.
- O (Output): Displays the model name and size of the object in the Results window.
- R (Report): Displays a summary report of the model that includes a summary and plots.
- I (Interactive): Displays an interactive dashboard of supporting visuals that allows you to zoom, hover, and click.
Expected behavior: plot precision
When using the Decision Tree tool for standard processing, the Interactive output shows greater precision with numeric values than the Report output.