Boosted Model Tool
The Boosted Model tool creates generalized boosted regression models based on Gradient Boosting methods. The models are created by serially adding simple decision tree models to a model ensemble to minimize an appropriate loss function. These models use a method of statistical learning that:
- self-determines which subset of fields best predicts a target field.
- is able to capture non-linear relationships and interactions between fields.
- can automatically address a broad range of regression and classification problems.
Use the Boosted Model tool for classification, count data, and continuous target regression problems.
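As background, the following is a minimal sketch of the kind of gradient-boosted model described above, written with the open-source R gbm package (named later in this article as the function used for Alteryx data streams). The data set and settings are illustrative only and are not what the tool runs internally.

```r
# Minimal sketch: fit a gradient-boosted regression model in R with gbm.
# The built-in mtcars data set stands in for an Alteryx data stream.
library(gbm)

set.seed(1)
fit <- gbm(
  mpg ~ cyl + disp + hp + wt,    # target field ~ predictor fields
  data = mtcars,
  distribution = "gaussian",     # continuous target, squared error loss
  n.trees = 500,                 # simple trees are added serially to the ensemble
  interaction.depth = 1,
  shrinkage = 0.01
)

summary(fit)                     # relative influence of each predictor
```

Each added tree fits the remaining error of the ensemble so far, which is how the method captures non-linear relationships and identifies the predictors that matter most.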
This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool.
The Boosted Model tool requires an input data stream with:
- A target field of interest
- Two or more predictor fields
The packages used in model estimation vary based on the input data stream.
- An Alteryx data stream uses the open source R gbm function.
- An XDF metadata stream, coming from either an XDF Input Tool or XDF Output Tool, uses the RevoScaleR rxBTrees function.
- Data from a SQL Server in-database data stream uses the RevoScaleR rxBTrees function.
- A Microsoft Machine Learning Server installation uses the RevoScaleR rxBTrees function for data in SQL Server or Teradata databases. This requires the local machine and the server to be configured with Microsoft Machine Learning Server, which allows processing on the database server and results in a significant performance improvement.
Algorithm performance
Compared to the open source R functions, the RevoScaleR-based function can analyze much larger datasets. However, the RevoScaleR-based function must create an XDF file (which adds overhead), uses an algorithm that makes more passes over the data (which increases run time), and cannot create some model diagnostic outputs.
- Required parameters: The basic fields needed to generate a boosted model.
- Model name: A name for the model that can be referenced by other tools. The model name or prefix must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case sensitive.
- Select the target variable: The data field to be predicted, also known as a response or dependent variable.
- Select the predictor fields: The data fields used to predict the value of the target variable, also known as features or independent variables. At least two predictor fields are required, but there is no upper limit on the number of predictor fields selected. Do not include the target field itself among the predictor fields.
- Use sampling weights in model estimation: An option that allows you to select a field that weights the importance placed on each record when creating a model estimation.
If a field is used as both a predictor and a sampling weight, the output weight field name is prefixed with Right_.
- Select the sampling weight field: The field used to weight the records.
- Include marginal effect plots?: An option to include plots in the report that show the relationship between a predictor field and the target, averaging over the effect of the other predictor fields.
- The minimal level of importance of a field to be included in the plots: A percentage value that indicates the minimum predictive importance a field must have to receive a marginal effect plot. A higher percentage reduces the number of marginal effect plots produced. (The sketch after this list shows how sampling weights and marginal effect plots look in R.)
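The sketch below is illustrative only: it shows roughly how the sampling-weight and marginal effect plot options above are expressed with the open-source gbm function. The data frame and field names (sqft, rooms, wt, price) are hypothetical.

```r
# Hypothetical data: predict price from sqft and rooms, weighting each record by wt.
library(gbm)

set.seed(1)
df <- data.frame(
  sqft  = runif(200, 50, 300),
  rooms = sample(1:6, 200, replace = TRUE),
  wt    = runif(200, 0.5, 2)              # per-record sampling weight
)
df$price <- 1000 * df$sqft + 5000 * df$rooms + rnorm(200, sd = 20000)

fit <- gbm(
  price ~ sqft + rooms,
  data = df,
  weights = df$wt,                        # use sampling weights in model estimation
  distribution = "gaussian",
  n.trees = 300
)

summary(fit)                              # relative importance of each predictor
plot(fit, i.var = "sqft", n.trees = 300)  # marginal effect of one predictor on the target
```

Predictors whose relative importance falls below the configured minimum level are simply skipped when the report's marginal effect plots are produced.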
- Model customization: Optional settings that customize the output model based on the target type and how decision trees are managed.
- Specify target type and the loss function distribution: The category of data in the target field and the associated function that works to optimize model creation.
- Continuous target: A numeric target in which any given unique value comprises a small percentage of the total instances, such as yearly sales per store.
For a continuous target, minimize a loss function based on one of the following distributions:
- Gaussian (squared error loss)
- Laplace (absolute value loss)
- t-distribution loss
- Count (integer) target: A numeric target for which most unique values comprise a large percentage of the total instances, such as the number of visits to a doctor’s office a person makes in a year.
For a count target, minimize a loss function based on the Poisson distribution.
- Binary (two outcomes) categorical: A categorical target with two possible outcomes, such as yes-no categorization.
For a binary categorical target, minimize a loss function based on one of the following distributions:
- Bernoulli (logistic regression)
- AdaBoost (exponential loss)
- Multinomial (three or more outcomes) categorical: A categorical target field with a limited number of discrete outcomes, such as A, B, or C categorization.
For a multinomial categorical target, minimize a loss function based on a multinomial logistic loss function, a multinomial generalization of the Bernoulli loss function.
- The maximum number of trees in the model: The number of decision trees that the algorithm can include in the final model. The default value is 4000. A higher number of trees increases the run time.
- Method to determine the final number of trees in the model: The method used to determine the number of decision trees that adequately capture the predictive behavior without over-fitting the sample data.
- Cross validation: Method of validation with efficient use of available information. Recommended in cases with limited data.
- Number of cross validation folds: The number of subsamples the data is divided into for validation or training. The default value is 5. Common values are 5 and 10. In a case with 5 folds, the data is divided into 5 unique subsamples and 5 different models are created, each using data from 4 of the subsamples. The final subsample is withheld from model creation, and is used to test prediction accuracy.
- Number of machine cores to use in cross validation: The number of machine cores used in analysis. The default value is 1. The number used should always be less than the number of available cores. To increase computation speed, increase the number of cores used.
- Test (validation) sample: Method of validation that pulls samples from the training data. Recommended in cases with many records.
- The percentage in the estimation (training) sample: The percentage of records used in the training sample, with the remainder used in the test sample. The default value is 50. Common values are 50% and 75%. If 50% of the records are used in the training sample, the remaining 50% is used to test prediction accuracy.
- Out-of-bag: Method of validation that uses records that were excluded in model creation.
- The fraction of observations used in the out-of-bag sample: A sampling percentage used to guide the appropriate number of trees to include in the model to avoid overfitting. The default value is 50%. Common values are between 25% and 50%.
- Shrinkage: A value between 0 and 1 used to place weight on each tree added to the model. The default value is 0.0020. Smaller values allow more trees to be included in the model, which increases run time.
A small shrinkage value may require increasing the maximum number of trees in the model to guarantee that an optimal number of trees is reached.
- Interaction depth: The level of interaction between predictor fields. For example, a three-way interaction indicates that one predictor depends on two other predictors to determine the impact on the target field. The default value is Linear, with the assumption of no interactions between predictor fields. Increasing the depth increases the run time.
- Minimum required number of objects in each tree node: A parameter that verifies a decision tree is of sufficient size before allowing the addition of another decision tree. The default is 10. Increasing the value will result in smaller decision trees.
- Random seed value: A value that determines the sequence of draws for random sampling. Using the same seed selects the same records each run, even though the selection method itself is random and not data-dependent. Change the value to change the sequence of random draws. (The sketch after this section shows how these customization options map onto the underlying R function.)
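For reference, here is a hedged sketch of how the customization options above roughly correspond to arguments of the open-source gbm function used for Alteryx data streams. The data, target type, and values are invented for illustration; this is not the tool's actual implementation.

```r
# Illustrative binary classification example; each argument below mirrors one of
# the options described in this section.
library(gbm)

set.seed(1)                        # random seed value
df <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
df$y <- ifelse(df$x1 + df$x2 + rnorm(500) > 0, 1, 0)

fit <- gbm(
  y ~ x1 + x2,
  data = df,
  distribution = "bernoulli",      # target type / loss function: "gaussian", "laplace",
                                   # "tdist", "poisson", "bernoulli", "adaboost", "multinomial"
  n.trees = 4000,                  # maximum number of trees in the model
  cv.folds = 5,                    # number of cross validation folds
  n.cores = 1,                     # machine cores used in cross validation
  bag.fraction = 0.5,              # fraction of observations in the out-of-bag sample
  shrinkage = 0.0020,              # shrinkage
  interaction.depth = 1,           # 1 = additive model, no interactions
  n.minobsinnode = 10              # minimum required number of objects in each tree node
)

# Determine the final number of trees without over-fitting; method = "cv", "test",
# or "OOB" corresponds to the three validation methods described above.
best.iter <- gbm.perf(fit, method = "cv")
best.iter
```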
- Graphics Options: The settings of the output graph. The defaults are used unless customized.
- Plot size: The size of the output graph. Select the units, then set the values for width and height.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): The font size in points.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
The Boosted Model tool supports Microsoft SQL Server 2016 in-database processing. See In-Database Overview for more information about in-database support and tools.
To access the In-DB version of the Boosted Model tool:
- Place an In-Database tool on the canvas. The Boosted Model tool automatically changes to Boosted Model In-DB.
- Right-click the Boosted Model tool, point to Choose Tool Version, and select Boosted Model In-DB.
See Predictive Analytics for more about predictive in-database support.
- Required parameters: The basic fields needed to generate a boosted model.
- Each model created requires a name that can be referenced by other tools. In-DB processing allows for two model name creation methods:
- Specific model name: A user-determined model name. The model name or prefix must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case sensitive.
- Generated model name: The model name is automatically generated.
- Select the target variable: The data field to be predicted, also known as a response or dependent variable.
- Select the predictor variables: The data fields used to predict the value of the target variable, also known as features or independent variables. At least two predictor fields are required, but there is no upper limit on the number of predictor fields selected. Do not include the target field itself among the predictor fields.
- Use sampling weights in model estimation: An option that allows you to select a field that weights the importance placed on each record when creating a model estimation.
If a field is used as both a predictor and a sampling weight, the output weight field name is prefixed with Right_.
- Select the sampling weight field: The field used to weight the records.
- Model customization: Optional settings that customize the output model based on the target type and how decision trees are managed.
- Specify target type and the loss function distribution:
- Continuous target: A numeric target in which any given unique value comprises a small percentage of the total instances, such as yearly sales per store.
For a continuous target, minimize a loss function based on the Gaussian distribution.
- Binary categorical target: A categorical target with two possible outcomes, such as yes-no categorization.
For a binary categorical target, minimize a loss function based on the Bernoulli distribution.
- Multinomial categorical target: A categorical target field with a limited number of discrete outcomes, such as A, B, or C categorization.
For a multinomial categorical target, minimize a loss function based on a multinomial logistic loss function, a multinomial generalization of the Bernoulli loss function.
- The maximum number of trees in the model: The number of decision trees that the algorithm can include in the final model. The default value is 4000. A higher number of trees increases the run time.
- The fraction of observations used in the out-of-bag sample: A sampling percentage used, via an out-of-bag assessment, to reduce the number of decision trees included. The default value is 50%. Common values are between 25% and 50%.
- Shrinkage weight: A value between 0 and 1 used to place weight on each tree added to the model. The default value is 0.0020. Smaller values allow more trees to be included in the model, which increases run time.
A small shrinkage value may require increasing the maximum number of trees in the model to guarantee that an optimal number of trees is reached.
- Tree size: To mimic the default tree size settings in the standard Boosted Model tool, use the default values. For more information, see rxBTrees controls and the sketch after this section.
- maxDepth: Maximum depth of any tree node [1000]
- minBucket: Minimum required number of observations in a terminal node (or leaf) [10]
- minSplit: Minimum number of observations that must exist in a node before a split is attempted [minBucket * 2]
- Random seed value: A value that determines the sequence of draws for random sampling. Using the same seed selects the same records each run, even though the selection method itself is random and not data-dependent. Change the value to change the sequence of random draws.
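For reference, here is a heavily hedged sketch of how the tree-size controls above might be passed to the RevoScaleR rxBTrees function. RevoScaleR ships only with Microsoft Machine Learning Server / SQL Server Machine Learning Services, and the data source, field names, and exact argument spellings below are assumptions for illustration rather than output of this tool.

```r
# Requires a Microsoft R installation that provides RevoScaleR (not open-source R).
library(RevoScaleR)

fit <- rxBTrees(
  default ~ creditScore + yearsEmploy,   # hypothetical target and predictor fields
  data = "mortData.xdf",                 # hypothetical XDF file or in-database data source
  lossFunction = "bernoulli",            # binary categorical target
  nTree = 4000,                          # maximum number of trees in the model
  learningRate = 0.0020,                 # shrinkage weight
  maxDepth = 1000,                       # maximum depth of any tree node
  minBucket = 10,                        # minimum observations in a terminal node (leaf)
  minSplit = 20,                         # minimum observations in a node before a split (minBucket * 2)
  seed = 1
)
```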
- Graphics Options: The settings of the output graph. The defaults are used unless customized.
- Plot size: Select the units, then set the values for width and height.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): The font size in points.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
Connect a Browse tool to each output anchor to view results.
- O anchor: Outputs the model name and size in the Results window.
- R anchor: Displays a Report of the model that includes a summary and any plots configured.