Linear Regression Tool

One Tool Example

Linear Regression has a One Tool Example. Visit Sample Workflows to learn how to access this and many other examples directly in Alteryx Designer.

The Linear Regression tool creates a simple model to estimate values or to evaluate relationships between variables, based on a linear relationship.

The two main types of linear regression are non-regularized and regularized:

  • Non-regularized linear regression produces linear models that minimize the sum of squared errors between the actual and predicted values of the training data target variable.

  • Regularized linear regression balances the same minimization of sum of squared errors with a penalty term on the size of the coefficients and tends to produce simpler models that are less prone to overfitting.
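
The difference between the two types can be sketched for a single predictor. This is a minimal illustration, not the tool's implementation: the function name fit_line and its penalty argument are hypothetical, and the penalty shown is a ridge-style squared penalty on the slope only.

```python
def fit_line(xs, ys, penalty=0.0):
    """Fit y = intercept + slope * x by least squares.

    penalty=0.0 gives ordinary (non-regularized) least squares.
    A positive penalty adds penalty * slope**2 to the sum of squared
    errors, which shrinks the slope toward zero. The intercept is
    left unpenalized. (Illustrative helper, not part of Alteryx.)
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered cross-product and sum of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
plain = fit_line(xs, ys)                  # non-regularized fit
shrunk = fit_line(xs, ys, penalty=10.0)   # slope pulled toward zero
```

With the penalty set to zero, the slope is the usual least-squares estimate. As the penalty grows, the slope shrinks, which is the sense in which regularized models tend to be simpler.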

This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool. Visit Download and Use Predictive Tools.

R Packages Used by Linear Regression

| R Package | Type | Package Description |
| --- | --- | --- |
| AlteryxPredictive | Custom | This package provides custom functions and calls CRAN and custom R packages. |
| AlteryxRDataX | Custom | This package provides connectivity between Alteryx and R as well as a number of functions to facilitate the interaction between Alteryx and R. |
| AlteryxRviz | Custom | This package has been deprecated. It provides functions that drive interactive visualizations for the predictive tools in Alteryx (Time Series, Network Analysis). |
| flightdeck | Custom | This package makes it easy to create interactive dashboards for reporting outputs of predictive models. |

Configure the Tool for Standard Processing

Connect an Input

Connect an Alteryx data stream or XDF metadata stream that includes a target field of interest along with one or more possible predictor fields.

Note

XDF is the data file format used by Microsoft R Client (MRC) and Microsoft Machine Learning Server (MMLS).

If the input data is from an Alteryx data stream, then the open-source R lm function and the glmnet and cv.glmnet functions (from the glmnet package) are used for model estimation.

If the input data comes from either an XDF Output tool or XDF Input tool, then the RevoScaleR rxLinMod function is used for model estimation. The advantage of the RevoScaleR-based function is that it can analyze much larger (out-of-memory) datasets. The cost is the additional overhead of creating an XDF file and the inability to create some of the model diagnostic output that is available with the open-source R functions.

Configure the Tool

  • Model name: Enter a name for the model to identify the model when it is referenced in other tools. Model names must start with a letter and may contain letters, numbers, and the special characters period (.) and underscore (_). No other special characters are allowed, and R is case-sensitive.

  • Select the target variable: Select the data to be predicted. A target variable is also known as a response or dependent variable.

  • Select the predictor variables: Select the data to use to influence the value of the target variable. A predictor variable is also known as a feature or an independent variable. Any number of predictor variables can be selected, but the target variable should not also be a predictor variable. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
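
The model-name rule above can be checked before a workflow runs. The following is a minimal sketch of that rule as a regular expression. The helper is_valid_model_name is hypothetical and not part of Alteryx, and treating "letters" as ASCII letters only is an assumption.

```python
import re

# Starts with a letter, followed by any mix of letters, digits,
# periods, and underscores. ASCII-only letters are an assumption.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name):
    """Return True if name follows the model-name rule described above."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Sales_Model.v2", "2ndModel", "my model", "my-model"]:
    print(candidate, is_valid_model_name(candidate))
```

Because R is case-sensitive, names such as Sales_Model and sales_model both pass this check but refer to different models.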

Select Customize to modify the Model, Cross-validation, and Plots settings.

Customize the Model

  • Omit a model constant: Select to omit a constant and have the best fit line pass through the origin.

  • Use a weight variable for weighted least squares: Select a variable to determine the amount of importance to place on each record when creating a least-squares model.

  • Use regularized regression: Select to balance the same minimization of sum of squared errors with a penalty term on the size of the coefficients and produce a simpler model.

    • Enter value of alpha: Select a value between 0 (ridge regression) and 1 (lasso) to set the balance between the two types of coefficient penalty. Values between 0 and 1 blend the two.

    • Standardize predictor variables: Select to put all predictor variables on a common scale before the model is fit, so that the penalty treats each coefficient comparably.

    • Use cross-validation to determine model parameters: Select to perform cross-validation and obtain various model parameters.

      • Number of folds: Select the number of folds into which to divide the data. A higher number of folds results in more robust estimates of model quality, but fewer folds make the tool run faster.

      • What type of model: Select the type of model to determine the coefficients.

        • Simpler model

        • Model with lower in-sample standard error

      • Set seed: Select to ensure the reproducibility of cross-validation and select the value of the seed used to assign records to folds. Choosing the same seed each time the workflow is run guarantees that the same records will be in the same fold each time. The value must be a positive integer.
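
The alpha and standardization options above can be illustrated with a short sketch. The penalty below follows the elastic-net form documented for the glmnet package. Whether the tool passes these settings to glmnet unchanged is an assumption, and the names standardize, elastic_net_penalty, and lam are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale one predictor column to mean 0 and standard deviation 1."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (an assumption)
    return [(v - center) / spread for v in values]


def elastic_net_penalty(coefs, alpha, lam=1.0):
    """Penalty on coefficient size for a mixing value alpha in [0, 1].

    alpha = 0 uses only the squared (ridge) term,
    alpha = 1 uses only the absolute-value (lasso) term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    l1 = sum(abs(c) for c in coefs)
    l2 = sum(c * c for c in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


coefs = [3.0, -4.0]
ridge_term = elastic_net_penalty(coefs, alpha=0.0)
lasso_term = elastic_net_penalty(coefs, alpha=1.0)
blended = elastic_net_penalty(coefs, alpha=0.5)
```

Standardizing first matters because the penalty is applied to raw coefficient sizes. A predictor measured in large units would otherwise get a small coefficient and be penalized less than one measured in small units.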

Customize the Cross-Validation

  • Use cross-validation to determine estimates of model quality: Select to perform cross-validation and obtain various model quality metrics and graphs. Some metrics and graphs will be displayed in the static R output, and others will be displayed in the interactive I output.

    • Number of folds: Select the number of folds to divide the data. A higher number of folds results in more robust estimates of model quality, but fewer folds make the tool run faster.

    • Number of trials: Select the number of times to repeat the cross-validation procedure. The folds are selected differently in each trial, and the overall results are averaged across all the trials. A higher number of trials results in more robust estimates of model quality, but fewer trials make the tool run faster.

    • Set seed: Select to ensure the reproducibility of cross-validation and select the value of the seed used to assign records to folds. Choosing the same seed each time the workflow is run guarantees that the same records will be in the same fold each time. The value must be a positive integer.
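
The interaction of folds, trials, and the seed can be sketched as follows. This is an illustration only. The function assign_folds is hypothetical, and it uses Python's random number generator rather than R's, so the fold membership it produces will not match the tool's.

```python
import random


def assign_folds(n_records, n_folds, seed, n_trials=1):
    """Assign each record to a fold, once per trial.

    Returns a list with one entry per trial; each entry is a list of
    fold labels (0 .. n_folds - 1), one label per record.
    """
    rng = random.Random(seed)  # same seed -> same sequence of shuffles
    trials = []
    for _ in range(n_trials):
        # Balanced labels 0, 1, ..., n_folds - 1, 0, 1, ... then shuffled.
        labels = [i % n_folds for i in range(n_records)]
        rng.shuffle(labels)
        trials.append(labels)
    return trials


first_run = assign_folds(n_records=10, n_folds=5, seed=1, n_trials=3)
second_run = assign_folds(n_records=10, n_folds=5, seed=1, n_trials=3)
print(first_run == second_run)  # identical assignments for the same seed
```

Each trial reshuffles the records into folds, so averaging metrics across trials reduces the influence of any single split.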

Customize the Plots

  • Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).

    • Lower resolution creates a smaller file and is best for viewing on a monitor.

    • Higher resolution creates a larger file with better print quality.

  • Display graphs: Select to display graphs when using regularized regression.

View the Output

Connect a Browse tool to each output anchor to view results.

  • O (Output): Displays the model name and size of the object in the Results window.

  • R (Report): Displays a summary report of the model that includes a summary and plots.

  • I (Interactive): Displays a dashboard of interactive visualizations to support further data-discovery and model exploration.

Configure the Tool for In-Database Processing

The Linear Regression tool supports Oracle, Microsoft SQL Server 2016, and Teradata in-database processing. Visit In-Database Overview for more information about in-database support and tools.

When a Linear Regression tool is placed on the canvas with another In-DB tool, the tool automatically changes to the In-DB version. To change the version of the tool, right-click the tool, point to Choose Tool Version, and click a different version of the tool. Visit Predictive Analytics for more about predictive in-database support.

Connect an Input

Connect an in-database data stream that includes a target field of interest along with one or more possible predictor fields.

If the input is from a SQL Server or Teradata in-database data stream, then the Microsoft Machine Learning Server rxLinMod function (from the RevoScaleR package) is used for model estimation. This allows the processing to be done on the database server, as long as both the local machine and the server have been configured with Microsoft Machine Learning Server, and can result in a significant improvement in performance.

If the input is from an Oracle in-database data stream, then the Oracle R Enterprise ore.lm function (from the OREmodels package) is used for model estimation. This allows the processing to be done on the database server, as long as both the local machine and the server have been configured with Oracle R Enterprise, and can result in a significant improvement in performance.

For an in-database workflow in an Oracle database, full functionality of the resulting model object downstream only occurs if the Linear Regression tool is connected directly from a Connect In-DB tool with a single full table selected, or if a Write Data In-DB tool is used immediately before the Linear Regression tool to save the estimation data table to the database. Oracle R Enterprise makes use of the estimation data table to provide full model object functionality, such as calculating prediction intervals.

Configuration

  • Model name: Each model needs a name so that it can be identified later. Either provide a name or have one generated automatically. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case-sensitive.

  • Select the target variable: Select the field from the data stream you want to predict.

  • Select the predictor variables: Choose the fields from the data stream you believe "cause" changes in the value of the target variable. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.

  • Omit a model constant: Check this item if you want to omit a constant from the model. Do this only if there is an explicit reason for doing so.

  • Use sampling weights for model estimation: Check the check box and then select a weight field from the data stream to estimate a model that uses sampling weights. If a field is used as both a predictor and the weight variable, the weight variable appears in the model call in the output with the string "Right_" prepended to it.

  • Oracle specific options: This option allows for the configuration of additional options only relevant for the Oracle platform.

    • Save the model to the database: Causes the estimated model object to be saved in the database, and is recommended so that the model objects and estimation tables live together in a centralized location in the Oracle database.

  • Teradata specific configuration: Microsoft Machine Learning Server needs additional configuration information about the specific Teradata platform to be used – in particular, the paths on the Teradata server to R's binary executables, and the location where temporary files that are used by Microsoft Machine Learning Server can be written. This information will need to be provided by a local Teradata administrator.
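
The Omit a model constant and sampling-weight options in the list above can be illustrated for a single predictor. This is a minimal sketch of weighted least squares, not the rxLinMod or ore.lm implementation. The function weighted_fit and its arguments are hypothetical.

```python
def weighted_fit(xs, ys, weights=None, omit_constant=False):
    """Weighted least-squares fit of y = intercept + slope * x.

    weights: per-record importance; None means every record counts equally.
    omit_constant: if True, force the line through the origin.
    Returns (intercept, slope).
    """
    if weights is None:
        weights = [1.0] * len(xs)
    if omit_constant:
        # No intercept: minimize sum of w * (y - slope * x)**2.
        num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
        den = sum(w * x * x for w, x in zip(weights, xs))
        return 0.0, num / den
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    num = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    den = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = num / den
    return mean_y - slope * mean_x, slope


xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]  # exactly y = 1 + 2x
with_constant = weighted_fit(xs, ys, weights=[1.0, 2.0, 3.0])
through_origin = weighted_fit(xs, ys, omit_constant=True)
```

In this data the true line has a nonzero constant, so forcing the fit through the origin changes the slope. That is why the constant should be omitted only for an explicit reason.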

View the Output

Connect a Browse tool to each output anchor to view results.

  • O (Output): Displays the model name and size of the object in the Results window.

  • R (Report): Displays a summary report of the model that includes a summary and plots.