The Count Regression tool creates a regression model that relates a non-negative integer field of interest (a target variable taking values 0, 1, 2, 3, and so on) to one or more fields that are expected to influence the target variable, often called predictor variables. Common use cases include the number of visits a customer makes to a particular restaurant in a given month, or the number of phone numbers associated with a particular mobile telephone account. In these use cases, a linear model produces biased estimates. The two best-known count regression models are the Poisson* and negative binomial** models. Given a set of predictor fields, a count regression model lets a user estimate the expected number of events (e.g., store visits) for an observation unit (e.g., a customer).
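
As a rough illustration of what such a model estimates, the standard Poisson regression can be written as follows. This assumes the log link, which is the default for the Poisson family in R's glm; this page does not state which link the tool uses. The symbols y_i (target), x_ij (predictors), and beta_j (coefficients) are notation introduced here.

```latex
\log \mathbb{E}[\, y_i \mid x_{i1}, \dots, x_{ip} \,]
  = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\qquad\Longleftrightarrow\qquad
\mathbb{E}[\, y_i \mid x_{i1}, \dots, x_{ip} \,]
  = \exp\!\left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right)
```

Because the expected count is an exponential of the linear predictor, it is always positive, which a plain linear model cannot guarantee.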

The Poisson regression model makes a strong assumption about the relationship between the mean and variance of the target field: specifically, that they are equal. The quasi-Poisson model was developed to relax this assumption. It allows the variance to differ from the mean, but at the expense of not having defined information criteria measures (such as AIC), so a quasi-Poisson model cannot be used as the starting point for stepwise variable selection. The negative binomial regression model has well-defined information criteria and also allows the mean and variance of the underlying distribution to differ, so it is typically preferred. Note that a Poisson regression model estimated on data where the mean and variance differ still provides unbiased estimates of the mean and the corresponding model coefficients, but the tests of statistical significance are biased.
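
The mean and variance relationships described above can be sketched in a few lines of Python. This is an illustration only, not the tool's code. The function names and the sample data are made up, and the negative binomial formula assumes the parametrization used by R's MASS package (variance = mu + mu^2 / theta); whether the tool uses exactly that form is an assumption. The ratio computed here is a crude marginal check that ignores predictors.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    A value near 1 is consistent with the Poisson assumption;
    a value well above 1 suggests overdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero, so the ratio is undefined")
    return variance(counts) / m


def poisson_variance(mu):
    # Poisson: the variance equals the mean.
    return mu


def quasi_poisson_variance(mu, phi):
    # Quasi-Poisson: the variance is the mean scaled by a dispersion phi.
    return phi * mu


def negative_binomial_variance(mu, theta):
    # Assumed MASS-style parametrization: larger theta means the
    # variance is closer to the Poisson case.
    return mu + mu ** 2 / theta


# Hypothetical monthly visit counts for ten customers.
visits = [0, 1, 0, 2, 9, 0, 1, 12, 0, 3]
ratio = dispersion_ratio(visits)
print(f"variance / mean = {ratio:.2f}")
```

A ratio far above 1, as with this made-up sample, is the situation in which a quasi-Poisson or negative binomial model would usually be considered instead of a plain Poisson model.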

With this tool, if the input data comes from a regular Alteryx data stream, the open source R glm function is used for model estimation. If the input comes from either an XDF Output or XDF Input tool, the Revo ScaleR rxGlm function is used instead. The advantage of the Revo ScaleR based function is that it allows much larger (out-of-memory) datasets to be analyzed. The costs are the additional overhead of creating an XDF file, the inability to create some of the model diagnostic output that is available with the open source R functions, and the restriction to Poisson regression models only.
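
To give a feel for what model estimation involves, here is a minimal pure-Python sketch of fitting a one-predictor Poisson regression with a log link, log E[y] = b0 + b1 * x, by Newton-Raphson. For this model the method coincides with the iteratively reweighted least squares approach that R's glm uses. It is not the tool's implementation; the function name `fit_poisson` and the data are hypothetical.

```python
import math


def fit_poisson(x, y, iterations=25):
    """Fit log E[y] = b0 + b1 * x by Newton-Raphson.

    Assumes x is not constant (otherwise the 2x2 system is singular).
    """
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: add inverse(information) times score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up data: counts that grow roughly exponentially with x.
x = [0, 1, 2, 3, 4]
y = [1, 2, 4, 7, 12]
b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

At the solution the fitted means sum to the observed total, a property of Poisson regression with an intercept and log link.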

This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.

The tool requires an Alteryx data stream or XDF metadata stream that includes a target field of interest along with one or more possible predictor fields.

- Model name: Each model needs to be given a name so it can later be identified. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case sensitive.
- Select the target variable: Select the field from the data stream you want to predict.
- Select the predictor variables: Choose the fields from the data stream you believe "cause" changes in the value of the target variable.
- Model type: Select Poisson, Quasi-Poisson, or Negative binomial. If Negative binomial is selected, the user can specify the value of theta (which is closely linked to the model variance). The best value of theta can be estimated from the data if the default "auto" option is used.
- Use sampling weights in model estimation (Optional): Click the check box and then select a weight field from the data stream to estimate a model that uses sampling weights. This option is not available if the selected model type is negative binomial and the value of theta is determined using the "auto" option, but it does work if a specific value of theta is provided (which can be based on an initial run of the model that did not use sampling weights).
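
The naming rule given for the Model name option can be checked ahead of time with a small helper. This is a sketch only: `is_valid_model_name` is a hypothetical function, not part of Alteryx, and it assumes that "letters" means ASCII letters.

```python
import re

# A letter first, then any mix of letters, digits, "." and "_".
# Assumption: only ASCII letters count as letters.
_MODEL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name: str) -> bool:
    """Return True if the name follows the rule described above."""
    return _MODEL_NAME.fullmatch(name) is not None


for candidate in ["Visits_Model.1", "1st_model", "my-model"]:
    print(candidate, is_valid_model_name(candidate))
```

Because R is case sensitive, names that differ only in case, such as `Visits` and `visits`, are treated as different names.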

Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.

Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.

- Output: Consists of a table of the serialized model with its model name.
- R Output: Consists of the report snippets generated by the Count Regression tool: a statistical summary, a Type II Analysis of Deviance (ANOD), and Basic Diagnostic Plots. The Type II Analysis of Deviance table and the Basic Diagnostic Plots are not produced when the model input comes from an XDF Output or XDF Input tool.

*en.wikipedia.org/wiki/Poisson_regression

**en.wikipedia.org/wiki/Negative_binomial_distribution

©2017 Alteryx, Inc., all rights reserved. Allocate®, Alteryx®, Guzzler®, and Solocast® are registered trademarks of Alteryx, Inc.