
# Distribution Analysis Tool

The Distribution Analysis tool allows you to fit one or more distributions to the input data and compare them based on a number of Goodness-of-Fit* statistics. Based on the statistical significance (p-values) of these tests, you can determine which distribution best represents the data.
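As a rough illustration of comparing candidate distributions by a goodness-of-fit statistic, the sketch below fits a Normal and a Lognormal to a made-up sample and computes the Kolmogorov-Smirnov distance for each. It is not the tool's implementation: the choice of statistic, the sample values, and the helper names (`ks_statistic`, `fit_normal`, `fit_lognormal_cdf`) are assumptions for illustration, and the sketch stops short of computing p-values. Python's standard library is used only to keep the example self-contained.

```python
import math
from statistics import NormalDist, fmean, pstdev


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of `data` and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        model = cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, model - i / n, (i + 1) / n - model)
    return gap


def fit_normal(data):
    """Maximum-likelihood Normal fit (mean and population std dev)."""
    return NormalDist(fmean(data), pstdev(data))


def fit_lognormal_cdf(data):
    """CDF of a Lognormal fitted by fitting a Normal to the logged data."""
    log_fit = fit_normal([math.log(x) for x in data])
    return lambda x: log_fit.cdf(math.log(x)) if x > 0 else 0.0


# Hypothetical right-skewed sample (made-up values, not real data).
sample = [1.2, 1.5, 1.9, 2.3, 2.4, 3.1, 3.8, 5.0, 7.4, 12.6]

ks_normal = ks_statistic(sample, fit_normal(sample).cdf)
ks_lognormal = ks_statistic(sample, fit_lognormal_cdf(sample))

# A smaller distance means the fitted curve tracks the data more closely.
best = "Lognormal" if ks_lognormal < ks_normal else "Normal"
```

For this skewed sample the Lognormal fit has the smaller distance, so `best` is `"Lognormal"`. A full comparison would also turn each statistic into a p-value, which this sketch leaves out.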

The Distribution Analysis tool can be helpful when trying to understand the overall nature of your data and when deciding how to analyze it. For instance, data that fits a Normal distribution would likely be well-suited to a Linear Regression, while Gamma-distributed data might be better suited to analysis via the Gamma Regression tool.

This tool uses the R programming language. Go to Options > Download Predictive Tools to install R and the packages used by the R Tool.

## Input

An Alteryx data stream with continuous data.

## Configuration Properties

### Configuration

1. Select a field for analysis: Select a field from the incoming data for analysis.
2. Select distributions for comparison: Select one or more distributions to compare. The distribution options are as follows:
• Normal: A commonly occurring continuous probability distribution that is often used in both the natural and social sciences to represent real-valued random variables (i.e. continuous random variables that can take both positive and negative values).
• Lognormal: A continuous probability distribution of a random variable whose logarithm is normally distributed. This distribution is well-suited to the description of natural phenomena such as growth rate and size distributions. In addition, it is often used to describe income distribution in a sufficiently large population.
• Weibull: A relatively flexible distribution that is closely related to the exponential distribution. It is frequently found in data that describes "failure" rates of some kind, e.g. random mechanical failure, mortality, churn, mechanical wear-out rates, etc.
• Gamma: A continuous probability distribution over non-negative real values, in which most cases are concentrated at lower values while much higher values remain reasonably possible. The Gamma distribution has a wide range of uses, and is commonly found in data that describes aggregate (or average) amounts per case, e.g. the average size of an insurance claim, measured per individual.
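Two of the relationships mentioned in the list above can be checked numerically: a Weibull with shape 1 is an exponential distribution, and a Lognormal variable is one whose logarithm is Normal. The sketch below uses the standard textbook CDF formulas; the function names and the parameter values are chosen for illustration only and say nothing about how the tool parameterizes its fits.

```python
import math
from statistics import NormalDist


def weibull_cdf(x, shape, scale):
    """Weibull CDF: 1 - exp(-(x / scale) ** shape) for x >= 0."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x >= 0 else 0.0


def exponential_cdf(x, rate):
    """Exponential CDF: 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF: the Normal(mu, sigma) CDF evaluated at log(x)."""
    return NormalDist(mu, sigma).cdf(math.log(x)) if x > 0 else 0.0


# Illustrative parameter values (assumptions, not tool defaults).
scale = 2.0
points = [0.5, 1.0, 2.0, 4.0]

# With shape = 1 the Weibull matches an exponential with rate 1 / scale.
weibull_is_exponential = all(
    math.isclose(weibull_cdf(x, 1.0, scale), exponential_cdf(x, 1.0 / scale))
    for x in points
)

# The Lognormal median is exp(mu): half the probability lies below it.
mu, sigma = 1.0, 0.5
prob_below_median = lognormal_cdf(math.exp(mu), mu, sigma)
```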

Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.

The Lognormal, Weibull, and Gamma distributions ONLY work for non-negative data.
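A simple pre-check along these lines can show which distributions are worth selecting for a given field. This is a minimal sketch that mirrors the rule stated above; the function name `eligible_distributions` and the sample values are hypothetical, and the tool's own validation may behave differently.

```python
def eligible_distributions(values):
    """Return the distributions that the stated rule allows for `values`."""
    candidates = ["Normal"]  # Normal accepts positive and negative values
    if min(values) >= 0:
        # Per the note above, these three need non-negative data.
        candidates += ["Lognormal", "Weibull", "Gamma"]
    return candidates


# Hypothetical fields (made-up values for illustration).
claim_amounts = [120.5, 89.0, 310.2, 45.9]
temperature_changes = [-1.4, 0.3, 2.1, -0.7]

for_claims = eligible_distributions(claim_amounts)
for_changes = eligible_distributions(temperature_changes)
```

Here `for_claims` lists all four distributions, while `for_changes` contains only `"Normal"` because the field includes negative values.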

### Graphics Options

• Plot size: Configure the dimensions of the probability density graph that is created.
• Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
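The effect of the resolution setting on file size follows from simple arithmetic: pixel dimensions are the plot size multiplied by the dots per inch. The sketch below assumes the plot size is given in inches and uses a hypothetical 5 x 5 inch plot; the helper `pixel_size` is illustrative only.

```python
def pixel_size(width_in, height_in, multiplier):
    """Pixel dimensions of a plot rendered at multiplier x 96 dpi."""
    dpi = 96 * multiplier
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical 5 x 5 inch plot at each of the three resolution settings.
sizes = {m: pixel_size(5.0, 5.0, m) for m in (1, 2, 3)}

# Ratio of total pixels between the 3x and 1x settings.
pixel_ratio = (sizes[3][0] * sizes[3][1]) / (sizes[1][0] * sizes[1][1])
```

Under these assumptions the 1x plot is 480 x 480 pixels and the 3x plot is 1440 x 1440, nine times as many pixels, which is why higher resolutions produce larger files.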

## Output

A set of report snippets that includes a histogram, basic summary statistics of the test results, goodness of fit statistics, data quantiles per distribution, and the distribution parameters.
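To give a sense of what "data quantiles per distribution" can look like, the sketch below places the quartiles of a made-up sample next to the quartiles implied by a fitted Normal and a fitted Lognormal. The layout of the tool's actual report is not reproduced here; the sample, the fitting approach, and the `table` structure are assumptions for illustration.

```python
import math
from statistics import NormalDist, fmean, pstdev, quantiles

# Hypothetical right-skewed sample (made-up values, not real data).
sample = [1.2, 1.5, 1.9, 2.3, 2.4, 3.1, 3.8, 5.0, 7.4, 12.6]
probs = [0.25, 0.50, 0.75]

# Fit a Normal to the raw data and another Normal to the logged data.
normal_fit = NormalDist(fmean(sample), pstdev(sample))
logs = [math.log(x) for x in sample]
log_fit = NormalDist(fmean(logs), pstdev(logs))

table = {
    "Data": quantiles(sample, n=4),  # three cut points: 25%, 50%, 75%
    "Normal": [normal_fit.inv_cdf(p) for p in probs],
    # A Lognormal quantile is the exponential of the matching Normal quantile.
    "Lognormal": [math.exp(log_fit.inv_cdf(p)) for p in probs],
}
```

Reading across one probability level shows how closely each fitted distribution reproduces the data; for a right-skewed sample like this one, the Normal median sits at the sample mean and so lies above the Lognormal median.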

*D'Agostino, R.B., and Stephens, M.A. (1986). Goodness-of-Fit Techniques. New York: Marcel Dekker.