Distribution Analysis Tool
The Distribution Analysis tool lets you fit one or more distributions to the input data and compare them using a number of Goodness-of-Fit* statistics. Based on the statistical significance (p-values) of these tests, you can determine which distribution best represents the data.
The Distribution Analysis tool can help you understand the overall nature of your data and decide how to analyze it. For instance, data that fits a Normal distribution would likely be well-suited to a Linear Regression, while data that is Gamma-distributed might be better suited to analysis via the Gamma Regression tool.
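The fit-and-compare idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: it fits only the Normal and Lognormal distributions (whose maximum-likelihood estimates have closed forms), uses the Kolmogorov-Smirnov statistic as a single example of a goodness-of-fit measure, and does not compute p-values. The sample data and all function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pstdev


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between the sample's empirical CDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the model CDF with the empirical CDF just before and at x.
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


def fit_normal(sample):
    """Maximum-likelihood Normal fit: sample mean and population std. deviation."""
    return NormalDist(fmean(sample), pstdev(sample))


def fit_lognormal_cdf(sample):
    """Fit a Normal to log(x) and return the implied Lognormal CDF."""
    log_fit = fit_normal([math.log(x) for x in sample])
    return lambda x: log_fit.cdf(math.log(x)) if x > 0 else 0.0


# Hypothetical, strictly positive and right-skewed sample (illustration only).
data = [0.6, 0.8, 0.9, 1.1, 1.2, 1.4, 1.7, 2.1, 2.9, 4.5, 7.3, 12.0]

ks_by_distribution = {
    "Normal": ks_statistic(data, fit_normal(data).cdf),
    "Lognormal": ks_statistic(data, fit_lognormal_cdf(data)),
}
# The smallest distance marks the closest fit among the candidates.
best_fit = min(ks_by_distribution, key=ks_by_distribution.get)
print(ks_by_distribution, best_fit)
```

A smaller statistic indicates a closer fit. Note that when the parameters are estimated from the same data being tested, as they are here, standard Kolmogorov-Smirnov p-values tend to be too optimistic, which is one reason to compare several statistics rather than rely on one.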
This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R Tool. See Download and Use Predictive Tools.
Configure the tool
Use the Configuration tab to set the mandatory controls for a distribution analysis.
- Select a field for analysis: Select a field from the incoming data for analysis.
- Select distributions for comparison: Select one or more distributions to compare. The distribution options are as follows:
  - Normal: A commonly occurring continuous probability distribution that is often used in both the natural and social sciences to represent real-valued random variables (i.e., continuous random variables that can take both positive and negative values).
  - Lognormal: A continuous probability distribution of a random variable whose logarithm is normally distributed. This distribution is well-suited to describing natural phenomena such as growth rates and size distributions. It is also often used to describe the income distribution in a sufficiently large population.
  - Weibull: A relatively flexible distribution that is closely related to the exponential distribution. It is frequently found in data that describes "failure" rates of some kind, e.g., random mechanical failure, mortality, churn, or mechanical wear-out rates.
  - Gamma: A continuous probability distribution characterized by a significant concentration of cases at lower, non-negative (and not necessarily integer) values, while still allowing for the reasonable possibility of much higher values. The Gamma distribution has a wide range of uses and is commonly found in data that describes aggregate (or average) amounts per case, e.g., the average size of an insurance claim, measured per individual.
The Lognormal, Weibull, and Gamma distributions only work for non-negative data.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
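The two data caveats above can be checked before a field is handed to the tool. The sketch below is an illustration only: both function names are hypothetical, and the identifier test is a rough heuristic (all values distinct and integer-valued), not a rule the tool itself applies.

```python
def eligible_distributions(values):
    """Return the distribution options a numeric field can be fitted against."""
    options = ["Normal"]
    # Lognormal, Weibull, and Gamma only work for non-negative data.
    if all(v >= 0 for v in values):
        options += ["Lognormal", "Weibull", "Gamma"]
    return options


def looks_like_identifier(values):
    """Heuristic (an assumption): every value is distinct and integer-valued."""
    return (
        len(values) > 1
        and len(set(values)) == len(values)
        and all(float(v).is_integer() for v in values)
    )


print(eligible_distributions([1.5, 0.0, 3.2]))
print(eligible_distributions([-1.0, 2.0]))
print(looks_like_identifier([1001, 1002, 1003]))
```

Uniqueness on its own is not enough to flag a key, because continuous measurements are usually all distinct as well, so a field flagged by this heuristic still deserves a manual look.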
Use the Graphics Options tab to set the controls for the graphical output.
- Plot size: Select inches or centimeters for the size of the graph.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
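The relationship between plot size, units, and resolution comes down to simple arithmetic: pixels = inches x dots per inch. The sketch below works that out for the three resolution settings listed above. The function name, its parameters, and the example plot dimensions are hypothetical and are not part of the tool.

```python
CM_PER_INCH = 2.54
# Resolution settings and their dots-per-inch values, as listed above.
DPI_BY_RESOLUTION = {"1x": 96, "2x": 192, "3x": 288}


def plot_pixels(width, height, units="inches", resolution="1x"):
    """Return the (width, height) of the rendered graph in pixels."""
    if units == "centimeters":
        width, height = width / CM_PER_INCH, height / CM_PER_INCH
    elif units != "inches":
        raise ValueError("units must be 'inches' or 'centimeters'")
    dpi = DPI_BY_RESOLUTION[resolution]
    return round(width * dpi), round(height * dpi)


print(plot_pixels(5.5, 5.5))                         # 5.5 in at 96 dpi
print(plot_pixels(5.5, 5.5, resolution="3x"))        # same size at 288 dpi
print(plot_pixels(12.7, 12.7, "centimeters", "2x"))  # 12.7 cm is 5 in
```

Tripling the resolution triples each pixel dimension, so the image holds nine times as many pixels, which is why the higher settings produce noticeably larger files.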
View the output
The tool outputs a set of report snippets that includes a histogram, basic summary statistics of the test results, goodness-of-fit statistics, data quantiles per distribution, and the distribution parameters.
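To give a sense of what a quantile comparison conveys, the sketch below sets the quartiles of a sample beside the quartiles implied by a fitted Normal distribution. It is an assumption about the general idea only: the exact layout and quantile levels in the tool's report may differ, and the sample data is hypothetical.

```python
from statistics import NormalDist, quantiles

# Hypothetical sample (illustration only).
data = [0.6, 0.8, 0.9, 1.1, 1.2, 1.4, 1.7, 2.1, 2.9, 4.5, 7.3, 12.0]

# Quartiles observed in the data.
empirical = quantiles(data, n=4, method="inclusive")

# Quartiles implied by a Normal distribution fitted to the same data.
fitted = NormalDist.from_samples(data)
theoretical = fitted.quantiles(n=4)

for label, observed, implied in zip(("25%", "50%", "75%"), empirical, theoretical):
    print(f"{label}: data={observed:.2f}  fitted Normal={implied:.2f}")
```

Large gaps between the two columns suggest that the fitted distribution does not describe the data well in that part of its range.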
*D'Agostino, R., Stephens, M.A. (1986) Goodness of Fit Techniques.