Pearson Correlation Tool
The Pearson Correlation tool uses the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) to measure the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive. It is widely used in the sciences as a measure of the strength of linear dependence between two variables.*
Correlation (often measured as a correlation coefficient, ρ) indicates the strength and direction of a linear relationship between two random variables. Correlation values range from –1.00 (a perfect negative correlation) to +1.00 (a perfect positive correlation). Zero indicates no linear correlation; the variables may still be related in a nonlinear way.
The Pearson coefficient is obtained by dividing the covariance of the two variables by the product of their standard deviations.*
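The definition above can be sketched in a few lines of Python. This is a minimal illustration of the formula only, not the tool's internal implementation; the function name pearson_r and the sample data are hypothetical.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Pearson r: covariance of xs and ys divided by the product of their standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample covariance and sample standard deviations (both use n - 1).
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sx * sy)


# Hypothetical data: y is an exact linear function of x, so r is 1 up to rounding.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

Because the same divisor (n − 1) appears in both the covariance and the standard deviations, it cancels out, so using the population versions (divisor n) would give the same coefficient.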
Configure the tool
- Generate correlation for selected variables: Select two or more fields from the input stream to run the correlation on. Fields must be numeric.
- Specify the type of calculation to run. Choices are:
  - Calculate Correlation: Measures the Pearson correlation between the selected fields.
  - Calculate Covariance: Measures the covariance between the selected fields. The type of covariance is "sample covariance", which is the same as the Excel statistical function COVARIANCE.S.
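As a rough illustration of the covariance option, the sketch below computes the sample covariance (divisor n − 1) for every pair of selected fields. The field names and values are invented for the example, and the dictionary layout is an assumption, not the tool's actual output format.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def covariance_matrix(columns):
    """Return {(field_a, field_b): sample covariance} for every ordered pair of fields."""
    names = list(columns)
    return {
        (a, b): sample_covariance(columns[a], columns[b])
        for a in names
        for b in names
    }


# Hypothetical numeric fields selected for the calculation.
fields = {
    "height": [1, 2, 3, 4],
    "weight": [2, 4, 6, 8],
}
matrix = covariance_matrix(fields)
```

The diagonal entries, such as ("height", "height"), are the sample variances of each field, and the matrix is symmetric.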
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
The Pearson Correlation tool expects non-null values. If the data contains nulls, use the Imputation Tool to replace them before running the correlation.
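To show what replacing nulls before the calculation can look like, here is a minimal sketch of mean imputation, one common strategy. It is an assumption chosen for illustration and does not describe how the Imputation Tool itself is configured or implemented; the function name impute_mean is hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace each None with the mean of the non-null values in the same column."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column that contains only nulls")
    fill = mean(present)
    return [fill if v is None else v for v in values]


# Hypothetical column with one missing value.
cleaned = impute_mean([1.0, None, 3.0])
```

Filling with the mean keeps the column average unchanged but reduces its variance, which can shift the resulting correlation, so the choice of replacement value deserves some thought.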
*http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient