Naive Bayes Classifier Tool
The Naive Bayes Classifier tool creates a binomial or multinomial probabilistic classification model of the relationship between a set of predictor variables and a categorical target variable. The Naive Bayes classifier assumes that all predictor variables are independent of one another given the class. For a sample input, it predicts a probability distribution over the classes of the target variable, that is, the probability that the input belongs to each class.
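The calculation described above can be written out with Bayes' rule. This is the standard textbook form, not a statement of the tool's internal implementation; the symbols $C_k$ (a class of the target variable), $x_1, \dots, x_n$ (the predictor values of one observation), and $K$ (the number of classes) are notation introduced here for illustration.

```latex
P(C_k \mid x_1, \dots, x_n)
  = \frac{P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)}
         {\sum_{j=1}^{K} P(C_j) \prod_{i=1}^{n} P(x_i \mid C_j)}
```

The product over $i$ is where the independence assumption enters: each predictor contributes its own factor, estimated separately from the others.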
One of the main advantages of the Naive Bayes Classifier is that it performs well even with a small training set. This advantage derives from the fact that the Naive Bayes classifier is parameterized by the mean and variance of each variable, estimated independently of all other variables. In many maximum likelihood classification problems, the covariance matrix is needed in order to estimate predicted probabilities, but small training sets can lead to a highly variable covariance matrix estimate which, in turn, can degrade the performance of the maximum likelihood estimator (MLE). Since the Naive Bayes classifier only requires the calculation of a one-dimensional variance for each predictor, the covariance matrix is not needed, and the estimates are much less affected by a small training set.
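As a rough illustration of this parameterization, the sketch below fits a per-class mean and variance for each numeric predictor and scores one observation. It assumes each predictor follows a normal density within a class, which is a common choice but is not confirmed here as what the tool does. The function names and the sample data are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus one mean and one variance per predictor, per class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        columns = list(zip(*members))  # one tuple of values per predictor
        model[label] = {
            "prior": len(members) / len(rows),
            "means": [mean(col) for col in columns],
            # One-dimensional variances only; no covariance matrix is estimated.
            "variances": [pvariance(col) for col in columns],
        }
    return model


def predict_proba(model, row):
    """Return the normalized class probabilities for one observation."""
    log_scores = {}
    for label, params in model.items():
        log_score = math.log(params["prior"])
        for x, mu, var in zip(row, params["means"], params["variances"]):
            # Log of the normal density (assumes var > 0).
            log_score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        log_scores[label] = log_score
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Hypothetical training data: two numeric predictors, two classes.
rows = [(1.0, 20.0), (2.0, 22.0), (3.0, 21.0), (6.0, 30.0), (7.0, 33.0), (8.0, 31.0)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

model = fit_gaussian_nb(rows, labels)
probabilities = predict_proba(model, (2.0, 21.0))
print(probabilities)
```

With d predictors, each class needs only d means and d variances, whereas a full covariance matrix would add d(d-1)/2 covariance terms per class to be estimated from the same small sample.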
The Naive Bayes Classifier is useful when trying to categorize a set of observations according to a target "class" variable, particularly in cases where only a small training set and a small number of predictors are used. Using an initial training set, the Naive Bayes Classifier develops a model for predicting the probability that a given observation belongs to each class of the target variable.
A simple example would be predicting whether someone leasing a new vehicle will purchase that vehicle at the termination of the lease, based on the characteristics of both the vehicle (e.g., pickup/sedan/SUV) and the customer (e.g., gender, age, etc.). The Naive Bayes Classifier would allow the user to "score" future individuals according to the model produced by the training set. This scoring process would result in a set of probabilities: one for purchasing at the end of the lease agreement and one for not purchasing at the end of the lease agreement.
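The scoring step in the lease example can be sketched in a few lines. This is a minimal illustration using categorical predictors and plain frequency counts, not the tool's actual implementation; the field names (`vehicle`, `age_band`, `purchased`), the function names, and the eight training records are all made up for the example.

```python
from collections import Counter, defaultdict


def fit_categorical_nb(records, target):
    """Count classes and, within each class, the values of every predictor."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)  # (class, predictor) -> value counts
    for r in records:
        for field, value in r.items():
            if field != target:
                value_counts[(r[target], field)][value] += 1
    return class_counts, value_counts


def score(model, observation):
    """Return one probability per class for a new observation."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    joint = {}
    for label, n in class_counts.items():
        p = n / total  # class prior
        for field, value in observation.items():
            p *= value_counts[(label, field)][value] / n  # P(value | class)
        joint[label] = p
    norm = sum(joint.values())
    return {label: p / norm for label, p in joint.items()}


# Hypothetical training set of past lease customers.
records = [
    {"vehicle": "SUV", "age_band": "40_plus", "purchased": "yes"},
    {"vehicle": "SUV", "age_band": "under_40", "purchased": "yes"},
    {"vehicle": "sedan", "age_band": "40_plus", "purchased": "yes"},
    {"vehicle": "SUV", "age_band": "40_plus", "purchased": "yes"},
    {"vehicle": "sedan", "age_band": "under_40", "purchased": "no"},
    {"vehicle": "pickup", "age_band": "under_40", "purchased": "no"},
    {"vehicle": "SUV", "age_band": "under_40", "purchased": "no"},
    {"vehicle": "sedan", "age_band": "40_plus", "purchased": "no"},
]

lease_model = fit_categorical_nb(records, target="purchased")
scores = score(lease_model, {"vehicle": "SUV", "age_band": "40_plus"})
print(scores)  # roughly {'yes': 0.9, 'no': 0.1}
```

The two values sum to 1, matching the description above: one probability for purchase and one for no purchase at the end of the lease.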
This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R Tool. See Download and Use Predictive Tools.
- Model name: Each model needs to be given a name so it can later be identified. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed. Because R is case sensitive, model names are case sensitive as well.
- Select the target variable: Select the field from the data stream you want to predict. This target must be a string type.
- Select the predictor variables: Choose the fields from the data stream you believe "cause" changes in the value of the target variable.
Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
- Laplace Smoothing: Choose a positive value as a smoothing parameter. The default is 0, which applies no smoothing. The Laplace Smoothing feature allows the user to "smooth" the data by accounting for class/feature combinations that are either entirely absent from the training set or under-represented in it, and that would therefore be assigned a probability that is either zero or, at the very least, uncharacteristically low (depending on the circumstances). This is useful when attempting to build a classification model using a small training set that may not constitute a sufficiently representative sample of the population.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
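To make the effect of the Laplace Smoothing option concrete, the sketch below applies the usual additive-smoothing formula to a class/feature combination that never occurs in the training set. The formula is the common textbook one and is assumed here rather than taken from the tool's documentation; the function name, its parameters, and the counts are hypothetical.

```python
def smoothed_probability(count, class_total, n_levels, laplace=0.0):
    """Estimate P(value | class) with additive (Laplace) smoothing.

    count:       times the value was seen within the class
    class_total: number of training records in the class
    n_levels:    number of distinct values the predictor can take
    laplace:     smoothing parameter; 0 means plain frequencies
    """
    return (count + laplace) / (class_total + laplace * n_levels)


# Hypothetical counts: among 4 purchasers, 3 leased an SUV, 1 a sedan, 0 a pickup.
unsmoothed = smoothed_probability(0, class_total=4, n_levels=3, laplace=0.0)
smoothed = smoothed_probability(0, class_total=4, n_levels=3, laplace=1.0)

print(unsmoothed)  # 0.0, so any pickup lessee would score zero for "purchase"
print(smoothed)    # 1/7, a small but non-zero probability
```

With a value of 0, a single unseen combination forces the whole product of probabilities for that class to zero. A positive value moves every estimate slightly toward equal probabilities across the predictor's values, and the smoothed estimates for one predictor still sum to 1 within a class.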
View the output
- O anchor: Object. Consists of a table of the serialized model with its model name.
- R anchor: Report. Consists of the report snippets generated by the Naive Bayes Classifier tool: a basic model summary, as well as main effect plots for each class of the target variable.