Text Classification

The Text Classification tool trains a text classification model on your training data and outputs it. Connect the model to the Predict tool to classify new, unseen text data.

Alteryx Intelligence Suite Required

This tool is part of Alteryx Intelligence Suite. Intelligence Suite requires a separate license and add-on installer to Designer. After you install Designer, install Intelligence Suite and start your free trial.

Language Support

The Text Classification tool supports English, French, German, Italian, Portuguese, and Spanish.

Tool Components

The Text Classification tool has 4 anchors (2 inputs and 2 outputs):

T input anchor: Use the T input anchor to connect your training data. The training data must have a column with text and a column with the text’s label.
V input anchor: Use the V input anchor to connect validation text and labels.
M output anchor: Use the M output anchor to pass the model you've built downstream. Use your model with the Predict tool.
E output anchor: Use the E output anchor to collect evaluation metrics of your model.

Configure the Tool

1. Add a Text Classification tool to the canvas.
2. Connect the T input anchor to your training data. Then configure the Training Text settings:
   1. Select the Column with Text that contains the training text data.
   2. Select the Column with Labels that contains the labels for your training text data.
3. Connect the V input anchor to your validation data. Then configure the Validation settings:
   1. Select the Column with Text that contains the validation text data.
   2. Select the Column with Labels that contains the labels for your validation text data.
4. Configure the Advanced Options to match your use case. Refer to the next section for details.
5. Run the workflow.

Important

The columns you select must be a String data type.

Advanced Options

Choose the Algorithm you want to use for your model:

Auto Mode
Multinomial Naïve Bayes
Linear SVC

Auto Mode

Auto Mode searches for an optimal model among the available algorithms: Multinomial Naïve Bayes and Linear SVC. For each algorithm, it searches through a small range of the corresponding parameters. Auto Mode then outputs the best combination of algorithm and hyperparameters. To fine-tune your model yourself, choose one of the specific algorithms instead.

Multinomial Naïve Bayes

The multinomial Naïve Bayes algorithm is a probabilistic classification model. The Naïve Bayes classifier builds a model that predicts the probability that a piece of text belongs to a label. To build your model, use training data in the form of rows of text and their associated labels (also known as classes or targets). The algorithm assumes that all features are independent of one another. The advantages of the Naïve Bayes classifier are that it is scalable and generally performs well with a small training set.

Alpha

Alpha is an additive smoothing parameter that you can use to control model complexity. A value of 0 indicates no smoothing. A value greater than 0 might improve your results if a word in the test data doesn't exist in the training data.
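To make the effect of Alpha concrete, here is a minimal Python sketch of additive smoothing for a single label. It is an illustration only, not the tool's implementation: the function name, the toy tokens, and the vocabulary are all hypothetical.

```python
from collections import Counter


def smoothed_word_probability(word, class_tokens, vocabulary, alpha):
    """Estimate P(word | label) with additive smoothing.

    Illustrative helper (not part of the tool): alpha is added to every
    word count, so a word never seen for this label still gets a nonzero
    probability whenever alpha > 0.
    """
    counts = Counter(class_tokens)
    total = len(class_tokens)
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))


# Hypothetical toy data: tokens from training text labeled "positive".
positive_tokens = ["great", "great", "good", "fine"]
vocabulary = {"great", "good", "fine", "awful"}

# "awful" never appears in the positive training text.
no_smoothing = smoothed_word_probability("awful", positive_tokens, vocabulary, alpha=0)
with_smoothing = smoothed_word_probability("awful", positive_tokens, vocabulary, alpha=1)
```

With Alpha set to 0, the unseen word gets a probability of 0, which rules out the label for any text that contains it. With Alpha set to 1, the word keeps a small nonzero probability.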

The tool searches for the best model based on a range of Alpha values that you define. To create these Alpha values, enter the range you want to search (From–To) and the Number of Steps within that range.

Example 1

From = 0, To = 1, Number of Steps = 5 → Create these Alpha values for the model to try: [0, 0.25, 0.5, 0.75, 1].

Example 2

From = 0, To = 1, Number of Steps = 2 → Create these Alpha values for the model to try: [0, 1].
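Both examples are consistent with evenly spaced values that include the From and To endpoints. The Python sketch below reproduces them under that assumption; the function name alpha_values is hypothetical and not part of the tool.

```python
def alpha_values(start, stop, steps):
    """Return `steps` evenly spaced values from start to stop, inclusive.

    Assumption: the grid includes both endpoints, which matches
    Example 1 and Example 2 above.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if steps == 1:
        return [float(start)]
    width = (stop - start) / (steps - 1)
    return [start + i * width for i in range(steps)]


example_1 = alpha_values(0, 1, 5)
example_2 = alpha_values(0, 1, 2)
```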

Cross-Validation

Cross-Validation is a resampling technique that uses different portions (or folds) of your data for model training and validation. Choose how many folds to use during cross-validation.
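As a rough illustration of how folds partition the data, the Python sketch below splits row indices into folds and pairs each validation fold with the remaining training rows. It assumes a plain split with no shuffling or stratification; the tool's exact scheme is not described here, and the function name is hypothetical.

```python
def k_fold_indices(n_rows, n_folds):
    """Return a list of (training_indices, validation_indices) pairs.

    Illustrative only: every row lands in exactly one validation fold,
    and the other folds form the training set for that round.
    """
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Deal rows into folds round-robin so fold sizes differ by at most 1.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(n_rows=10, n_folds=5)
```

More folds mean more training rounds, each validated on a smaller slice of the data.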

Term Frequency-Inverse Document Frequency (TF-IDF)

Converting raw text into numerical data is a required step for text classification. This vectorization step allows the model to interpret your data. The Text Classification tool uses a Term Frequency-Inverse Document Frequency (TF-IDF) vectorization technique. These are the TF-IDF settings:

Analyzer
Choose to create features from words (word) or characters (char) based on your input text.
Min. Document Frequency
Enter the minimum frequency of allowable terms in your text data. The tool won't add terms below this frequency to the algorithm's vocabulary.
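The Python sketch below shows how these two settings could shape a TF-IDF vocabulary. It makes several assumptions that the tool may not share: it uses one common TF-IDF variant (term count times ln(N / document frequency) + 1), treats Min. Document Frequency as an absolute document count, and simplifies the char analyzer to single characters. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text, analyzer="word"):
    """Split text into word or single-character features (simplified)."""
    text = text.lower()
    if analyzer == "word":
        return text.split()
    return [ch for ch in text if not ch.isspace()]


def tfidf(documents, analyzer="word", min_df=1):
    """Return (vocabulary, vectors) for a list of text documents."""
    tokenized = [tokenize(doc, analyzer) for doc in documents]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Terms below the minimum document frequency never enter the vocabulary.
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_df)
    n_docs = len(documents)
    idf = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Hypothetical toy corpus.
documents = ["good service", "good food", "slow service"]
vocabulary, vectors = tfidf(documents, analyzer="word", min_df=2)
```

In this toy corpus, "food" and "slow" each appear in only one document, so a minimum of 2 leaves only "good" and "service" as features.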

Linear SVC

Linear SVC belongs to the Support Vector Machine class of models. You can apply this algorithm to data with 2 (binary) or more classes. When fitted to your data, the model finds the hyperplane that best divides your data into the correct categories. Linear SVC is effective in high-dimensional spaces such as text. However, it might be slow when applied to a large training dataset.

Penalty

Choose the norm used in the penalization. The L2 norm (also known as the Euclidean norm) is the standard used in Support Vector Classification. The L1 norm results in sparse coefficient vectors, where many coefficients are exactly 0.
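For reference, the two norms can be computed as in the Python sketch below. The coefficient vector is a made-up example and the function names are hypothetical; the sketch shows only the norms themselves, not how the tool applies them during training.

```python
import math


def l1_norm(coefficients):
    """Sum of absolute values of the coefficients."""
    return sum(abs(c) for c in coefficients)


def l2_norm(coefficients):
    """Euclidean length of the coefficient vector."""
    return math.sqrt(sum(c * c for c in coefficients))


# Hypothetical coefficient vector.
coefficients = [3.0, -4.0, 0.0]
l1 = l1_norm(coefficients)
l2 = l2_norm(coefficients)
```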

Loss

Choose a loss function. Hinge is the standard choice for this algorithm.
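As a point of reference, the standard hinge loss for a single example can be written as the short Python sketch below. It assumes labels encoded as +1 or -1 and a signed model score; the function name is hypothetical, and this is not the tool's code.

```python
def hinge_loss(label, score):
    """Hinge loss for one example.

    label: +1 or -1 (assumed encoding).
    score: signed model output; its sign is the predicted side of the hyperplane.
    """
    return max(0.0, 1.0 - label * score)


confident_correct = hinge_loss(1, 2.0)   # correct and outside the margin
inside_margin = hinge_loss(1, 0.5)       # correct but inside the margin
wrong_side = hinge_loss(-1, 0.5)         # on the wrong side of the hyperplane
```

The loss is 0 only when a prediction is correct by a margin of at least 1, so examples near the boundary still contribute to training.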

C (Log Range)

C is a regularization parameter. It must be greater than 0. Large values of C correspond to less regularization and a model that attempts a close fit to the training data. In contrast, small values of C correspond to increased regularization.

The tool searches for the best model based on a range of C values that you define. To create these C values, enter the log range you want to search (From–To) and the Number of Steps within that range.
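One plausible reading of a log range is that From and To are exponents and the candidate C values are spaced evenly on a logarithmic scale. The Python sketch below follows that reading with base 10. Both the base and the exponent interpretation are assumptions rather than documented behavior, and the function name is hypothetical.

```python
def c_values(log_start, log_stop, steps, base=10.0):
    """Return `steps` values of C spaced evenly on a log scale.

    Assumption: log_start and log_stop are exponents of `base`, and
    both endpoints are included.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if steps == 1:
        return [base ** log_start]
    width = (log_stop - log_start) / (steps - 1)
    return [base ** (log_start + i * width) for i in range(steps)]


# Under these assumptions, From = -2, To = 2, Number of Steps = 5
# yields one candidate per power of ten from 0.01 to 100.
candidates = c_values(-2, 2, 5)
```

Searching on a log scale covers several orders of magnitude with only a few steps, and every resulting value stays greater than 0.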