Use Topic Modeling to identify and categorize topics in a body of text. Consider using the Text Pre-processing tool upstream before passing data into the Topic Modeling tool.
This tool is part of Alteryx Intelligence Suite. Intelligence Suite requires a separate license and an add-on installer for Designer. After you install Designer, install Intelligence Suite and start your free trial.
The Topic Modeling tool supports English, French, German, Italian, Portuguese, and Spanish.
The Topic Modeling tool has 4 anchors:
- Input anchor: Use the input anchor to connect the text data you want to analyze.
- D anchor: Use the D anchor to pass the data you've analyzed downstream.
- R anchor: Use the R anchor to view a report of the analysis.
- M anchor: Use the M anchor to pass the model object downstream for use with new data. The model object is compatible with the Predict tool.
Configure the Tool
- Add a Topic Modeling tool to the canvas.
- Use the input anchor to connect the Topic Modeling tool to the text data you want to use in the workflow.
- Select the Text Field you want to analyze.
- Specify the Number of Topics you want to model.
- In the Output Options section, select the kind of output you want in the R anchor:
  - The Interactive Chart option generates an interactive report that includes two charts: Top-30 Most Salient Terms and Intertopic Distance Map.
  - The Word Relevance Summary option generates a static report with measures of each term's salience to the model and relevance to each topic.
- The Dictionary Options and LDA Options are at their default values. For more information about these options, see the Advanced Options section below.
- Run the workflow.
This tool uses latent Dirichlet allocation (LDA) to identify topics.
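To build intuition for how LDA treats the topics in a document, the sketch below simulates per-document topic mixtures drawn from a symmetric Dirichlet prior, which is the standard LDA formulation. It is only an illustration, not the tool's implementation: the helper names `sample_dirichlet` and `mean_dominant_share` are hypothetical, and it assumes the tool's Alpha option plays the role of the Dirichlet concentration parameter.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_dominant_share(alpha, k=5, n_docs=2000, seed=0):
    """Average weight of the largest topic across simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs))
    return sum(shares) / n_docs

# Assumption: alpha here stands in for the tool's Alpha option.
low = mean_dominant_share(alpha=0.1)    # few topics expected per document
high = mean_dominant_share(alpha=10.0)  # many topics expected per document
print(f"alpha=0.1:  average dominant-topic share = {low:.2f}")
print(f"alpha=10.0: average dominant-topic share = {high:.2f}")
```

With a small concentration value, one topic tends to dominate each simulated document. With a large value, the weight spreads more evenly across the topics.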
Advanced Options
The Topic Modeling tool has some advanced options.
|Min Frequency||Min Frequency is the lowest frequency a word can have and still be included in the model. The Topic Modeling tool ignores words that appear less frequently than this value. Frequency is the number of documents that contain the word divided by the total number of documents in the body of text.||
|Max Frequency||Max Frequency is the highest frequency a word can have and still be included in the model. The Topic Modeling tool ignores words that appear more frequently than this value. Frequency is the number of documents that contain the word divided by the total number of documents in the body of text.||
|Max Words||Max Words specifies the maximum number of words the Topic Modeling tool's algorithm considers. Words are selected based on how frequently they appear across all the documents.||
|Alpha||Alpha represents the density of topics the algorithm should expect in each document. Increasing Alpha allows the algorithm to recognize a greater number of distinct topics in a document. Decreasing Alpha limits the number of topics the algorithm recognizes in each document.||Number||None|
|Eta||Eta represents the density of words needed to make up a topic. Increasing Eta increases the number of words needed to identify a topic. Decreasing Eta reduces the number of words needed to identify a topic.||Number||>= 0|
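As a rough illustration of the Min Frequency, Max Frequency, and Max Words options, the sketch below filters a small set of tokenized documents by document frequency. It is not the tool's implementation: the function name `build_dictionary` is hypothetical, and the inclusive bounds and the tie-breaking order are assumptions.

```python
from collections import Counter

def build_dictionary(docs, min_frequency=0.0, max_frequency=1.0, max_words=None):
    """Return the words kept after document-frequency filtering.

    docs: list of token lists, one list per document.
    Frequency = documents containing the word / total documents.
    """
    n_docs = len(docs)
    doc_counts = Counter()
    for tokens in docs:
        doc_counts.update(set(tokens))  # count each word once per document

    # Assumption: both bounds are inclusive.
    kept = {
        word: count
        for word, count in doc_counts.items()
        if min_frequency <= count / n_docs <= max_frequency
    }
    # Rank by document frequency (ties broken alphabetically), then apply the cap.
    ranked = sorted(kept, key=lambda w: (-kept[w], w))
    return ranked if max_words is None else ranked[:max_words]

docs = [
    ["the", "battery", "drains", "fast"],
    ["the", "screen", "is", "bright"],
    ["the", "battery", "life", "is", "short"],
    ["the", "screen", "cracked"],
]
print(build_dictionary(docs, min_frequency=0.5, max_frequency=0.9))
# prints ['battery', 'is', 'screen']
```

In this example, "the" appears in every document and exceeds the maximum frequency, while words that appear in only one document fall below the minimum frequency.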
The D anchor outputs a new column for each topic. Each column contains the degree to which that topic is present in the text in each row. A higher value in a topic column indicates a greater probability that the text is associated with that topic.
The R anchor outputs one of two reports based on your selection:
- Interactive Chart: Returns an interactive visualization of the model that you can view with a Browse tool. The Interactive Chart has 2 parts: a map of the distances between the topics and metrics for evaluation. The Intertopic Distance Map shows how similar the identified topics are to each other.
- Word Relevance Summary: Returns the words included in the topic model as well as Relevance and Saliency metrics. Saliency is how prominent the word is in the overall text. Relevance is a metric that orders the words within each topic and helps to identify the most appropriate words for each topic. The higher a word's relevance value for a given topic, the more important that word is for that topic.
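The sketch below shows one common way to compute these two metrics. It assumes the definitions used by LDAvis-style reports (saliency as word probability times distinctiveness, relevance as a weighted mix of log probability and log lift). The source does not state the tool's exact formulas, so the function names, the toy probabilities, and the weight `lam=0.6` are all illustrative assumptions.

```python
import math

def relevance(p_w_given_t, p_w, lam=0.6):
    """Weighted mix of log p(w|t) and log lift p(w|t)/p(w); lam is an assumed weight."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

def saliency(p_w, p_t_given_w, p_t):
    """Word probability times how unevenly the word is spread across topics."""
    distinctiveness = sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_t_given_w, p_t)
        if ptw > 0
    )
    return p_w * distinctiveness

# Toy model: two equally likely topics with hand-picked word probabilities.
p_t = [0.5, 0.5]
topic_word = [
    {"battery": 0.6, "phone": 0.4},  # topic 0
    {"screen": 0.6, "phone": 0.4},   # topic 1
]
words = sorted({w for dist in topic_word for w in dist})

# Overall word probability p(w) and topic posterior p(t|w) for each word.
p_w = {w: sum(pt * dist.get(w, 0.0) for pt, dist in zip(p_t, topic_word)) for w in words}
p_t_given_w = {
    w: [pt * dist.get(w, 0.0) / p_w[w] for pt, dist in zip(p_t, topic_word)]
    for w in words
}

for w in words:
    print(w, "saliency:", round(saliency(p_w[w], p_t_given_w[w], p_t), 3))
for w in topic_word[0]:
    print(w, "relevance to topic 0:", round(relevance(topic_word[0][w], p_w[w]), 3))
```

In this toy model, "phone" is equally likely in both topics, so it has no saliency, while "battery" appears in only one topic and ranks above "phone" in that topic's relevance ordering.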
The M anchor outputs a model object downstream for use with new data. The model object is compatible with the Predict tool.