Use Text Pre-processing to clean up text data:
- Convert words to their roots (in other words, lemmatize).
- Filter out unwanted digits, punctuation, and stop words.
The Text Pre-processing tool has 3 anchors:
- Green input anchor: Use the green input anchor on the top to connect the text data you want to process.
- Gray input anchor: Use the gray input anchor on the bottom to pass in a list of stop words. We recommend using CSV format, but the list can be in any input format as long as the stop words are listed in a single column with 1 word per row.
- Output anchor: Use the output anchor to pass the data you've processed downstream.
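To illustrate the stop-word list format described for the gray input anchor, here is a minimal Python sketch that reads a single-column CSV with 1 word per row. The sample contents and the `load_stop_words` helper are hypothetical and are not part of the tool. Lowercasing the words is also an assumption.

```python
import csv
import io

# Hypothetical file contents: one stop word per row, in a single column.
STOP_WORDS_CSV = "the\nand\nof\n"

def load_stop_words(handle):
    """Return the first column of each non-empty row as a set of lowercase words."""
    words = set()
    for row in csv.reader(handle):
        if row and row[0].strip():
            words.add(row[0].strip().lower())
    return words

# io.StringIO stands in for an open CSV file so the sketch is self-contained.
stop_words = load_stop_words(io.StringIO(STOP_WORDS_CSV))
print(sorted(stop_words))
```

With a real file, you could pass `open("stop_words.csv", newline="")` in place of the `io.StringIO` object. The file name is only an example.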
Configure the Tool
Add a Text Pre-processing tool to the canvas.
Use the green input anchor to connect the Text Pre-processing tool to the text data you want to use in the workflow.
Identify the Language of the data.
Select the Text Field you want to use.
Run the workflow.
The Text Pre-processing tool has some advanced options:
To convert words to their roots, check the box for Convert to Word Root (Lemmatize).
This option transforms derivative words into their root words. For example, the words "running," "ran," and "runs" all become the word "run" after you lemmatize them. That way, when you apply a machine-learning algorithm to analyze the words, the machine is able to recognize that all those words should be grouped together.
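The idea behind lemmatization can be sketched in Python with a toy lookup table. This is only an illustration: `LEMMA_TABLE` and `lemmatize` are hypothetical names, and a real lemmatizer covers far more words and uses grammatical context rather than a fixed table.

```python
# Toy lookup table standing in for a real lemmatizer (assumption: this is
# an illustration only, not how the tool works internally).
LEMMA_TABLE = {"running": "run", "ran": "run", "runs": "run"}

def lemmatize(tokens):
    """Map each token to its root if the table knows it; otherwise keep it unchanged."""
    return [LEMMA_TABLE.get(token.lower(), token) for token in tokens]

print(lemmatize(["running", "ran", "runs", "walk"]))
```

Because all three forms map to the same root, a downstream algorithm that counts or groups words treats them as a single word.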
To remove digits, check the box for Digits. This option removes certain digit tokens (in other words, numbers) from the data. You might want to select this option because numbers can confuse some Natural Language Processing (NLP) algorithms.
To remove punctuation, check the box for Punctuation. This option removes punctuation from the data. You might want to select this option because punctuation can confuse some NLP algorithms. Some punctuation, such as the period in "Mrs.", is kept because it is meaningful.
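As a rough sketch of token-level filtering, the Python snippet below drops tokens that consist only of digits or only of punctuation. It assumes the text has already been split into tokens, and `drop_digits_and_punctuation` is a hypothetical helper. The tool's actual rules may differ.

```python
import string

def drop_digits_and_punctuation(tokens):
    """Drop tokens made up only of digits or only of punctuation characters."""
    kept = []
    for token in tokens:
        if token.isdigit():
            continue  # pure number, such as "3"
        if token and all(ch in string.punctuation for ch in token):
            continue  # pure punctuation, such as "!"
        kept.append(token)
    return kept

tokens = ["Mrs.", "Lee", "bought", "3", "apples", "!"]
print(drop_digits_and_punctuation(tokens))
```

In this sketch, "Mrs." survives because the period is part of a token that also contains letters, so only standalone digit and punctuation tokens are removed.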
To remove stop words, check the box for Stop Words. Some stop words are removed by default. The Text Pre-processing tool uses the spaCy package as the default. spaCy has a different list of stop words for each language. You can see the full list of stop words for each language in the spaCy GitHub repo.
You can also remove stop words that aren't removed by default. Enter the stop words you want to remove in the text field in comma-separated format (in other words, separate each word with a comma and a space, in that order).
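The comma-separated format for custom stop words can be illustrated with a short Python sketch. `parse_stop_words` and `remove_stop_words` are hypothetical helpers, and the case-insensitive matching is an assumption rather than documented tool behavior.

```python
def parse_stop_words(text):
    """Split a comma-separated string into a set of lowercase stop words."""
    return {word.strip().lower() for word in text.split(",") if word.strip()}

def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not a stop word (assumed matching rule)."""
    return [token for token in tokens if token.lower() not in stop_words]

# Words separated by a comma and a space, as the text field expects.
custom = parse_stop_words("please, thanks, regards")
print(remove_stop_words(["Thanks", "for", "the", "update"], custom))
```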
In the results grid, the tool creates a new column in the data. The new column's name is the name of the column you processed plus the suffix "_processed".
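The naming rule for the output column can be sketched in Python as follows. The `add_processed_column` helper, the `Review` field name, and the use of `str.lower` as a stand-in for the real processing are all hypothetical.

```python
def add_processed_column(rows, field, process):
    """Return copies of the rows with an extra '<field>_processed' column."""
    new_name = f"{field}_processed"
    return [{**row, new_name: process(row[field])} for row in rows]

rows = [{"Review": "Great product!"}]
# str.lower stands in for the tool's actual text processing.
result = add_processed_column(rows, "Review", str.lower)
print(result)
```

The original column is left in place, and the processed text appears alongside it under the new name.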