Text Pre-processing Tool
Use Text Pre-processing to clean up text data:
- Convert words to their roots (in other words, lemmatize).
- Filter out unwanted digits, punctuation, and stop words.
The Text Pre-processing tool has two anchors:
- Input anchor: Use the input anchor to connect the text data you want to process.
- Output anchor: Use the output anchor to pass the data you've processed downstream.
Configure the Tool
Add a Text Pre-processing tool to the canvas.
Use the input anchor to connect the Text Pre-processing tool to the text data you want to use in the workflow.
Identify the Language of the data.
Select the Text Field you want to use.
Run the workflow.
The Text Pre-processing tool also has some advanced options:
To convert words to their roots, check the box for Convert to Word Root (Lemmatize).
This option transforms derivative words into their root words. For example, the words "running," "ran," and "runs" all become the word "run" after you lemmatize them. That way, when you apply a machine-learning algorithm to analyze the words, the machine is able to recognize that all those words should be grouped together.
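The grouping effect described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the LEMMAS table and the lemmatize function are hypothetical names, and the hand-written lookup table stands in for the much larger vocabulary a real lemmatizer uses.

```python
# Hypothetical lookup table: a stand-in for a real lemmatizer's vocabulary.
LEMMAS = {"running": "run", "ran": "run", "runs": "run"}


def lemmatize(text: str) -> str:
    """Lowercase each whitespace-separated word and map it to its root if known."""
    words = (word.lower() for word in text.split())
    return " ".join(LEMMAS.get(word, word) for word in words)


# All three derivative forms collapse to the same root word.
print(lemmatize("running ran runs"))
```

Because every form maps to one root, a downstream algorithm that counts words sees a single entry instead of three.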
To remove digits, check the box for Digits. This option removes digit tokens (in other words, numbers) from the data. You might want to select this option because numbers can confuse some Natural Language Processing (NLP) algorithms.
To remove punctuation, check the box for Punctuation. This option removes certain punctuation tokens from the data. You might want to select this option because punctuation can confuse some NLP algorithms. Some punctuation, such as the period in "Mrs.", is kept because it is meaningful.
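As a rough illustration of what the Digits and Punctuation options do, the sketch below filters a sentence with the Python standard library only. It is not the tool's actual logic: the ABBREVIATIONS set and the function name are assumptions made for the example, and the rule for deciding which punctuation is meaningful is deliberately simplistic.

```python
import string

# Hypothetical list of abbreviations whose period is meaningful and is kept.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}


def drop_digits_and_punctuation(text: str) -> str:
    """Remove all-digit tokens and punctuation at the edges of each word."""
    kept = []
    for word in text.split():
        if word in ABBREVIATIONS:
            kept.append(word)  # keep the abbreviation and its period
            continue
        stripped = word.strip(string.punctuation)
        if not stripped or stripped.isdigit():
            continue  # the token was only punctuation or only digits
        kept.append(stripped)
    return " ".join(kept)


print(drop_digits_and_punctuation("Mrs. Smith bought 3 apples, 12 pears!"))
```

Note that this sketch strips punctuation only from the start and end of a word, so an apostrophe inside a word such as "don't" is left alone.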
To remove stop words, check the box for Stop Words. The tool removes a default set of stop words, which comes from the spaCy package. spaCy keeps a separate list of stop words for each language. You can see the full list of stop words for each language in the spaCy GitHub repo.
You can also remove stop words that aren't removed by default. Enter the stop words you want to remove in the text field. Enter them in comma-separated format (in other words, separate each word with a comma and a space, in that order).
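The way default and custom stop words combine can be sketched as follows. This is a minimal example under stated assumptions: DEFAULT_STOP_WORDS is a tiny illustrative subset rather than spaCy's actual list, and both function names are hypothetical.

```python
# Tiny illustrative subset; NOT the real default list from spaCy.
DEFAULT_STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}


def parse_custom_stop_words(raw: str) -> set[str]:
    """Split a comma-separated entry such as "cat, like" into a set of words."""
    return {word.strip().lower() for word in raw.split(",") if word.strip()}


def remove_stop_words(text: str, custom: str = "") -> str:
    """Drop every word found in the default list or in the custom entry."""
    stop_words = DEFAULT_STOP_WORDS | parse_custom_stop_words(custom)
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


print(remove_stop_words("The cat and the dog like to play", custom="cat, like"))
```

Here "cat" and "like" are removed only because they appear in the custom entry; the other dropped words come from the default set.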
In the results grid, the tool creates a new column in the data. The new column has the name of the column you processed plus the suffix "_processed".
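The naming convention for the output column can be mimicked in plain Python. In this sketch the rows are dictionaries, and add_processed_column and the "review" field are hypothetical names chosen for the example; the original column is kept and the processed text goes in a new one.

```python
from typing import Callable


def add_processed_column(
    rows: list[dict], field: str, process: Callable[[str], str]
) -> list[dict]:
    """Return copies of the rows with an extra "<field>_processed" column."""
    return [{**row, f"{field}_processed": process(row[field])} for row in rows]


rows = [{"review": "Great Product"}]
# str.lower stands in for whichever cleaning steps were selected.
print(add_processed_column(rows, "review", str.lower))
```

Keeping the original column alongside the processed one makes it easy to compare the text before and after cleaning.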